ITS836 GGA Week 4 The Concept of Advanced Clustering Algorithm Paper Upon completion of the week 4 lesson, you will be able to: Explain the concept of Adv

ITS836 GGA Week 4 The Concept of Advanced Clustering Algorithm Paper Upon completion of the week 4 lesson, you will be able to:

Explain the concept of Advanced Clustering algorithm
Discuss Applications of Clustering algorithms
Evaluation of Clustering algorithms

This week reading assignment is the course textbook chapter 4 (EMC Education Service (Eds). (2015) Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing, and Presenting Data, Indianapolis, IN: John Wiley & Sons, Inc).

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Let’s understand this with an example. Suppose, you are the head of a rental store and wish to understand preferences of your costumers to scale up your business. Is it possible for you to look at details of each costumer and devise a unique business strategy for each one of them? Definitely not. But, what you can do is to cluster all of your costumers into say 10 groups based on their purchasing habits and use a separate strategy for costumers in each of these 10 groups. And this is what we call clustering.

Read Chapter 4 on Clustering Methods in Textbook:

EMC Education Service (Eds). (2015) Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing, and Presenting Data, Indianapolis, IN: John Wiley & Sons, Inc.

Also I recommend reading up on the Clustering methods from the following book:

An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani,

http://www-bcf.usc.edu/~gareth/ISL/index.html School of Computer &
Information Sciences
ITS 836 Data Science and Big Data Analytics
ITS 836
1
Week 4 What is ML?
ITS 836
2
Supervised vs. Unsupervised Learning
• Supervised Learning: both X and Y
are known
• Unsupervised Learning: only X
Supervised Learning
ITS 836
Unsupervised Learning
3
ML Algorithms – Applications
ITS 836
4
Summary of Algorithms Studied
Categorization (unsupervised)
1.K-means clustering
2. Association Rules
Regression – supervised)
3. Linear
4. Logistic
Classification supervised
5.Naïve Bayesian
classifier
6. Decision Trees
7. Time Series Analysis
8. Text Analysis
ITS 836
5
• Clustering
Clustering
– set of techniques for finding subgroups, clusters, in a data set.
– A good clustering is one when the observations within a group
are similar but between groups are very different
• For example
– suppose we collect p measurements on each of n breast cancer
patients.
– may be different unknown types of cancer which we could
discover by clustering the data
ITS 836
6
Clustering Methods
• Different types of
clustering methods
• most commonly
used approaches
To perform K-means clustering:
• Specify the desired number of clusters K
• K-means algorithm assigns each
observation to exactly one of the K clusters
– K-Means Clustering
– Hierarchical
Clustering
ITS 836
7
– it produces a tree based
representation of the observations,
10. Unsupervised Learning
called a Dendogram
398
9
0.5
2.5
9
2
7
8
3
6
−1.5
7
5
8
6
1
5
2
−1.0
−0.5
X2
0.0
2.0
1.5
1.0
0.5
– Hierarchical
Clustering
0.0
• choose the number
of clusters.
3.0
– K-Means Clustering
3
• Most commonly
used approaches
• Hierarchical Clustering is an
alternative to k-Means
4
Clustering Methods
1
4
−1.5
−1.0
−0.5
0.0
0.5
1.0
X1
F I G U R E 10.10. An illustrati on of how to proper ly interpret a dendrogram with
ni ne obser vations in two-dimensi onal space. Left : A dendrogram generated usi ng
Euclidean distance and complete linkage. Observations 5 and 7 are qui te simi lar
to each other, as are obser vati ons 1 and 6. However, observation 9 is no more
similar t o observation 2 than it is to observati ons 8, 5, and 7, even though obser vations
ITS 8369 and 2 are close together in terms of hor izontal di stance. T hi s is because8
observations 2, 8, 5, and 7 al l fuse with obser vation 9 at the same height, approx-
K-Means Algorithm
• Initial Step: Randomly assign each observation to one of
K clusters
• Iterate until the cluster assignments stop changing:
– For each of the K clusters, compute the cluster centroid. The kth
cluster centroid if the mean of the observations assigned to the
kth cluster
– Assign each observation to the cluster whose centroid is closest
(where “closest” is defined using Euclidean distance.
ITS 836
9
the K-Means
Algorithm
Random
Assignment of
points
Compute cluster
centers from initial
assignments
Computer
new cluster
centers
Assign points to
closest cluster
center
Now there is
no further
change so
stop
ITS 836
10
Local Optimums
Bad
Solution
• The K-means
algorithm can get
stuck in “local
optimums” and not
find the best solution
• Hence, it is important
to run the algorithm
multiple times with
random starting
points to find a good
solution
ITS 836
Good
Solution
11
Hierarchical Clustering – Dendograms
398
10. Unsupervised Learning
• First join closest points (5 and
7)
• Height of fusing/merging (on
vertical axis) indicates how
similar the points are
• After the points are fused
they are treated as a single
F I G U R E 10.10. An illustrati on of how to proper ly interpret a dendrogram with
observation and the
ni ne obser vations in two-dimensi onal space. Left : A dendrogram generated usi ng
Euclidean distance and complete linkage. Observations 5 and 7 are qui te simi lar
algorithm continues
to each other, as are obser vati ons 1 and 6. However, observation 9 is no more
7
8
3
6
−1.5
7
5
6
8
4
1
0.0
5
2
−1.0
−0.5
2
3
0.5
1.0
9
1.5
X2
0.0
2.0
2.5
0.5
3.0
9
1
4
−1.5
−1.0
−0.5
0.0
0.5
1.0
X1
similar t o observation 2 than it is to observati ons 8, 5, and 7, even though obser vations 9 and 2 are close together in terms of hor izontal di stance. T hi s is because
observations 2, 8, 5, and 7 al l fuse with obser vation 9 at the same height, approximately 1.8. Right : T he raw data used to generate the dendrogram can be used to
confirm that indeed, observation 9 is no more similar to observation 2 than it i s
836vati ons 8, 5, and 7.
12
toITS
obser
396
10. Unsupervised Learning
2
−2
0

Each “leaf” of the dendogram represents
one of the 45 observations
At the bottom of the dendogram, each
observation is a distinct leaf. However, as
we move up the tree, some leaves begin
to fuse. These correspond to observations
that are similar to each other.
F I G U R E 10.8. 45 observati ons generated in 2-di mensional space. I n reali ty
As we move higher up the tree, an
are three distinct classes, shown i n separate colors. However, we wi l l treat
increasing number of observations have there
these class labels as unknown and wil l seek to cluster the observations in order to
fused. The earlier (lower in the tree) two di scover the classes from the data.
observations fuse, the more similar they t hree-class model; t he t rue class labels for each observat ion are shown in
are to each other.
dist inct colors. However, suppose t hat t he dat a were observed wit hout t he
class labels, and t hat we want ed t o perform hierarchical clust ering of t he
Observations that fuse later are quite
dat a. Hierarchical clust ering (wit h complet e linkage, t o be discussed lat er)
yields t he result shown in t he left -hand panel of Figure 10.9. How can we
different
int erpret t his dendrogram?
X2

4
Interpretation
−6
−4
−2
0
2
X1


In t he left -hand panel of Figure 10.9, each leaf of t he dendrogram represent s one of t he 45 observat ions in Figure 10.8. However, as we move
up t he t ree, some leaves begin t o fuse int o branches. T hese correspond t o
observat ions t hat are similar t o each ot her. As we move higher up t he t ree,
t hemselves fuse, eit her wit h leaves or ot her branches. T he earlier
ITS branches
836
(lower in t he t ree) fusions occur, t he more similar t he groups of observa-
13
Choosing Clusters
• To choose clusters we draw lines across the
dendogram
• We can form any number of clusters depending on
where we draw the break point.
One Cluster
ITS 836
Two Clusters
Three Clusters
14
400
10. Unsupervised Learning
An Example
3
−0.5
X2
5
7
8
X2
7
8
−0.5
0.0
0.0
0.5
9
0.5
9
−1.0
−1.0
2
1
6
−1.5
−1.5
6
4
−0.5
0.0
0.5
−0.5
0.0
X1
9
9
0.5
1.0
0.0
0.0
3
7
8
X2
7
5
−0.5
X2
−1.0
X1
8
−0.5
4
−1.5
1.0
0.5
−1.0
0.5
−1.5
5
3
2
1
−1.0
2
1
1
6
−1.5
6
4
−1.5
−1.0
5
3
2
−1.0
Start with 9 clusters
Fuse 5 and 7
Fuse 6 and 1
Fuse the (5,7) cluster
with 8.
• Continue until all
observations are
fused.
−1.5




−0.5
0.0
X1
0.5
1.0
4
−1.5
−1.0
−0.5
0.0
0.5
1.0
X1
F I G U R E 10.11. An i llustration of the first few steps of the hi erarchi cal
clusteri ng algori thm, usi ng the data from Figure 10.10, wi th complete linkage
ITS
836
and Euclidean distance. Top Left : I nitial ly, there are ni ne di sti nct clusters,
15
Example – Online Retailer
The unique selling point of this retailer is that they make the
“returns” simple with an assumption that this policy
encourages use and “frequent customers are more valuable”.
To validate this assumption sample set of customers clustered
on purchase frequency, return rate, and lifetime customer
value (LTV)
Groups:
1: Visit less frequently, low return rate, moderate LTV (LTV- 3rd)
2: Visit often, return a lot of their purchases. (Lowest LTV)
3: Visit often, return things moderately, High LTV (LTV 2nd)
4: Visit rarely, don’t return purchases. (Highest avg LTV)
We apply Clustering to analyze retail data!
LTV: lifetime customer value
ITS 836
16
Text Book Lecture Review
EMC Education Service (Eds). (2015) Data Science
and Big Data Analytics: Discovering, Analyzing,
Visualizing, and Presenting Data, Indianapolis, IN:
John Wiley & Sons, Inc.
4.1 Overview of Clustering
• Clustering is the use of unsupervised techniques for
grouping similar objects
– Supervised methods use labeled objects
– Unsupervised methods use unlabeled objects
• Clustering looks for hidden structure in the data,
similarities based on attributes
– Often used for exploratory analysis
– No predictions are made
ITS 836
18
4.2 K-means Algorithm
• Given a collection of objects
each with n measurable
attributes and a chosen value k
of the number of clusters, the
algorithm identifies the k
clusters of objects based on the
objects proximity to the centers
of the k groups.
• The algorithm is iterative with
the centers adjusted to the
mean of each cluster’s ndimensional vector of attributes


Clustering is often used as a lead-in to
classification, where labels are applied to
the identified clusters
Some applications
– Image processing
• With security images, successive frames are
examined for change
– Medical
• Patients can be grouped to identify naturally
occurring clusters
– Customer segmentation
• Marketing and sales groups identify
customers having similar behaviors and
spending patterns
ITS 836
19
4.2.2 Overview of the Method
1.
2.
3.
4.
Four Steps
Choose the value of k and the
initial guesses for the centroids
Compute the distance from each
data point to each centroid, and
assign each point to the closest
centroid
Compute the centroid of each
newly defined cluster from step 2
Repeat steps 2 and 3 until the
algorithm converges (no changes
occur)
ITS 836
Step 1: Set k = 3 and initial
clusters centers
20
4.2.2 Overview of the Method
1.
2.
3.
4.
Four Steps
Choose the value of k and the
initial guesses for the centroids
Compute the distance from each
data point to each centroid, and
assign each point to the closest
centroid
Compute the centroid of each
newly defined cluster from step 2
Repeat steps 2 and 3 until the
algorithm converges (no changes
occur)
ITS 836
Step 2: Points are assigned to
the closest centroid
21
4.2.2 Overview of the Method
1.
2.
3.
4.
Four Steps
Choose the value of k and the
initial guesses for the centroids
Compute the distance from each
data point to each centroid, and
assign each point to the closest
centroid
Compute the centroid of each
newly defined cluster from step 2
Repeat steps 2 and 3 until the
algorithm converges (no changes
occur)
ITS 836
Step 2: Compute Centroid
22
4.2.2 Overview of the Method
Four Steps
1. Choose the value of k and the initial
• Step 4: Repeat steps 2 and 3 until
guesses for the centroids
convergence
• Convergence occurs when the
2. Compute the distance from each data
centroids do not change or when the
point to each centroid, and assign each
centroids oscillate back and forth
point to the closest centroid
• This can occur when one or more
3. Compute the centroid of each newly
points have equal distances from the
defined cluster from step 2
centroid centers
4. Repeat steps 2 and 3 until the algorithm
converges (no changes occur)
ITS 836
23
4.2.3 Determining Number of Clusters



Reasonable guess
Predefined requirement
Use heuristic – e.g., Within Sum of
Squares (WSS)
Example of WSS vs #Clusters curve
– WSS metric is the sum of the squares
of the distances between each data
point and the closest centroid
– The process of identifying the
appropriate value of k is referred to as
finding the “elbow” of the WSS curve
The elbow of the curve appears to occur at k = 3.
ITS 836
24
4.2.4 Diagnostics
• When the number of clusters
is small, plotting the data
helps refine the choice of k
• The following questions
should be considered



Example of distinct clusters
Are the clusters well separated from each other?
Do any of the clusters have only a few points
Do any of the centroids appear to be too close to
each other?
ITS 836
25
4.2.4 Diagnostics
Example of less obvious clusters
Six clusters from points of less
obvious cluster
ITS 836
26
4.2.5 Reasons to Choose and Cautions
• Decisions the practitioner must
make
– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
• Important to understand what
attributes will be known at the
time a new object is assigned to
a cluster


ITS 836
E.g., customer satisfaction may be available
for modeling but not available for potential
customers
Best to reduce number of attributes when
possible



Too many attributes minimize the impact of
key variables
Identify highly correlated attributes for
reduction
Combine several attributes into one: e.g.,
debt/asset ratio
27
4.2.5 Reasons to Choose and CautionsObject attributes: scatterplot matrix for
seven attributes
• Decisions the practitioner must
make
– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
ITS 836
28
4.2.5 Reasons to Choose and Cautions
• Decisions the practitioner must
make
– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
ITS 836
• K-means algorithm will identify
different clusters depending on
the units of measure
29
4.2.5 Reasons to Choose and Cautions
• Decisions the practitioner must
make
– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
ITS 836
• K-means algorithm will identify
different clusters depending on
the units of measure
Age dominates
k=2
30
4.2.5 Reasons to Choose and Cautions


• Decisions the practitioner must
make
– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
ITS 836
Rescaling can reduce domination effect
E.g., divide each variable by the
appropriate standard deviation
Rescaled attributes
31
4.2.5 Reasons to Choose and Cautions
• Decisions the practitioner must
make

– What object attributes should be
included in the analysis?
– What unit of measure should be
used for each attribute?
– Do the attributes need to be
rescaled?
– What other considerations might
apply?
ITS 836


K-means sensitive to starting seeds
– Important to rerun with several
seeds – R has the nstart option
Could explore distance metrics other
than Euclidean
– E.g., Manhattan, Mahalanobis,
etc.
K-means is easily applied to numeric
data and does not work well with
nominal attributes
– E.g., color
32
Reasons
to Chose
and
Cautions
ITS 836
33
Questions?
ITS 836
34
School of Computer &
Information Sciences
ITS 836 Data Science and Big Data Analytics
1
Lecture 04: Clustering – Homework
Homework 1: Review the Iris Data set perform
clustering via “kmeans” (see next slides)
Homework 2: Perform hierarchical clustering on
the iris data set: https://cran.rproject.org/web/packages/dendextend/vignett
es/Cluster_Analysis.html
Homework 3: Use the data set USArrests, to
cluster the US States according to
https://uc-r.github.io/kmeans_clustering
Deliverable should be a powerpoint or
Homework 4: Review Section 4_2R Exercise
Word, Use existing homework template
Homework 5 Clustering on a data set
With your name, ID and date.
Homework 6: Continue R for Datascience
ITS 836
2
exercises
Iris Dataset Source
Goal: Predict the types of iris in Hawaii
R.A. Fisher, 1936
• Attributes: sepal length, sepal width, petal length, petal width
– All flowers contain a sepal and a petal
– For the iris flowers three categories (Versicolor, Setosa, Virginica) different measurements
ITS 836
3
View Iris Data Set
Iris data comes with R install
str(iris)
‘data.frame’:
150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
$ Species : Factor w/ 3 levels “setosa”,”versicolor”,..: 1 1 1 1 1 1 1 1 1 1 …
Species attribute: division of the species of flowers is 50-50-50.
table(iris$Species)
setosa versicolor virginica
50
50
50
ITS 836
4
Visualize iris data
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y =
Sepal.Width, col = Species)) + geom_point()
• Observe:
– High correlation between
the sepal length and the
sepal width of the Setosa
iris
– Lesser correlation for
Virginica and Versicolor
• the data points are more
spread out over the graph
and don’t form a cluster
like you can see in the case
of the Setosa flowers.
ITS 836
5
Visualize iris data
ggplot(data = iris, aes(x = Petal.Length, y =
Petal.Width, col = Species)) + geom_point()
• Positive correlation:
– between the petal
length and the petal
width for all species
ITS 836
6
correlations
library(GGally)
ggpairs(iris)
• As shown correlation
between Petal Width &:
– Petal Length (0.963)
– Sepal Length (0.818)
• And b/w Petal Length
– Sepal Length (0.872)
ITS 836
7
Lecture 4 Homework 1:
Clustering with k-means
head(iris)
#remove last column
iris_2
Purchase answer to see full
attachment