ITS836 GGA Week 4: The Concept of Advanced Clustering Algorithms Paper

Upon completion of the Week 4 lesson, you will be able to:

Explain the concept of advanced clustering algorithms

Discuss applications of clustering algorithms

Evaluate clustering algorithms

This week's reading assignment is chapter 4 of the course textbook: EMC Education Services (Ed.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Indianapolis, IN: John Wiley & Sons.

Clustering is the task of dividing a population of data points into groups such that the points in each group are more similar to one another than to points in other groups. In simple terms, the aim is to segregate groups with similar traits and assign them to clusters.

Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits, and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
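A minimal R sketch of this idea (R is the language used throughout this course); the matrix `purchases` is a hypothetical customer-by-habit numeric matrix, not data from the text:

# Group customers into 10 segments by purchasing habits (illustrative only;
# `purchases` is a hypothetical matrix: one row per customer, one column per habit measure)
km <- kmeans(purchases, centers = 10, nstart = 25)
km$cluster  # the group label assigned to each customer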

Read Chapter 4 on clustering methods in the textbook:

EMC Education Services (Ed.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Indianapolis, IN: John Wiley & Sons.

I also recommend reading the clustering material in the following book:

James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning with Applications in R. http://www-bcf.usc.edu/~gareth/ISL/index.html

School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics

Week 4: What is ML?

Supervised vs. Unsupervised Learning
• Supervised Learning: both X and Y are known
• Unsupervised Learning: only X is known
[Figure: side-by-side illustrations of supervised and unsupervised learning.]

ML Algorithms – Applications
[Figure: applications of ML algorithms.]

Summary of Algorithms Studied

Categorization (unsupervised):
1. K-means clustering
2. Association Rules

Regression (supervised):
3. Linear
4. Logistic

Classification (supervised):
5. Naïve Bayesian classifier
6. Decision Trees

7. Time Series Analysis
8. Text Analysis

Clustering
• Clustering is a set of techniques for finding subgroups, or clusters, in a data set.
• A good clustering is one in which the observations within a group are similar, but observations between groups are very different.
• For example:
  – Suppose we collect p measurements on each of n breast cancer patients.
  – There may be different, unknown types of cancer which we could discover by clustering the data.

Clustering Methods
• There are different types of clustering methods; the most commonly used approaches are:
  – K-Means Clustering
  – Hierarchical Clustering
• To perform K-means clustering:
  – Specify the desired number of clusters K.
  – The K-means algorithm assigns each observation to exactly one of the K clusters.

Clustering Methods
• Most commonly used approaches:
  – K-Means Clustering: choose the number of clusters K up front.
  – Hierarchical Clustering: an alternative to K-means that produces a tree-based representation of the observations, called a dendrogram.

[Figure 10.10 from ISL: left, a dendrogram of nine observations generated using Euclidean distance and complete linkage; right, the nine observations plotted in two-dimensional space (X1, X2). The full caption is repeated on the Dendrograms slide below.]

K-Means Algorithm
• Initial step: randomly assign each observation to one of the K clusters.
• Iterate until the cluster assignments stop changing:
  – For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the mean of the observations assigned to the kth cluster.
  – Assign each observation to the cluster whose centroid is closest (where "closest" is defined using Euclidean distance).
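A minimal from-scratch sketch of this iteration in R (illustrative only, not the textbook's implementation; it assumes a numeric matrix x and does not handle the corner case of a cluster becoming empty):

simple_kmeans <- function(x, K, max_iter = 100) {
  cl <- sample(1:K, nrow(x), replace = TRUE)   # initial step: random assignment
  for (i in 1:max_iter) {
    # centroid of each cluster = mean of the observations assigned to it
    centroids <- sapply(1:K, function(k) colMeans(x[cl == k, , drop = FALSE]))
    # squared Euclidean distance from every point to every centroid
    d <- apply(x, 1, function(pt) colSums((centroids - pt)^2))
    new_cl <- apply(d, 2, which.min)           # reassign each point to the closest centroid
    if (all(new_cl == cl)) break               # assignments stopped changing: converged
    cl <- new_cl
  }
  cl
}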

The K-Means Algorithm (illustrated)
[Figure: random assignment of points; compute cluster centers from the initial assignments; assign points to the closest cluster center; compute new cluster centers; repeat until there is no further change, then stop.]

Local Optimums
• The K-means algorithm can get stuck in "local optimums" and not find the best solution.
• Hence, it is important to run the algorithm multiple times with random starting points to find a good solution.
[Figure: a bad solution versus a good solution obtained from different random starts.]
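A minimal sketch of guarding against local optima with base R's kmeans(), whose nstart argument repeats the algorithm from several random starts and keeps the run with the lowest total within-cluster sum of squares; the simulated data here are illustrative:

set.seed(4)
x <- matrix(rnorm(100 * 2), ncol = 2)        # 100 simulated 2-D points
x[1:50, 1] <- x[1:50, 1] + 3                 # shift half of them to create structure
km1  <- kmeans(x, centers = 3, nstart = 1)   # one random start: may be a bad solution
km20 <- kmeans(x, centers = 3, nstart = 20)  # best of 20 random starts
c(km1$tot.withinss, km20$tot.withinss)       # compare total WSS; more starts usually finds a lower value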

Hierarchical Clustering – Dendrograms
• First join the closest points (5 and 7).
• The height of fusing/merging (on the vertical axis) indicates how similar the points are.
• After the points are fused they are treated as a single observation, and the algorithm continues.

[Figure 10.10 from ISL: an illustration of how to properly interpret a dendrogram with nine observations in two-dimensional space. Left: a dendrogram generated using Euclidean distance and complete linkage. Observations 5 and 7 are quite similar to each other, as are observations 1 and 6. However, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, even though observations 9 and 2 are close together in terms of horizontal distance; this is because observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approximately 1.8. Right: the raw data used to generate the dendrogram confirm that observation 9 is indeed no more similar to observation 2 than it is to observations 8, 5, and 7.]

Interpretation
• Each "leaf" of the dendrogram represents one of the 45 observations.
• At the bottom of the dendrogram, each observation is a distinct leaf. However, as we move up the tree, some leaves begin to fuse into branches. These correspond to observations that are similar to each other.
• As we move higher up the tree, an increasing number of observations have fused; branches themselves fuse, either with leaves or with other branches. The earlier (lower in the tree) two observations fuse, the more similar they are to each other.
• Observations that fuse later are quite different.

[Figure 10.8 from ISL: 45 observations generated in two-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we treat these class labels as unknown and seek to cluster the observations in order to discover the classes from the data; hierarchical clustering with complete linkage yields the dendrogram in Figure 10.9.]

Choosing Clusters
• To choose clusters, we draw horizontal lines across the dendrogram.
• We can form any number of clusters depending on where we draw the break point.
[Figure: the same dendrogram cut at three heights, giving one, two, or three clusters.]
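A minimal sketch of this in R, using base R's hclust() and cutree() on small simulated data (the data values are illustrative):

set.seed(10)
x <- matrix(rnorm(18), ncol = 2)             # nine simulated 2-D observations
hc <- hclust(dist(x), method = "complete")   # Euclidean distance, complete linkage
plot(hc)                                     # draw the dendrogram
cutree(hc, k = 2)                            # cut to form two clusters
cutree(hc, k = 3)                            # ...or three clusters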

An Example
• Start with 9 clusters.
• Fuse 5 and 7.
• Fuse 6 and 1.
• Fuse the (5, 7) cluster with 8.
• Continue until all observations are fused.

[Figure 10.11 from ISL: an illustration of the first few steps of the hierarchical clustering algorithm, using the data from Figure 10.10, with complete linkage and Euclidean distance. Top left: initially, there are nine distinct clusters.]
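The fusion order listed above can be read directly from a fitted hclust object; a small sketch (x as in the previous code block, or any nine-row numeric matrix):

hc <- hclust(dist(x), method = "complete")
hc$merge    # one row per fusion; negative entries are original observations
hc$height   # the height (dissimilarity) at which each fusion occurred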

Example – Online Retailer
The unique selling point of this retailer is that they make "returns" simple, on the assumption that this policy encourages use and that "frequent customers are more valuable." To validate this assumption, a sample set of customers was clustered on purchase frequency, return rate, and lifetime customer value (LTV).

Groups:
1. Visit less frequently, low return rate, moderate LTV (3rd-highest LTV)
2. Visit often, return a lot of their purchases (lowest LTV)
3. Visit often, return things moderately, high LTV (2nd-highest LTV)
4. Visit rarely, don't return purchases (highest average LTV)

We apply clustering to analyze retail data!

Textbook Lecture Review
EMC Education Services (Ed.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Indianapolis, IN: John Wiley & Sons.

4.1 Overview of Clustering
• Clustering is the use of unsupervised techniques for grouping similar objects.
  – Supervised methods use labeled objects.
  – Unsupervised methods use unlabeled objects.
• Clustering looks for hidden structure in the data: similarities based on attributes.
  – Often used for exploratory analysis.
  – No predictions are made.

4.2 K-means Algorithm
• Given a collection of objects, each with n measurable attributes, and a chosen value k for the number of clusters, the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups.
• The algorithm is iterative, with the centers adjusted to the mean of each cluster's n-dimensional vector of attributes.
• Clustering is often used as a lead-in to classification, where labels are applied to the identified clusters.
• Some applications:
  – Image processing: with security images, successive frames are examined for change.
  – Medical: patients can be grouped to identify naturally occurring clusters.
  – Customer segmentation: marketing and sales groups identify customers having similar behaviors and spending patterns.

4.2.2 Overview of the Method
Four steps:
1. Choose the value of k and the initial guesses for the centroids.
2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid.
3. Compute the centroid of each newly defined cluster from step 2.
4. Repeat steps 2 and 3 until the algorithm converges (no changes occur).

[Figure: Step 1 – set k = 3 and choose the initial cluster centers.]

[Figure: Step 2 – points are assigned to the closest centroid.]

[Figure: Step 3 – compute the centroid of each newly defined cluster.]

• Step 4: repeat steps 2 and 3 until convergence.
  – Convergence occurs when the centroids do not change or when the centroids oscillate back and forth.
  – This can occur when one or more points have equal distances from the centroid centers.

4.2.3 Determining the Number of Clusters
• Reasonable guess.
• Predefined requirement.
• Use a heuristic, e.g., the Within Sum of Squares (WSS):
  – The WSS metric is the sum of the squares of the distances between each data point and the closest centroid.
  – The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve.

[Figure: example of a WSS vs. number-of-clusters curve; the elbow of the curve appears to occur at k = 3.]
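A minimal sketch of the elbow heuristic in R, computed here on the four numeric iris columns (this choice of data is illustrative, not from the textbook):

wss <- sapply(1:10, function(k)
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")  # look for the elbow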

4.2.4 Diagnostics
• When the number of clusters is small, plotting the data helps refine the choice of k.
• The following questions should be considered:
  – Are the clusters well separated from each other?
  – Do any of the clusters have only a few points?
  – Do any of the centroids appear to be too close to each other?

[Figure: example of distinct clusters.]
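A small sketch of such a diagnostic plot (assuming a two-column numeric matrix x and a fitted kmeans object km, as in the earlier examples):

plot(x, col = km$cluster, pch = 19)   # color each point by its assigned cluster
points(km$centers, pch = 4, cex = 2)  # mark the cluster centroids with crosses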

4.2.4 Diagnostics (continued)
[Figures: an example of less obvious clusters, and the six clusters found from those points.]

4.2.5 Reasons to Choose and Cautions
• Decisions the practitioner must make:
  – What object attributes should be included in the analysis?
  – What unit of measure should be used for each attribute?
  – Do the attributes need to be rescaled?
  – What other considerations might apply?
• It is important to understand which attributes will be known at the time a new object is assigned to a cluster.
  – E.g., customer satisfaction may be available for modeling but not available for potential customers.
• It is best to reduce the number of attributes when possible.
  – Too many attributes minimize the impact of key variables.
  – Identify highly correlated attributes for reduction.
  – Combine several attributes into one, e.g., a debt/asset ratio.

4.2.5 Reasons to Choose and Cautions (continued)
• Object attributes: [Figure: scatterplot matrix for seven attributes.]

• Units of measure: the K-means algorithm will identify different clusters depending on the units of measure.
[Figure: with unscaled attributes, age dominates the clustering at k = 2.]

• Rescaling: rescaling can reduce the domination effect, e.g., divide each variable by the appropriate standard deviation.
[Figure: clusters found on the rescaled attributes.]
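A minimal sketch of that rescaling step in R (df is a hypothetical data frame of numeric attributes such as age and income, not data from the text):

df_rescaled <- scale(df, center = FALSE, scale = apply(df, 2, sd))  # divide each column by its sd
km <- kmeans(df_rescaled, centers = 2, nstart = 25)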

• K-means is sensitive to the starting seeds.
  – It is important to rerun with several seeds; R's kmeans() has the nstart option.
• Distance metrics other than Euclidean could be explored, e.g., Manhattan, Mahalanobis, etc.
• K-means is easily applied to numeric data and does not work well with nominal attributes, e.g., color.
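Base R's kmeans() works with Euclidean distance; one way to experiment with another metric is to pair a dist() matrix with hierarchical clustering. A small sketch (x as in the earlier simulated examples):

d_man <- dist(x, method = "manhattan")         # Manhattan (city-block) distances
hc_man <- hclust(d_man, method = "complete")   # cluster on the alternative metric
plot(hc_man)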

Reasons to Choose and Cautions (summary slide)

Questions?


School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics

Lecture 04: Clustering – Homework
Homework 1: Review the iris data set and perform clustering via "kmeans" (see the next slides).
Homework 2: Perform hierarchical clustering on the iris data set: https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
Homework 3: Use the USArrests data set to cluster the US states, following https://uc-r.github.io/kmeans_clustering
Homework 4: Review the Section 4.2 R exercise.
Homework 5: Clustering on a data set.
Homework 6: Continue the R for Data Science exercises.

The deliverable should be a PowerPoint or Word document; use the existing homework template, with your name, ID, and date.

Iris Dataset Source
Goal: predict the species of iris (R.A. Fisher, 1936).
• Attributes: sepal length, sepal width, petal length, petal width.
  – All flowers contain a sepal and a petal.
  – The three iris categories (Versicolor, Setosa, Virginica) have different measurements.

View the Iris Data Set
The iris data comes with the R installation.

str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Species attribute: the division among the species of flowers is 50-50-50.

table(iris$Species)
    setosa versicolor  virginica
        50         50         50

Visualize the iris data

library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) + geom_point()

• Observe:
  – High correlation between the sepal length and the sepal width of the Setosa iris.
  – Lesser correlation for Virginica and Versicolor: their data points are more spread out over the graph and don't form a cluster like the one you can see for the Setosa flowers.

Visualize the iris data

ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) + geom_point()

• Positive correlation between the petal length and the petal width for all species.

Correlations

library(GGally)
ggpairs(iris)

• As shown, correlation between Petal.Width and:
  – Petal.Length (0.963)
  – Sepal.Length (0.818)
• And between Petal.Length and Sepal.Length (0.872)

Lecture 4 Homework 1: Clustering with k-means

head(iris)
# remove the last column (Species) so only the four numeric measurements remain
iris_2 <- iris[, -5]

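A minimal sketch of how the homework's k-means step might then proceed (choosing k = 3 to match the three species, and the seed value, are illustrative assumptions):

set.seed(20)
km <- kmeans(iris_2, centers = 3, nstart = 20)
table(km$cluster, iris$Species)  # compare the found clusters with the true species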