“Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach and Vipin Kumar was recommended to me as a great book for learning about Data Mining. I read the first chapter and it felt very inspiring, as though it was beckoning me to keep at it and read more.
I thought I would capture notes as I read. The way I usually capture notes is to make a list of all the headings and go over them once I finish a chapter, or at a later time. The brain just needs a key to fetch information, and the headings serve as good keys.
The best thing about the book is that sample chapters and lecture notes (ppt) are available on the publisher’s website.
 Main Page : http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php
 Power Point slides : http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php#item4
 Sample Chapters : http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php#item2
Chapter 1 - Introduction
Data Mining : Nontrivial extraction of implicit, previously unknown and potentially useful information from data. Exploration & analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns.
 1 Introduction
 Fields  business, medicine, science & engineering
 1.1 What is Data Mining
 Different from Information Retrieval
 Integral part of KDD(Knowledge discovery in databases)
 1.2 Motivating Challenges
 Scalability, High Dimensionality, Heterogeneous and complex data, Data Ownership and distribution, Nontraditional analysis
 1.3 The Origins of Data Mining
 Relates to fields of Statistics, AI, Machine Learning & Pattern Recognition
 Built on Database technology, Parallel Computing, Distributed computing
 1.4 Data Mining Tasks
 Predictive tasks, Descriptive tasks
 Predictive modeling  classification (discrete target variables), regression (continuous target variables)
 Association analysis
 Cluster analysis
 Anomaly detection
 1.5 Scope and Organization of the Book
Ref : Introduction to Data Mining  Chapter 1.
http://wwwusers.cs.umn.edu/~kumar/dmbook/dmslides/chap1_intro.pdf
Chapter 2 - Data
Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object. A collection of attributes describe an object.
 2 Data

2.1. Types of data
 2.1.1. attributes & measurement
 operations used to describe attributes  distinctness, order, addition, multiplication
 4 types of attributes  nominal, ordinal, interval, ratio
 Nominal and Ordinal are Categorical (Qualitative)
 Interval and Ratio are Numeric (Quantitative)
 Nominal (ids, zip codes) → distinctness
 Ordinal (rankings, grades) → distinctness + order
 Interval (dates, temp in C) → distinctness + order + addition
 Ratio (temp in K, length, time) → distinctness + order + addition + multiplication
 attribute categorization by values → discrete, continuous
 2.1.2. Types of data sets
 Characteristics of data sets → dimensionality, sparsity, resolution
 Types → record, graph, ordered
 Record data : a. record data, b. transaction data, c. data matrix, d. document-term matrix
 Graph based data : a. data with relationship among objects (www), b. data with objects that are graphs (molecular structure)
 Ordered data : a. sequential data (timestamps), b. sequence data (genetic code), c. time series data, d. spatial data

2.2. Data Quality
 2.2.1. Measurement and data quality issues
 Noise & artifacts → precision (std deviation), bias (mean of measurements minus the actual value), accuracy (closeness to true value)
 Outliers
 Missing values → how to solve : eliminate data objects, eliminate missing value attributes, ignore missing value during analysis
 Inconsistent values
 Duplicate data
 2.2.2. Issues related to applications
 timeliness, relevance, knowledge about the data

2.3. Data Preprocessing
 2.3.1. Aggregation
 Purpose : data reduction (a smaller dataset means less memory), change of scale, more ‘stable’ data
 2.3.2. Sampling
 Processing all records is very expensive, but a sample needs to be representative (see the sketch after this list)
 Sampling approaches :
 Simple random sampling
 Sampling without replacement
 Sampling with replacement
 Stratified sampling
 Progressive sampling
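
As a rough illustration of the sampling approaches listed above, here is a minimal sketch assuming NumPy; the label array y and the sample sizes are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(42)
y = np.array([0] * 90 + [1] * 10)          # imbalanced toy class labels
indices = np.arange(len(y))

# simple random sampling without replacement
without_repl = rng.choice(indices, size=20, replace=False)

# simple random sampling with replacement
with_repl = rng.choice(indices, size=20, replace=True)

# stratified sampling: draw proportionally from each class
sample_frac = 0.2
stratified = np.concatenate([
    rng.choice(indices[y == c], size=int(sample_frac * np.sum(y == c)), replace=False)
    for c in np.unique(y)
])
print(len(without_repl), len(with_repl), len(stratified))   # 20 20 20
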
 2.3.3. Dimensionality reduction
 Curse of dimensionality (COD)
 Purpose : avoid COD, reduce processing time, better visualization, reduce noise
 Techniques : PCA (Principal Components Analysis), SVD (Singular Value Decomposition)
 2.3.4. Feature subset selection
 Remove redundant features, irrelevant features
 Techniques : brute force, embedded, filter, wrapper
 2.3.5. Feature Creation
 Create new attributes from existing
 Methods : feature extraction, mapping data to new space (Fourier transformation, wavelet transformation), feature construction
 2.3.6. Discretization & Binarization
 discretization, binarization, discretization of continuous attributes
 Unsupervised discretization (no class labels) : equal width, equal frequency, K-means (a small binning sketch follows below)
 Supervised discretization (using class labels) : entropy
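
A minimal sketch of unsupervised discretization (equal width and equal frequency), assuming NumPy; the toy attribute x and the number of bins are made up.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1000)    # toy continuous attribute
n_bins = 4

# equal width: split the range [min, max] into bins of equal width
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width = np.digitize(x, width_edges[1:-1])

# equal frequency: pick edges so each bin holds roughly the same number of points
freq_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
equal_freq = np.digitize(x, freq_edges[1:-1])

print(np.bincount(equal_width))   # bin counts vary
print(np.bincount(equal_freq))    # bin counts are roughly equal (~250 each)
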
 2.3.7. Variable Transformation
 Simple functions, normalization

2.4. Measures of Similarity & Dissimilarity

2.4.1. Basics
 definitions  similarity, dissimilarity
 transformations → e.g. $s' = (s - \min_{s})/(\max_{s} - \min_{s})$

2.4.2. S & D b/w simple attributes
 nominal : $d = 0$ if $x = y$, $d = 1$ if $x \neq y$; $s = 1 - d$
 ordinal : $d = |x - y|/(n - 1)$; $s = 1 - d$
 interval or ratio : $d = |x - y|$; $s = -d$ or $s = 1/(1 + d)$

2.4.3. Dissimilarities b/w data objects
 Distances (a small sketch follows this list)
 Euclidean distance : $d(x,y)=\sqrt{\sum_{k=1}^{n}(x_{k}-y_{k})^{2}}$
 Minkowski distance (Generalization) :
$d(x,y)=\left(\sum_{k=1}^{n}|x_{k}-y_{k}|^{r}\right)^{1/r}$
 City block (Manhattan, L1) distance : $r=1$; the Hamming distance is a common example for binary vectors
 Euclidean distance : $r=2$
 Supremum distance ($L_{\infty}$) : $r=\infty$
 Common properties : a. Positivity, b. Symmetry, c. Triangle inequality
 ’metric’ satisfies all common properties
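
A small sketch of the Minkowski family of distances above in plain NumPy; the function name and example points are mine, not the book's. r = 1, 2 and infinity give the city block, Euclidean and supremum distances respectively.

import numpy as np

def minkowski(x, y, r):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        return np.max(np.abs(x - y))             # supremum (L_inf) distance
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

x, y = [0, 2], [3, 6]
print(minkowski(x, y, 1))        # 7.0  (city block / L1)
print(minkowski(x, y, 2))        # 5.0  (Euclidean / L2)
print(minkowski(x, y, np.inf))   # 4.0  (supremum / L_max)
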

2.4.4 Similarities b/w data objects
 typical properties
 $s(x,y)=1$ only if $x=y$ (maximum similarity)

2.4.5 Examples of proximity measures
 Similarity coefficients (a small sketch follows this list)
 Simple Matching : $SMC=(f_{11}+f_{00})/(f_{01}+f_{10}+f_{11}+f_{00})$
 Jaccard : $JC=f_{11}/(f_{01}+f_{10}+f_{11})$ (ignores 0-0 matches)
 Cosine : $\cos(x,y)=x\cdot y/(\lVert x\rVert\,\lVert y\rVert)$
 $x\cdot y$ is the vector dot product $=\sum_{k=1}^{n}x_{k}y_{k}$
 $\lVert x\rVert=\sqrt{x\cdot x}$ is the length of the vector x
 Extended Jaccard : $EJ=x\cdot y/(\lVert x\rVert^{2}+\lVert y\rVert^{2}-x\cdot y)$
 Correlation → Pearson’s correlation
 $corr(x,y)=covariance(x,y)/(SD(x)\,SD(y))$
 $covariance(x,y)=\frac{1}{n-1}\sum_{k=1}^{n}(x_{k}-\bar{x})(y_{k}-\bar{y})$
 $\bar{x}=$ mean of x $=\frac{1}{n}\sum_{k=1}^{n}x_{k}$; $\bar{y}=$ mean of y
 $SD(x)=$ standard deviation of x $=\sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(x_{k}-\bar{x})^{2}}$
 Bregman divergence
 $D(x,y)=\phi(x)-\phi(y)-\langle\nabla\phi(y),\,x-y\rangle$
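
A sketch of a few of the similarity measures above, assuming NumPy; SMC and Jaccard expect binary vectors, while cosine and correlation work on any numeric vectors. The toy vectors are made up.

import numpy as np

def smc(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x == y)                       # (f11 + f00) / total number of attributes

def jaccard(x, y):
    x, y = np.asarray(x), np.asarray(y)
    both_zero = np.sum((x == 0) & (y == 0))
    return np.sum((x == 1) & (y == 1)) / (len(x) - both_zero)   # ignores 0-0 matches

def cosine(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x, y))                                             # 0.7
print(jaccard(x, y))                                         # 0.0
print(cosine([3, 2, 0, 5], [1, 0, 0, 0]))                    # ~0.487
print(np.corrcoef([3, 6, 0, 3, 6], [1, 2, 0, 1, 2])[0, 1])   # 1.0 (perfectly correlated)
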

2.4.6 Issues in Proximity calculation
 a. attributes with different scales and/or correlated attributes, b. objects composed of different types of attributes, c. attributes with different weights
 Standardization
 $mahalanobis(x,y)=(x-y)\,\Sigma^{-1}(x-y)^{T}$, where $\Sigma^{-1}$ is the inverse covariance matrix of the data (see the sketch below)
 Combining Similarities
 $similarity(x,y)=\sum_{k=1}^{n}w_{k}\delta_{k}s_{k}(x,y)\,/\,\sum_{k=1}^{n}\delta_{k}$
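
A minimal sketch of the Mahalanobis distance above, assuming NumPy; the correlated toy data and the point pair are made up.

import numpy as np

def mahalanobis(x, y, data):
    # the inverse covariance matrix rescales the differences so that
    # correlated / high-variance attributes do not dominate the distance
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ inv_cov @ diff))

rng = np.random.default_rng(1)
data = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
print(mahalanobis([0, 0], [1, 1], data))                 # shrinks along the correlated direction
print(float(np.linalg.norm(np.array([1.0, 1.0]))))       # plain Euclidean distance, ~1.414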

2.4.7 Selecting the right proximity measure
 Type of proximity should fit type of data
 Use Euclidean distance for dense continuous data
 Proximity is usually expressed in terms of differences (distances)
 Sparse data → ignore 00 matches → use Jaccard, cosine, Extended Jaccard.
 Time series → use Euclidean distance
 Time series with different magnitudes → use Correlation


That was a pretty long chapter, and I learnt a lot! The best thing I learnt outside the chapter was how to write mathematical equations in a blog: I used a Javascript library named MathJax, which renders LaTeX notation into HTML at runtime.
Ref : http://docs.mathjax.org/en/latest/start.html
Putting this snippet in the page worked
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"],
  jax: ["input/TeX","output/HTML-CSS"],
  tex2jax: {
    inlineMath: [ ['$','$'], ["\\(","\\)"] ],
    displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
    processEscapes: true
  },
  "HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript"></script>
Ref : http://narnia.cs.ttu.edu/drupal/node/129
References for this chapter :
 Text : Introduction to Data Mining  Chapter 2
 ppt : http://wwwusers.cs.umn.edu/~kumar/dmbook/dmslides/chap2_data.pdf
Chapter 3 - Exploring Data
Data exploration is a preliminary exploration of the data to better understand its characteristics.
 3 Exploring Data
 3.1 The Iris Data Set
 3.2 Summary Statistics
 3.2.1 Frequencies and the Mode
 3.2.2 Percentiles
 3.2.3 Measures of Location : Mean and Median
 3.2.4 Measure of Spread : Range and Variance
 3.2.5 Multivariate Summary Statistics
 3.2.6 Other ways to Summarize the Data
 3.3 Visualization
 3.3.1 Motivations for Visualization
 3.3.2 General Concepts
 3.3.3 Techniques
 3.3.4 Visualizing Higher dimensional data
 3.3.5 Do’s and Don’ts
 3.4 OLAP and Multidimensional Analysis
 Representing Iris data as a multidimensional array
 Multidimensional data : The General Case
 Analyzing multidimensional data
 Final comments on multidimensional data
April 2014
Chapter 4 - Classification
Classification is the task of assigning objects to one of several predefined categories. E.g., detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes.
 4 Classification
 4.1 Preliminaries
 4.2 General Approach to solving a classification problem
 4.3 Decision Tree Induction
 4.3.1 How a Decision Tree works
 4.3.2 How to build a Decision Tree
 4.3.3 Methods for expressing attribute test conditions
 4.3.4 Measure for selecting the Best Split
 4.3.5 Algorithm for Decision Tree induction
 4.3.6 An Example : Web Robot Detection
 4.3.7 Characteristics of Decision Tree Induction
 4.4 Model Overfitting
 4.4.1 Overfitting due to presence of noise
 4.4.2 Overfitting due to lack of representative samples
 4.4.3 Overfitting and the multiple comparison procedure
 4.4.4 Estimation of Generalization Errors
 4.4.5 Handling overfitting in Decision Tree Induction
 4.5 Evaluating the performance of a Classifier
 4.5.1 Holdout method
 4.5.2 Random subsampling
 4.5.3 Cross Validation
 4.5.4 Bootstrap
 4.6 Methods for Comparing Classifiers
 Estimating a confidence interval for accuracy
 Comparing the performance of two models
 Comparing the performance of two classifiers
May 2014
Chapter 8 - Cluster Analysis
Cluster analysis could be defined as “Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.” In short, cluster analysis divides the data into groups(clusters) that are meaningful, useful or both. There are many applications of cluster analysis to practical problems.

Clustering for Understanding
 Biology  creating taxonomy
 Information Retrieval  grouping search results from the web
 Climate  finding patterns in the atmosphere
 Psychology and medicine  identify illness
 Business  segment customers

Clustering for Utility
 Summarization  apply costly algorithms to cluster prototypes instead of the entire dataset.
 Compression  prototypes are used when data objects are similar and loss is acceptable.
 Efficiently finding nearest neighbors  use prototypes to reduce number of pairwise distance computations.

8.1 Overview
 8.1.1 What is Cluster Analysis
 Defined above.
 Groups data objects based only on information found in the data that describes the objects and their relationships.
 Notion of a cluster is ambiguous
 Classification is “Supervised Classification” and Cluster analysis is “Unsupervised Classification”
 “Segmentation” and “Partitioning” are also sometimes synonymous with Clustering
 What is not Cluster Analysis
 Supervised Classification
 Simple segmentation
 Results of a query
 Graph partitioning
 8.1.2 Different types of Clusterings
 Hierarchical vs Partitional
 Hierarchical (nested)  leaves are singletons
 Partitional (unnested)  non overlapping subsets
 Exclusive vs Overlapping vs Fuzzy
 Exclusive  assign each object to a single cluster
 Overlapping  objects can simultaneously belong to more than one cluster
 Fuzzy  every object belongs to every cluster with a membership weight between 0(absolutely doesn’t belong) and 1(abs belongs)
 caveat : it doesn’t address true multiclass situations since the weights need to add up to 1.
 Complete vs Partial
 Complete  assigns every object to a cluster
 Partial  doesn’t assign every object to a cluster
 Heterogeneous vs Homogeneous
 8.1.3 Different types of Clusters
 Well Separated  any point in a cluster is closer (more similar) to every other point in the cluster than to any point not in the cluster. Can have any shape.
 Prototype based (center based)  an object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster.
 for data with continuous attributes, center is often the centroid. when centroid is not meaningful (data with categorical attributes), centre is often the medoid.
 tends to be globular.
 Graph based  group of objects that are connected to one another but have no connection to objects outside the group. (Connected Components)
 Contiguity-based : two objects are connected only if they are within a specified distance of each other. Useful when clusters are irregular or intertwined.
 Clique  another example of graph based cluster.
 Density based  a cluster is a dense region of objects that is surrounded by a region of low density. Used when clusters are irregular or intertwined and when noise and outliers are present.
 Shared property (conceptual clustering)  Finds clusters that share some common property or represent a particular concept.
 Objective function  Finds clusters that minimize or maximize an objective function.
 Characteristics of Input data
 Type of proximity/density measure
 Sparseness
 Attribute Type
 Type of data
 Dimensionality
 Noise and Outliers
 Type of Distribution
 Clustering Algorithms
 K-means : a prototype-based, partitional clustering technique that finds K clusters represented by their centroids.
 Hierarchical Clustering : produces nested clusters whose leaves are singletons; built top down or bottom up.
 DBSCAN : a density-based clustering technique that produces a partitional (and possibly partial) clustering; the number of clusters is determined automatically by the algorithm.

8.2 K-means
 Partitional clustering technique in which each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be given as input.
 8.2.1 The basic K-means algorithm (a minimal sketch appears at the end of this subsection)
 Initial centroids are often chosen randomly.
 The centroid is generally the mean of the points in the cluster
 Closeness is measured by Euclidean distance, cosine similarity, correlation ..etc.
 Most of the convergence happens in the first few iterations
 Complexity is O(I * K * m * n)
 Assigning points to the closest centroid  Manhattan(L1) distance can be used for Euclidean data, Jaccard measure for document data
 Centroids and Objective function  e.g for objective function : minimize the squared distance of each point to its closest centroid.
 Data in Euclidean space - the objective function used is SSE (Sum of Squared Errors), also known as ‘scatter’: $SSE=\sum_{i=1}^{K}\sum_{x\in C_{i}}dist(c_{i},x)^{2}$
 where dist is the standard Euclidean (L2) distance b/w objects.
 Centroid that minimizes the SSE of a cluster is the mean
 Given 2 different sets of clusters, we prefer one with lower SSE.
 Document data
 represented as a DTM(document term matrix)
 objective is to maximize similarity of the objects in the cluster to cluster centroid.
 The analogous quantity to SSE is ‘total cohesion’
 The General case
 There are many choices for the proximity function, centroid and objective function that can be used in basic kmeans algorithm and that are guaranteed to converge.
 Bregman divergence includes squared Euclidean distance, the Mahalanobis distance and cosine similarity.
 Choosing initial centroids
 Eg. 1  Poor initial centroids : Randomly selected initial centroids may provide suboptimal clustering with a higher SSE.
 Eg. 2  Limits of random initialization
 If there are K ‘real’ clusters, chance of selecting one centroid from each is very small.
 Solutions to Initial centroids problem
 Perform multiple runs
 One way to minimize the effect or a poor random initial choice of centroids is to perform multiple runs.
 But this need not give an optimal clustering either: if a centroid falls between a pair of clusters, it can get stuck there and merge the two into one cluster.
 Use hierarchical clustering on a sample
 One effective approach to counter the limitations of random poor initial centroids is to take a sample of points and cluster them using hierarchical clustering technique and use those centroids as initial centroids.
 But this is practical only if sample is relatively small and k is small compared to sample size.
 Iterative approach
 Select first point at random (or take centroid of all points)
 for each successive initial centroid, select a point that is farthest from the previously selected centroids.
 the downside is that this could select outliers.
 Select more than K centroids
 Post processing
 Bisecting Kmeans
 Time and Space complexity
 Storage required : O((m + K)n) , m = number of points, n = number of attributes
 Time required : O(I * K * m * n) , I = number of iterations
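
A bare-bones sketch of the basic K-means loop described in this subsection, assuming NumPy and Euclidean distance; the random initialization, the toy blobs and all names are illustrative, not the book's code.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. select K points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute the centroid of each cluster as the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. repeat until the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = np.sum((X - centroids[labels]) ** 2)   # objective: sum of squared errors
    return centroids, labels, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 3))])
centroids, labels, sse = kmeans(X, k=3)
print(centroids.round(2), round(float(sse), 2))
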
 8.2.2 K-means: Additional issues
 Handling empty clusters
 Basic Kmeans algo can yield empty clusters.
 Choose a replacement otherwise SSE will be large
 Approach 1 : Choose a replacement point that is farthest from any current centroid
 Approach 2 : Choose a replacement point from the centroid that has highest SSE. This would split the cluster and reduce overall SSE.
 Outliers
 If there are outliers, centroids may not be as representative as they ought to be and SSE would be higher as well.
 One approach is to discover outliers and eliminate them beforehand. But there could be situations where outliers are important (compression, financial data)
 Another approach is to identify outliers in a postprocessing step, by identifying points whose contribution to the SSE is unusually high.
 Also we could eliminate small clusters
 Reducing the SSE with post processing
 One way to reduce SSE is to increase number of clusters (larger K). 2 strategies that decrease SSE by increasing K are 
 Split a cluster  Choose a cluster with largest SSE or largest standard deviation for an attribute.
 Introduce a new centroid  One approach is to choose a point farthest from any cluster center. Another is to choose a point at random.
 2 strategies that decrease K and try to minimize the SSE increase are 
 Disperse a cluster  removing a centroid and assigning the points to other clusters. Disperse a cluster that doesn’t contribute much to the total SSE.
 Merge 2 clusters  Clusters with closest centroids are usually chosen. Strategy is same as that used in hierarchical clustering.
 Commonly used approach is to alternate cluster splitting and merging phases.
 Updating centroids incrementally
 In the basic Kmeans algorithm, centroids are updated after all points are assigned to a centroid. An alternative is to update the centroids after each assignment (incremental approach). Each assignment updates 0 or 2 centroids.
 Pros
 Guarantees that empty clusters are not produced.
 Relative weights may be adjusted. Better accuracy and faster convergence are possible if weights are used.
 Cons:
 It is more expensive
 Introduces an order dependency (the clusters produced depend on the order in which points are processed), although this can be mitigated by randomizing the order.
 8.2.3 Bisecting K-means
 A straightforward extension of the basic Kmeans algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
 There are different ways to choose which cluster to split, say using size or SSE.
 Usually the clusters are refined by using the resulting centroids as initial centroids for the basic K-means algorithm. This is needed because bisecting K-means applies the K-means algorithm only ‘locally’, to the cluster being split (see the sketch below).
 Bisecting Kmeans is less susceptible to initialization problems.
 Also we can use bisecting kmeans for hierarchical clustering.
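
A rough sketch of bisecting K-means as described above, assuming scikit-learn is available; it repeatedly splits the cluster with the largest SSE using ordinary 2-means until K clusters remain.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, k, seed=0):
    clusters = [X]                     # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE to split
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # split it into two using the basic K-means algorithm
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for i, c in enumerate(bisecting_kmeans(X, k=4)):
    print(i, len(c), c.mean(axis=0).round(2))
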
 8.2.4 K-means and different types of clusters
 Kmeans has difficulty in detecting natural clusters when clusters have nonspherical shapes or widely different sizes or densities.
 This could be overcome if the user is willing to increase K and have more clusters. So a cluster might have many sub clusters.
 8.2.5 Strengths and weaknesses
 Strengths : Used for wide variety of data types, Very efficient
 Weaknesses : Cannot handle nonglobular clusters, not good with data that has outliers.
 8.2.6 K-means as an optimization problem
 Gradient descent : involves picking an initial solution and then repeating the following two steps: compute the change to the solution that best optimizes the objective function, and then update the solution.
 Derivation of Kmeans as an algorithm to minimize the SSE (sum of squared errors): Proof that the best centroid for minimizing the SSE of a cluster is the mean of the points in the cluster.
 Derivation of Kmeans for SAE (sum of absolute errors) : Ck is the median of the cluster.

8.3 Agglomerative Hierarchical Clustering
 2 approaches of hierarchical clustering 
 Agglomerative  Start with the points as individual clusters and, at each step, merge the closest pair of clusters.
 Divisive  Start with one, allinclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.
 Often displayed graphically using a tree-like diagram called a dendrogram, which displays both the subcluster relationships and the order in which the clusters were merged.
 Don’t have to assume any particular number of clusters.
 May correspond to meaningful taxonomies.
 8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm
 Key operation is the computation of proximity
 Defining proximity b/w clusters
 Cluster proximity is typically defined with a particular type of cluster in mind. For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and Group Average, come from a graphbased view of clusters.
 Time and Space complexity
 The proximity matrix requires the storage of (1/2)m^2 proximities (assuming the matrix is symmetric). The space needed to keep track of the clusters is proportional to the number of clusters. Hence, the total space complexity is O(m^2).
 There are m steps, and at each step the m × m proximity matrix must be updated and searched. So the time complexity is O(m^3). It can be reduced to O(m^2 log m) by keeping the data in a sorted list or heap.
 8.3.2 Specific Techniques of similarity measures (a linkage comparison sketch follows this subsection)
 MIN (single link)
 Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.
 Strength : Can handle non-elliptical shapes.
 Weakness : Sensitive to noise and outliers.
 MAX (complete link)
 Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters.
 Strength : Less susceptible to noise and outliers
 Weakness : Tends to break large clusters, Biased towards globular clusters
 Group Average
 Proximity of two clusters is the average of pairwise proximity between points in the two clusters.
 Compromise between Single and Complete Link
 Strength : Less susceptible to noise and outliers
 Weakness : Biased towards globular clusters
 Ward’s method
 Similarity of two clusters is based on the increase in squared error when two clusters are merged.
 Similar to group average if distance between points is distance squared.
 Hierarchical analogue of Kmeans – Can be used to initialize Kmeans.
 Strength : Less susceptible to noise and outliers
 Weakness : Biased towards globular clusters
 Distance between centroids : Similarity of two clusters is based on the distance between the centroids in the different clusters.
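
A short sketch comparing the linkage criteria above, assuming SciPy and scikit-learn are available; the synthetic blobs and the cut at 3 clusters are arbitrary choices.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # (m - 1) x 4 merge history
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram at 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
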
 8.3.3 The Lance-Williams formula for Cluster Proximity
 A general formula for expressing hierarchical clustering proximity
 Any of the cluster proximities that we have discussed above can be viewed as a choice of different parameters in the Lance-Williams formula
 8.3.4 Key Issues in Hierarchical clustering
 Lack of a global objective function : It is greedy technique in which optimization is done locally at that specific moment.
 Ability to handle different cluster sizes : In order to take care of the relative sizes of the clusters that are merged, we could take weights into account.
 UPGMA  Unweighted Pair Group Method using Arithmetic averages.
 WPGMA  Weighted Pair Group Method using Arithmetic averages.
 Merging decisions are final: Once the decision is made, it cannot be undone at a later time.
 8.3.5 Strengths and Weaknesses : Already discussed above.

8.4 DBSCAN
 Densitybased clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN is a simple and effective densitybased clustering algorithm.
 8.4.1 Traditional Density: Centerbased approach
 In the centerbased approach, density is estimated for a particular point in the data set by counting the number of points within a specified radius, Eps, of that point. This includes the point itself.
 Classification of Points According to CenterBased Density
 Core point: A point is a core point if it has more than a specified number of points (MinPts) within Eps.
 Border point: A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
 Noise point: A noise point is any point that is neither a core point nor a border point.
 8.4.2 The DBSCAN algorithm
 Time and Space complexity
 The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-neighborhood), where m is the number of points. In the worst case, this complexity is O(m^2). If a different data structure is used, it could be as low as O(m log m).
 The space requirement is O(m).
 Selection of DBSCAN Parameters
 The big question for DBSCAN is “how to determine the parameters Eps and MinPts ?”
 The basic approach is to look at the behavior of the distance from a point to its kth nearest neighbor(kdist).
 Usually points within cluster will have small kdist and noise points will have large kdist. So if we compute the kdist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of kdist that corresponds to a suitable value of Eps.
 Clusters of Varying Density
 DBSCAN can have trouble with density if the density of clusters varies widely.
 If the Eps threshold is low and the densities vary, the denser clusters will be picked up while the sparser ones might be categorized as noise (see the sketch below).
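
A sketch of the k-dist heuristic for choosing Eps, followed by DBSCAN itself, assuming scikit-learn; k = MinPts = 4 and the 95th-percentile “knee” guess are arbitrary illustrative choices.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)
k = 4

# distance from each point to its k-th nearest neighbour, sorted ascending;
# the knee of this curve suggests a value for Eps
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
eps = k_dist[int(0.95 * len(k_dist))]      # crude knee guess: 95th percentile

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
print("eps ~", round(float(eps), 3),
      "| clusters:", len(set(labels) - {-1}),
      "| noise points:", int(np.sum(labels == -1)))
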
 8.4.3 Strengths and Weaknesses
 Strengths
 Resistant to Noise
 Can handle clusters of different shapes and sizes
 Weaknesses
 Doesn’t work well for clusters with varying densities
 Doesn’t work well for Highdimensional data

8.5 Cluster Evaluation
 For supervised classification we have a variety of measures to evaluate how good our model is – Accuracy, precision, recall
 For cluster analysis, the analogous question is “how to evaluate ‘the goodness’ of the resulting clusters?”
 The difficulty  Many times clustering is done as part of exploratory data analysis. So validation may not be needed. Also, different clustering algorithms might need different evaluation techniques  SSE might be suited for Kmeans but not for DBSCAN.
 8.5.1 Overview
 Aspects of Cluster Validation
 Determining the clustering tendency of a set of data, (distinguishing whether nonrandom structure actually exists in the data).
 Determining the correct number of clusters.
 Evaluating how well the results of a cluster analysis fit the data without reference to external information.
 Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
 Comparing two sets of clusters to determine which is better.
 Items 1, 2, and 3 do not make use of any external information  they are unsupervised techniques  while item 4 requires external information. Item 5 can be performed in either a supervised or an unsupervised manner.
 The evaluation measures (indices), that are applied to judge various aspects of cluster validity are traditionally classified into the following three types 
 Unsupervised (Internal indices) 
 Measures the goodness of a clustering structure without referring to external information.
 e.g SSE
 Further divided into 2 classes  measures of cluster cohesion, cluster separation
 Supervised (External indices)
 Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.
 e.g Entropy (measures how well cluster labels match externally supplied class labels)
 Relative
 Compares different clusterings or clusters
 e.g Two Kmeans clusterings can be compared using either the SSE or entropy.
 8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation
 Cluster validity for a set of K clusters is a weighted sum of the validity of individual clusters.
 The validity function can be cohesion, separation or some combination of these. The weights will vary depending on the validity measure. For cohesion, higher values are better, for separation, lower values are better.
 GraphBased View of Cohesion and Separation
 Cohesion of a cluster = sum of the weights of the links in the proximity graph that connect points within the cluster.
 Separation between two clusters = sum of the weights of the links from points in one cluster to points in the other cluster.
 The proximity function can be a similarity, a dissimilarity, or a simple function of these quantities
 PrototypeBased View of Cohesion and Separation
 Cohesion of a cluster = sum of the proximities with respect to the prototype (centroid or medoid) of the cluster.
 Separation between two clusters = proximity of the two cluster prototypes.
 There are two measures for separation because the separation of cluster prototypes from an overall prototype is sometimes directly related to the separation of cluster prototypes from one another
 Overall Measures of Cohesion and Separation
 The previous measures (cohesion & separation) can be combined into an overall measure of cluster validity by using a weighted sum.
 Relationship between PrototypeBased Cohesion and GraphBased Cohesion
 For some proximity measures, the measures of cohesion and separation are equivalent for Graphbased and prototypebased approaches even though they seem distinct.
 For instance, for the SSE and points in Euclidean space, it can be shown that the average pairwise distance between the points in a cluster is equivalent to the SSE of the cluster
 Two Approaches to PrototypeBased Separation
 When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between-group sum of squares (SSB), which is the sum of the squared distances of each cluster centroid to the overall mean of the data, weighted by cluster size.
 The higher the total SSB of a clustering, the more separated the clusters are from one another.
 Relationship between Cohesion and Separation
 In some cases there is a strong relationship b/w cohesion and separation. It is possible to show that the sum of the total SSE and the total SSB is a constant.
 Total sum of Squares (TSS) = sum of squares of the distance of each point to the overall mean of the data.
 TSS = SSE + SSB
 The importance of this result is that minimizing SSE (cohesion) is equivalent to maximizing SSB (separation). A small numeric check follows below.
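
A tiny numeric check of TSS = SSE + SSB, assuming NumPy and scikit-learn; the random data and K = 4 are arbitrary. The identity holds because K-means centroids are the means of their clusters.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

overall_mean = X.mean(axis=0)
tss = np.sum((X - overall_mean) ** 2)                                  # total sum of squares
sse = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)               # cohesion
ssb = sum(np.sum(km.labels_ == i) * np.sum((c - overall_mean) ** 2)    # separation
          for i, c in enumerate(km.cluster_centers_))
print(round(float(tss), 6), round(float(sse + ssb), 6))                # the two numbers agree
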
 Evaluating Individual Clusters and Objects
 Many of the measures that we studied above can be used to evaluate individual clusters and objects.
 We could rank individual clusters according to their specific value of cluster validity (cohesion or separation) and make some observations 
 Since higher value of cohesion may be better than lower, we could decide to split a cluster with lower value.
 If separation between clusters is low, we could decide to merge the clusters.
 Objects that contribute more to the cohesion and separation are probably near the interior of the cluster and others are probably near the edge.
 The Silhouette Coefficient
 How to compute Silhouette coefficient
 For the ith object, calculate its average distance to all other objects in its cluster; call this value $a_{i}$.
 For the ith object and any cluster not containing the object, calculate the object’s average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value $b_{i}$.
 Silhouette coefficient : $s_{i}=(b_{i}-a_{i})/\max(a_{i},b_{i})$ (a small sketch follows this list).
 The value of the silhouette coefficient can vary between -1 and 1.
 When $a_{i}=0$, $s_{i}=1$.
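
A direct sketch of this computation (a_i, b_i, s_i), with scikit-learn's silhouette_score as a cross-check; the synthetic blobs and K = 3 are arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise Euclidean distances

sil = []
for i in range(len(X)):
    same = labels == labels[i]
    a_i = D[i, same & (np.arange(len(X)) != i)].mean()       # average distance within own cluster
    b_i = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
    sil.append((b_i - a_i) / max(a_i, b_i))

print(round(float(np.mean(sil)), 4))
print(round(float(silhouette_score(X, labels)), 4))          # should match the line above
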
 8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix
 Measuring Cluster Validity via Correlation
 Input → Similarity matrix for a data set and the cluster labels from a cluster analysis of the data set.
 Output → Evaluate “goodness” of the clustering by looking at the correlation between the similarity matrix and the cluster labels.
 While computing the correlation, since the matrices are symmetric, only the correlation between n (n  1) / 2 entries needs to be calculated.
 High correlation indicates that points that belong to the same cluster are close to each other.
 Judging a Clustering Visually by Its Similarity Matrix
 If we sort the rows and columns of the similarity matrix so that all objects belonging to the same class are together, then an ideal similarity matrix has a block diagonal structure.
 Wellseparated clusters show a very strong, block diagonal pattern in the reordered similarity matrix.
 Computation of the proximity matrix takes O(m ^ 2) time
 8.5.4 Unsupervised Evaluation of Hierarchical Clustering
 Previous approaches are for partitional clusterings.
 Cophenetic correlation is a popular evaluation measure for hierarchical clusterings
 Cophenetic distance between two objects = Proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time.
 The Cophenetic Correlation Coefficient (CPCC) is the correlation between the entries of the cophenetic distance matrix and the original dissimilarity matrix, and is a standard measure of how well a hierarchical clustering (of a particular type) fits the data.
 8.5.5 Determining the Correct Number of Clusters
 Plot SSE and Silhouette coefficient for different values of k.
 Choose k where there is a distinct knee in the SSE and a distinct peak in the silhouette coefficient
 8.5.6 Clustering Tendency
 Clustering algorithms produce clusters for any input data. But can the data even be clustered ?
 One approach is to evaluate the resulting clusters and only claim that a data set has clusters if at least some of the clusters are of ‘good’ quality. To make sure that the poor quality is not due to wrong choice of algorithm, we could use multiple algorithms and again evaluate the quality of the resulting clusters.
 Another approach is to try to evaluate whether a data set has clusters without clustering. Most common approach, especially for data in Euclidean space, has been to use statistical tests for spatial randomness
 8.5.7 Supervised Measures of Cluster Validity
 If we have external information about data, say in the form of class labels for the data objects, we could measure the degree of correspondence between the cluster labels and the class labels.
 We could have different approaches to measure the extent to which two objects that are in the same class are in the same cluster. There are 2 main approaches  classification oriented and similarity oriented.
 ClassificationOriented Measures of Cluster Validity
 Uses measures from classification, such as entropy, purity, and the F-measure (a small entropy/purity sketch follows this list). These measures evaluate the extent to which a cluster contains objects of a single class.
 Entropy : The degree to which each cluster consists of objects of a single class
 Purity : Another measure of the extent to which a cluster contains objects of a single class.
 Precision : The fraction of a cluster that consists of objects of a specified class.
 Recall : The extent to which a cluster contains all objects of a specified class.
 F-measure : A combination of both precision and recall that measures the extent to which a cluster contains only objects of a particular class and all objects of that class.
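
A small sketch of cluster entropy and purity given cluster labels and true class labels, assuming NumPy; the toy assignments are made up.

import numpy as np

def cluster_entropy_purity(clusters, classes):
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    total_entropy, total_purity = 0.0, 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]
        probs = np.bincount(members) / len(members)   # class distribution inside cluster c
        probs = probs[probs > 0]
        weight = len(members) / len(classes)
        total_entropy += weight * -np.sum(probs * np.log2(probs))
        total_purity += weight * probs.max()
    return total_entropy, total_purity

clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
classes  = [0, 0, 1, 1, 1, 1, 2, 2, 0]
print(cluster_entropy_purity(clusters, classes))   # roughly (0.61, 0.78)
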
 SimilarityOriented Measures of Cluster Validity
 We can view this approach to cluster validity as involving the comparison of two matrices:
 (1) the ideal cluster similarity matrix discussed previously, which has a 1 if two objects are in the same cluster and 0 otherwise, and
 (2) an ideal class similarity matrix defined with respect to class labels, which has 1 if objects are in the same class and 0 otherwise.
 Rand statistic and Jaccard coefficient are two of the most frequently used cluster validity measures.
 Cluster Validity for Hierarchical Clusterings
 Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, including the fact that a preexisting hierarchical structure often does not exist
 To evaluate a hierarchical clustering, we compute, for each class, the F-measure for each cluster in the cluster hierarchy.
 For each class, we take the maximum F-measure attained for any cluster.
 Finally, we calculate an overall F-measure for the hierarchical clustering by computing the weighted average of all per-class F-measures, where the weights are based on the class sizes.
 8.5.8 Assessing the Significance of Cluster Validity Measures
 How do we interpret the significance of the number given by a cluster evaluation technique ?
 Need to be aware of the basic definitions of evaluation measures used, like  a purity of 0 is bad, while a purity of 1 is good, an entropy of 0 is good, as is an SSE of 0.
 If we don’t have an absolute standard, a common approach is to interpret the value of our validity measure in statistical terms. e.g  plotting a histogram of SSE for random data sets.
Wow ! That was a really long chapter and took 1 full day to read through! Thanks to Tan, Steinbach and Kumar for an excellent narrative.
I really cannot forget this quote 
“Just as people can find patterns in clouds, data mining algorithms can find clusters in random data. While it is entertaining to find patterns in clouds, it is pointless and perhaps embarrassing to find clusters in noise.”
Ref :
 Power Point slides : http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php#item4
 Sample Chapters : http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php#item2