Cluster Analysis
- Overview
Cluster analysis, also known as clustering, is a data mining method that groups similar data points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data points within each group are more similar to each other than to data points in other groups.
Clustering is used to identify groups of similar objects in datasets with two or more variable quantities. This data may be collected from marketing, biomedical, or geospatial databases, among many other places.
Cluster analysis is often used in conjunction with other analyses (such as discriminant analysis). The researcher must be able to interpret the cluster analysis based on their understanding of the data to determine if the results produced by the analysis are actually meaningful.
Cluster analysis is a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
Please refer to the following for more information:
- Wikipedia: Cluster Analysis
- Clustering Algorithms
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects.
However, different researchers employ different cluster models, and each of these cluster models can in turn be implemented by several different algorithms.
The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these "cluster models" is key to understanding the differences between the various algorithms.
Here are some methods for cluster analysis:
- K-Means Cluster: A method to quickly cluster large data sets. The researcher defines the number of clusters in advance.
- Hierarchical Cluster: A method widely used to analyze social network data, among other applications. Nodes are compared with one another based on their similarity, and larger groups are built step by step by joining the most similar nodes or groups.
- Fuzzy Clustering: A variation of centroid-based clustering that lets points belong to more than one cluster. Each point is assigned to one or more clusters through membership coefficients that represent its degree of similarity to each cluster.
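As an illustration of the K-Means idea above, here is a minimal sketch in plain Python. It is not a reference implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the sample points, and the choice to start from the first k points are all assumptions made for this example. Practical implementations usually choose starting centroids randomly or with a scheme such as k-means++.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100):
    """Cluster `points` into `k` groups; returns (labels, centroids).

    Simplification (assumption of this sketch): the first k points
    serve as the starting centroids, which keeps the run deterministic.
    """
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The number of clusters `k` is fixed by the caller before the run, which mirrors the requirement that the researcher defines the number of clusters in advance.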
- Key Concepts and Terms in Clustering
Cluster analysis, also called segmentation analysis or taxonomy analysis, is an exploratory analysis that tries to identify structures within the data. More specifically, it tries to identify homogeneous groups of cases when the grouping is not known in advance. Because it is exploratory, it does not make any distinction between dependent and independent variables.
Some key concepts and terms in clustering include:
- Data points: These are individual observations or instances in the dataset.
- Clusters: These are groups of similar data points.
- Centroids: These are the centers of the clusters, calculated as the mean of all data points in the cluster.
- Distance: This is the measure of dissimilarity between two data points.
- Linkage: This is the rule used to derive the distance between two clusters from the pairwise distances between their data points (for example, the smallest, largest, or average pairwise distance).
- Criterion: This is the objective function used to evaluate the quality of the clustering solution.
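- Density: This is the concentration of data points in a region of the data space. Density-based methods treat regions of high density as clusters and sparse regions as the gaps between them.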
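Several of the terms above can be made concrete with a short sketch in plain Python. The helper names and the two tiny clusters are hypothetical, chosen only for illustration. Linkage is taken here in its usual sense of a rule for the distance between two clusters, and the criterion shown, the within-cluster sum of squares, is just one common choice of objective function.

```python
import math


def euclidean(p, q):
    """Distance: Euclidean dissimilarity between two data points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Centroid: coordinate-wise mean of all points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def single_linkage(a, b):
    """Linkage (single): smallest pairwise distance between two clusters."""
    return min(euclidean(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Linkage (complete): largest pairwise distance between two clusters."""
    return max(euclidean(p, q) for p in a for q in b)


def within_cluster_ss(clusters):
    """Criterion: sum of squared distances from points to their centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(euclidean(p, c) ** 2 for p in cluster)
    return total


# Hypothetical data: two small clusters of 2-D data points.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]

print(centroid(a), centroid(b))   # cluster centers
print(single_linkage(a, b))       # closest pair across the two clusters
print(complete_linkage(a, b))     # farthest pair across the two clusters
print(within_cluster_ss([a, b]))  # lower values mean tighter clusters
```

The choice of linkage changes which clusters count as "close": single linkage looks only at the nearest pair of points, while complete linkage looks at the farthest pair.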
- The Research Questions Answered by Cluster Analysis
Typical research questions answered by cluster analysis are as follows:
- Medicine – What diagnostic clusters exist among patients? To answer this question, researchers develop a diagnostic questionnaire that covers possible symptoms (e.g., psychological symptoms such as anxiety and depression). Cluster analysis can then identify groups of patients with similar symptoms.
- Marketing – What customer segments exist? To answer this question, market researchers might conduct a survey that covers customers' needs, attitudes, demographics, and behaviors. Researchers can then use cluster analysis to identify homogeneous groups of customers with similar needs and attitudes.
- Education – Which student groups require special attention? Researchers can measure psychological, ability, and achievement characteristics. Cluster analysis can then identify which homogeneous groups exist among students (for example, students who excel in all subjects, or students who excel in some subjects but fail in others).
- Biology – How can species be classified? Researchers can collect data on different plants and record different attributes of their phenotypes. Cluster analysis can group these observations into a series of clusters and help establish a taxonomy of groups and subgroups of similar plants.