Clustering

Overview

Clustering is an unsupervised learning method that groups similar data points based on their features. Clustering aims to find natural groupings within a dataset without prior knowledge of the class labels.

Partitional clustering divides a dataset into a predefined number of non-overlapping clusters. Partitional clustering algorithms, such as K-means, require the number of clusters (k) to be specified beforehand. The algorithm then iteratively assigns data points to their nearest cluster center based on a distance metric until convergence. Common distance measures used in partitional clustering include Euclidean distance, Manhattan distance, and cosine similarity.
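As a minimal illustration, the sketch below runs K-means with scikit-learn on synthetic data; the feature matrix and k = 3 are assumptions for demonstration, not the project's actual inputs. Note that scikit-learn's KMeans uses Euclidean distance.

```python
# Minimal K-means sketch on synthetic data (not the project's dataset).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # 100 observations, 4 numeric features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)     # cluster assignment for each observation
centers = kmeans.cluster_centers_  # final cluster centers (3 x 4)
print(labels[:10])
print(centers)
```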

Euclidean distance is perhaps the most intuitive distance metric. It is defined as the L-2 norm of the difference between two vectors:

||\mathbf{x} - \mathbf{y}||_2 = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}

Manhattan distance is defined as the L-1 norm of the difference between two vectors:

||\mathbf{x} - \mathbf{y}||_1 = \sum_{i=1}^{n}|x_i-y_i|

Cosine similarity is the cosine of the angle between two vectors. It is defined as:

\cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}||_2 \, ||\mathbf{y}||_2}

Unlike the two metrics above, this is a similarity rather than a distance; a common corresponding distance is 1 - \cos\theta.
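The snippet below computes all three quantities for a pair of example vectors using SciPy; note that scipy.spatial.distance.cosine returns the cosine distance, i.e., one minus the similarity.

```python
# Computing the three measures above for two example vectors.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print(euclidean(x, y))   # L-2 norm of x - y (Euclidean distance)
print(cityblock(x, y))   # L-1 norm of x - y (Manhattan distance)
print(1 - cosine(x, y))  # cosine similarity = 1 - cosine distance
```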

Alternatively, hierarchical clustering builds a hierarchy of clusters. It is more flexible than partitional clustering because it does not require the number of clusters to be specified beforehand; it is, however, computationally expensive and not well suited to large datasets. Whereas partitional algorithms assign outliers to the nearest cluster center, which may not reflect their true grouping, hierarchical algorithms allow outliers to remain as singleton clusters, making the approach more robust to outliers.
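A minimal sketch with SciPy, again on stand-in data: the full merge hierarchy is built first, and a cluster count is chosen only afterwards by cutting the dendrogram.

```python
# Agglomerative (hierarchical) clustering sketch on synthetic data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # stand-in for the numeric features

Z = linkage(X, method="ward")  # build the full merge hierarchy
# Cut the hierarchy into a chosen number of clusters after the fact
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```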

This project will use clustering to uncover potential similarities between the observations (countries). Ideally, clustering can help reveal how donor countries allocate foreign aid and which kinds of recipient countries are most appealing to aid donors.

Data Prep

Clustering algorithms can only be used with unlabeled numeric data. As an unsupervised method, clustering does not train on labels; instead, it forms clusters based on a specified distance metric, and these metrics are only defined for numeric data that can be represented as vectors. For this analysis, all qualitative variables have therefore been removed from the dataset, along with the label ‘Aid Level’.

Code
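A minimal sketch of the preparation step described above, assuming the data sit in a CSV file with a label column named ‘Aid Level’; the file name is a placeholder.

```python
# Prepare an unlabeled numeric feature matrix for clustering.
import pandas as pd

df = pd.read_csv("foreign_aid.csv")  # placeholder file name

# Drop the label, then keep only the numeric (quantitative) columns
features = df.drop(columns=["Aid Level"]).select_dtypes(include="number")
X = features.to_numpy()
```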

Results

K-Means

For K-means, a value of k needs to be specified. Testing k values from 1 to 10, the “elbow method” suggests that k = 3 is optimal for K-means clustering, as seen below.
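As a sketch of that procedure, the loop below fits K-means for each k and records the within-cluster sum of squares (scikit-learn's inertia_); the “elbow” is where the curve's improvement flattens. X stands in for the prepared feature matrix.

```python
# Elbow method sketch: inertia (within-cluster sum of squares) vs. k.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # stand-in for the prepared features

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```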

The silhouette plot shows that cluster ‘1’ dominates the data. The silhouettes for clusters ‘2’ and ‘3’ are much thinner than that of cluster ‘1’, with ‘2’ being practically non-existent. Additionally, the lengths of the silhouettes vary widely, implying a sub-optimal clustering.
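For reference, the per-cluster widths summarized above can be computed with scikit-learn's silhouette_samples; this sketch again uses stand-in data rather than the project's.

```python
# Silhouette analysis sketch: per-point scores grouped by cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # stand-in for the prepared features

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
scores = silhouette_samples(X, labels)  # one score per observation
print("mean silhouette:", silhouette_score(X, labels))
for c in np.unique(labels):
    member = scores[labels == c]
    print(f"cluster {c}: size={member.size}, mean width={member.mean():.3f}")
```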

Hierarchical Clustering

Hierarchical clustering suggests about 3 or 4 clusters, which is consistent with the results from the K-means section above.

Conclusions

The results of this analysis suggest that clustering may not be the best method for this research topic. In the case of K-means, the algorithm performed sub-optimally, with an overwhelming number of data points falling into a single cluster. While hierarchical clustering did group the countries as intended, it is difficult to determine how the hierarchical clusters can help answer the underlying research question. These results are not a knock on the method itself; rather, they indicate that the data do not naturally cluster into groups, and some other method or algorithm may prove more fitting for this analysis.