K-Means Clustering is an unsupervised learning algorithm for grouping data and is categorized as a partitioning clustering method. It divides instances into multiple groups based on their similarity. The algorithm iteratively adjusts the positions of the cluster centers (centroids) so that the distance between each centroid and the instances assigned to it is minimal. In general, standard K-means clusters data in the following steps:
1. Choose k initial cluster centers at random.
2. Assign every instance to the cluster whose center is nearest.
3. Update each cluster center's attributes to the mean of all instances assigned to it.
4. Repeat steps 2 and 3 until no centroid changes after the update in step 3.
Source: heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c
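The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the use of Euclidean distance, and the choice to draw the initial centers from the data itself are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Sketch of standard K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Choose k distinct instances at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign every instance to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of the instances assigned to it.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once no centroid changed after the update.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two visibly separated groups of 2-D instances.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.3, 8.8), (8.7, 9.4)]
centroids, clusters = kmeans(data, k=2)
```

Fixing `seed` only makes the random start reproducible; a different seed can give a different starting point and, on harder data, a different final clustering.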
As in K-Nearest Neighbor, k is a hyperparameter with no exact method for finding its optimal value, and a value that is too small or too large can lead to poor performance. Moreover, the clustering result depends strongly on the initial centroids: well-chosen starting points may lead to good results, and poorly chosen ones to poor results.
Therefore, several methods are used to guide the choice of k and of the centroids' initial values, such as domain expertise, K-means++, and the elbow method.
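As an illustration of one of these methods, the seeding idea behind K-means++ can be sketched as follows: the first centroid is drawn uniformly from the data, and each later centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, so the starting points tend to be spread out. The function name `kmeans_pp_init` and the `seed` parameter are hypothetical, and the sketch assumes the data contains at least k distinct instances.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Sketch of K-means++ seeding; assumes at least k distinct instances."""
    rng = random.Random(seed)
    # First centroid: one instance drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each instance to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2, so
        # instances far from the existing centroids are more likely picks.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical toy data, used only to show the call.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.3, 8.8), (8.7, 9.4)]
initial_centroids = kmeans_pp_init(data, k=2)
```

An instance that is already a centroid has weight zero and is never drawn again, which is why the returned centroids are distinct.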
The aim of K-Means Clustering is to group data so that the result is interpretable and provides useful information. Intra-cluster distance and inter-cluster distance are suitable measures of its effectiveness. Intra-cluster distance measures the total distance of all points inside one cluster; hence, the smaller the intra-cluster distance, the better the clustering performance. Inter-cluster distance, on the other hand, measures the distance between cluster centers; thus, the larger this value, the better the clustering result.
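Both measures can be computed in a few lines. The sketch below rests on one interpretation that the text does not spell out: intra-cluster distance is taken as the sum of Euclidean distances from each point in a cluster to that cluster's centroid, and inter-cluster distance as the Euclidean distance between each pair of centroids. The function names are hypothetical.

```python
import math
from itertools import combinations

def intra_cluster_distance(cluster, centroid):
    """Sum of distances from each point in one cluster to its centroid."""
    return sum(math.dist(p, centroid) for p in cluster)

def inter_cluster_distances(centroids):
    """Distance between every pair of cluster centers."""
    return [math.dist(a, b) for a, b in combinations(centroids, 2)]

# Two points, each 1 unit from the centroid (0, 1): total is 2.0.
intra = intra_cluster_distance([(0.0, 0.0), (0.0, 2.0)], (0.0, 1.0))
# Centers (0, 0) and (3, 4) are 5 units apart: result is [5.0].
inter = inter_cluster_distances([(0.0, 0.0), (3.0, 4.0)])
```

A good clustering under these definitions has a small `intra` value for every cluster and large values in `inter`.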
K-Means Clustering is an easy-to-implement and adaptable algorithm for grouping data with multiple attributes. Moreover, its modest time and space complexity make it applicable to large multidimensional data.