K-Means is a popular unsupervised learning algorithm used to partition data into K distinct clusters based on feature similarity. It is iterative and aims to minimize within-cluster variance by assigning each data point to the nearest cluster center (centroid) and then updating centroids until convergence.
What You’ll Learn
How K-Means works (intuition and algorithm steps)
Choosing the number of clusters (K)
Python implementation using scikit-learn
Evaluating clustering quality
Advantages and limitations
Practical tips for both beginners and professionals
How K-Means Works
Initialization
Select K initial centroids (randomly or via methods like K-Means++).
Assignment Step
For each data point, calculate the distance to each centroid (usually Euclidean).
Assign the point to the nearest centroid’s cluster.
Update Step
Compute the new centroid of each cluster by averaging the points assigned to it.
Repeat
Repeat Assignment and Update steps until centroids no longer move (convergence) or a maximum number of iterations is reached.
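To make these steps concrete, here is a minimal from-scratch sketch of the loop above (Lloyd's algorithm). The function name kmeans_sketch and its parameters are illustrative; for brevity it uses random initialization rather than K-Means++ and ignores the rare empty-cluster edge case:
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-Means: random init, Euclidean assignment, mean update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids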
Choosing K: The Elbow Method
Run K-Means for a range of K values (e.g., 1 through 10).
Compute the Sum of Squared Errors (SSE), also called the within-cluster sum of squares, for each K.
Plot SSE vs. K.
Look for the “elbow” point where SSE improvement significantly slows; that K is a good choice.
Python Implementation
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# 1. Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# 2. Use the Elbow Method to find optimal K
sse = []
k_values = range(1, 11)
for k in k_values:
    # n_init=10 runs K-Means with 10 centroid seedings and keeps the best
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the within-cluster SSE
plt.plot(k_values, sse, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("Elbow Method for Choosing K")
plt.show()
# 3. Apply K-Means with chosen K (e.g., 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# 4. Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=100, linewidths=2)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title(f"K-Means Clustering (K={optimal_k})")
plt.show()
Output: an elbow curve of SSE vs. K, followed by a scatter plot of the four clusters with their centroids marked by red ×'s.
Explanation
Data Generation: make_blobs creates a dataset with 4 centers.
Elbow Method: We plot SSE (inertia) for K from 1 to 10. The “elbow” indicates the optimal K.
Fit K-Means: We choose K = 4, fit the model, and get cluster labels and centroids.
Visualization: Clusters are colored by label; centroids marked with a red “×”.
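As a brief extension of this walkthrough, the fitted model can also assign previously unseen points to the learned clusters via predict. The new_points values below are purely illustrative:
import numpy as np

# Hypothetical new observations (illustrative coordinates)
new_points = np.array([[0.0, 4.0], [-6.0, -7.0]])
print(kmeans.predict(new_points))  # cluster index for each new point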
Evaluating Clustering Quality
Inertia (SSE)
Lower inertia indicates tighter clusters, but inertia always decreases as K increases, so raw values are not directly comparable across different K. This is why the elbow method looks at the rate of improvement rather than the absolute value.
Silhouette Score
Measures how similar a point is to its own cluster compared to other clusters: s = (b − a) / max(a, b), where a = average intra-cluster distance and b = average nearest-cluster distance. Scores range from −1 to 1, with higher values indicating better-separated clusters.
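As a quick check, scikit-learn computes this score directly. The snippet below assumes X, labels, and optimal_k from the implementation above:
from sklearn.metrics import silhouette_score

# Euclidean distance is the default metric
score = silhouette_score(X, labels)
print(f"Silhouette score for K={optimal_k}: {score:.3f}")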