K-Means is a popular unsupervised learning algorithm used to partition data into K distinct clusters based on feature similarity. It is iterative and aims to minimize within-cluster variance by assigning each data point to the nearest cluster center (centroid) and then updating centroids until convergence.
What You’ll Learn
How K-Means works (intuition and algorithm steps)
Choosing the number of clusters (K)
Python implementation using scikit-learn
Evaluating clustering quality
Advantages and limitations
Practical tips for both beginners and professionals
How K-Means Works
Initialization
Select K initial centroids (randomly or via methods like K-Means++).
Assignment Step
For each data point, calculate the distance to each centroid (usually Euclidean).
Assign the point to the nearest centroid’s cluster.
Update Step
Compute the new centroid of each cluster by averaging the points assigned to it.
Repeat
Repeat Assignment and Update steps until centroids no longer move (convergence) or a maximum number of iterations is reached.
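To make these steps concrete, here is a minimal from-scratch sketch of the loop above (Lloyd's algorithm). The function name kmeans_sketch and its parameters are illustrative; for brevity it uses random initialization rather than K-Means++ and ignores the rare empty-cluster edge case:
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-Means: random init, Euclidean assignment, mean update."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids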
Choosing K: The Elbow Method
Run K-Means for a range of K values (e.g., 1 through 10).
Compute the Sum of Squared Errors (SSE), also called the within-cluster sum of squares, for each K.
Plot SSE vs. K.
Look for the “elbow” point where SSE improvement significantly slows; that K is a good choice.
Python Implementation
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# 1. Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# 2. Use the Elbow Method to find optimal K
sse = []
k_values = range(1, 11)
for k in k_values:
    # n_init=10 runs K-Means with 10 centroid seedings and keeps the best
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)  # inertia_ is the within-cluster SSE
plt.plot(k_values, sse, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Sum of Squared Errors (SSE)")
plt.title("Elbow Method for Choosing K")
plt.show()
# 3. Apply K-Means with chosen K (e.g., 4)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
# 4. Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='x', s=100, linewidths=2)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title(f"K-Means Clustering (K={optimal_k})")
plt.show()
Output: an elbow curve of SSE vs. K, followed by a scatter plot of the four clusters with their centroids marked by red ×'s.
Explanation
Data Generation: make_blobs creates a dataset with 4 centers.
Elbow Method: We plot SSE (inertia) for K from 1 to 10. The “elbow” indicates the optimal K.
Fit K-Means: We choose K = 4, fit the model, and get cluster labels and centroids.
Visualization: Clusters are colored by label; centroids marked with a red “×”.
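As a brief extension of this walkthrough, the fitted model can also assign previously unseen points to the learned clusters via predict. The new_points values below are purely illustrative:
import numpy as np

# Hypothetical new observations (illustrative coordinates)
new_points = np.array([[0.0, 4.0], [-6.0, -7.0]])
print(kmeans.predict(new_points))  # cluster index for each new point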
Evaluating Clustering Quality
Inertia (SSE)
Lower inertia indicates tighter clusters, but inertia always decreases as K increases, so raw values are not directly comparable across different K. This is why the elbow method looks at the rate of improvement rather than the absolute value.
Silhouette Score
Measures how similar a point is to its own cluster compared to other clusters: s = (b − a) / max(a, b), where a = average intra-cluster distance and b = average nearest-cluster distance. Scores range from −1 to 1, with higher values indicating better-separated clusters.
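As a quick check, scikit-learn computes this score directly. The snippet below assumes X, labels, and optimal_k from the implementation above:
from sklearn.metrics import silhouette_score

# Euclidean distance is the default metric
score = silhouette_score(X, labels)
print(f"Silhouette score for K={optimal_k}: {score:.3f}")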