Feature extraction is a critical step in machine learning pipelines where raw data (text, images, etc.) is converted into numerical representations that algorithms can understand.
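As a rough illustration of what "converting raw data into numbers" means, the sketch below builds word-count vectors by hand using only the standard library. The `docs` list and the whitespace tokenization are assumptions made for this example, and the library tools covered later handle these steps more carefully.

```python
from collections import Counter

# Hypothetical toy documents, used only for illustration
docs = ["the cat sat", "the dog sat down"]

# Lowercase and split on whitespace (a deliberately naive tokenizer)
tokens = [d.lower().split() for d in docs]

# The vocabulary is the sorted set of all words seen in any document
vocab = sorted({word for doc in tokens for word in doc})

# Each document becomes one row of counts, one column per vocabulary word
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)
print(vectors)
```

Each row has one entry per vocabulary word, so documents of different lengths end up as vectors of the same size.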
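We'll break it into two major sections: feature extraction from text and feature extraction from images.
For text, we will cover bag-of-words counts with CountVectorizer, TF-IDF weights with TfidfVectorizer, and Word2Vec embeddings with gensim.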
CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = [
'Machine learning is fascinating',
'Learning algorithms can be powerful',
'Text data needs preprocessing'
]
# Convert text to numeric features
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
# Feature names
print(vectorizer.get_feature_names_out())
print(X_bow.toarray())
Output:
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0 0 0 0 1 1 1 1 0 0 0 0]
[1 1 1 0 0 0 1 0 0 1 0 0]
[0 0 0 1 0 0 0 0 1 0 1 1]]
TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
# Display results
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())
Output:
['algorithms' 'be' 'can' 'data' 'fascinating' 'is' 'learning' 'machine'
'needs' 'powerful' 'preprocessing' 'text']
[[0. 0. 0. 0. 0.52863461 0.52863461
0.40204024 0.52863461 0. 0. 0. 0. ]
[0.46735098 0.46735098 0.46735098 0. 0. 0.
0.35543247 0. 0. 0.46735098 0. 0. ]
[0. 0. 0. 0.5 0. 0.
0. 0. 0.5 0. 0.5 0.5 ]]
Word2Vec with gensim
from gensim.models import Word2Vec
sentences = [s.lower().split() for s in corpus]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)
# Access the 50-dimensional vector for a word (exact values may vary between runs)
print(model.wv['learning'])
Output:
[-1.0724545e-03 4.7286271e-04 1.0206699e-02 1.8018546e-02
-1.8605899e-02 -1.4233618e-02 1.2917745e-02 1.7945977e-02
-1.0030856e-02 -7.5267432e-03 1.4761009e-02 -3.0669428e-03
-9.0732267e-03 1.3108104e-02 -9.7203208e-03 -3.6320353e-03
5.7531595e-03 1.9837476e-03 -1.6570430e-02 -1.8897636e-02
1.4623532e-02 1.0140524e-02 1.3515387e-02 1.5257311e-03
1.2701781e-02 -6.8107317e-03 -1.8928028e-03 1.1537147e-02
-1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
1.6154874e-02 -1.1861792e-02 9.0324880e-05 -9.5074680e-03
-1.9207101e-02 1.0014586e-02 -1.7519170e-02 -8.7836506e-03
-7.0199967e-05 -5.9236289e-04 -1.5322480e-02 1.9229487e-02
9.9641159e-03 1.8466286e-02]
For images, we will cover raw pixel features with OpenCV and deep features from a pretrained VGG16 network.
import cv2
import numpy as np
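# Load image as grayscale; cv2.imread returns None if the file cannot be read
image = cv2.imread('sample.jpg', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError('sample.jpg could not be read')
# Resize to a fixed 64x64 so every image yields the same number of features
image = cv2.resize(image, (64, 64))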
# Flatten to 1D array
features = image.flatten()
print("Feature shape:", features.shape)
Output:
Feature shape: (4096,)
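Flattened pixel vectors are often scaled and stacked into a single feature matrix before being passed to a model. The following is a minimal sketch of that step, assuming synthetic 64x64 grayscale arrays in place of real image files; the `images` array is made up for illustration and is not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for three 64x64 grayscale images (uint8, values 0-255)
images = rng.integers(0, 256, size=(3, 64, 64), dtype=np.uint8)

# Flatten each image into one row, then scale pixel values to the range [0, 1]
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

print("Feature matrix shape:", X.shape)
```

Each row of `X` corresponds to one image and each of the 4096 columns to one pixel position, which is the layout most scikit-learn estimators expect.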
VGG16
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np
# Load model without top classifier
model = VGG16(weights='imagenet', include_top=False)
# Load and preprocess image
img = image.load_img('sample.jpg', target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = preprocess_input(img_data)
# Extract features
features = model.predict(img_data)
print("Extracted features shape:", features.shape)
Output:
Extracted features shape: (1, 7, 7, 512)
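A classical model generally expects one flat vector per image, so a 4-D feature map like this is commonly either flattened or averaged over its spatial axes. The sketch below shows both options on a random array standing in for the VGG16 output, so it runs without downloading any weights; `fake_features` is a hypothetical placeholder, not real network output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the output above: (batch, height, width, channels)
fake_features = rng.random((1, 7, 7, 512)).astype(np.float32)

# Option 1: flatten everything after the batch axis (7 * 7 * 512 = 25088 values)
flat = fake_features.reshape(fake_features.shape[0], -1)

# Option 2: global average pooling over the two spatial axes (512 values)
pooled = fake_features.mean(axis=(1, 2))

print(flat.shape, pooled.shape)
```

Flattening keeps spatial detail at the cost of a much longer vector, while average pooling gives a compact 512-value descriptor per image.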