The previous chapter ended with an uncomfortable fact: the best models are trained on amounts of data and compute that almost no individual or small team can match. The good news is that you don't have to. Transfer learning lets you take a model that someone else trained on a giant dataset and adapt it to your own, much smaller problem. It is probably the single most useful practical technique in modern deep learning, and it is how most real projects actually get built.
Think back to the CNN chapter, where the early layers learned generic features like edges and textures, and only the later layers learned anything specific to the task. Those early features are not really about cats or cars; they are about images in general, and they are just as useful for a medical scanner as for a photo classifier. The same holds for language. A model trained on a huge text corpus has already learned grammar, word meaning, and a great deal about the world, none of which you want to learn again from your few thousand examples.
Transfer learning takes that pretrained knowledge and reuses it. You keep the general understanding the model already has and only teach it the part that is specific to your task. Because you are starting from something that already works, you need far less data and far less training time, and you often end up with a better model than you could have trained from scratch.
There are two main approaches, and the right one depends mostly on how much data you have.
The first is feature extraction. You take the pretrained model, freeze all of its existing layers so their weights don't change, and attach a small new section on top, often called the head, that you train on your data. The pretrained model acts as a fixed feature detector and you only learn how to map its features to your specific labels. This is fast, needs little data, and is the sensible first thing to try.
The second is fine-tuning. Here you also unfreeze some or all of the pretrained layers and continue training them on your data, usually at a much lower learning rate so you nudge the existing knowledge rather than overwrite it. Fine-tuning can reach higher accuracy because the model adapts its internal features to your domain, but it needs more data and more care to avoid destroying what the model already knew. A common recipe is to do feature extraction first, then unfreeze and fine-tune gently once the new head has settled.
This loads a CNN pretrained on a large image dataset, freezes it, and trains a new classifier on top:
```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(include_top=False, weights='imagenet',
                   input_shape=(224, 224, 3))
base.trainable = False  # feature extraction: freeze the base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dense(5, activation='softmax')  # your 5 classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```
To fine-tune afterwards, set base.trainable = True, recompile with a small learning rate using an optimizer like AdamW from the optimizers chapter, and train for a few more epochs.
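The fine-tuning step might look something like the sketch below. It is one possible arrangement, not the only one. The helper name `prepare_for_fine_tuning`, its parameters, and the default learning rate of 1e-5 are assumptions made for illustration. It also assumes a TensorFlow version recent enough to ship `optimizers.AdamW`. Keeping the BatchNormalization layers frozen is a common precaution rather than a requirement, since their running statistics can drift quickly on a small dataset.

```python
from tensorflow.keras import layers, optimizers


def prepare_for_fine_tuning(model, base, learning_rate=1e-5,
                            unfreeze_last=None, freeze_batch_norm=True):
    """Unfreeze the pretrained base (fully or partly) and recompile.

    Hypothetical helper: the name, arguments, and defaults are illustrative.
    """
    # Unfreeze every layer in the base first.
    base.trainable = True

    if unfreeze_last is not None:
        # Re-freeze everything except the last `unfreeze_last` layers.
        cutoff = max(len(base.layers) - unfreeze_last, 0)
        for layer in base.layers[:cutoff]:
            layer.trainable = False

    if freeze_batch_norm:
        # Assumption: keep BatchNormalization layers frozen so they keep
        # using the statistics learned during pretraining.
        for layer in base.layers:
            if isinstance(layer, layers.BatchNormalization):
                layer.trainable = False

    # Changes to `trainable` only take effect after recompiling.
    model.compile(
        optimizer=optimizers.AdamW(learning_rate=learning_rate),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    return model
```

With the model above you would call `prepare_for_fine_tuning(model, base)` once the new head has settled, then run `model.fit` for a few more epochs. Passing `unfreeze_last` restricts fine-tuning to the top of the base, which is a gentler option when data is scarce.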
The same logic drives nearly all modern language work. You start with a pretrained Transformer and adapt it: fine-tune a BERT-style model for classifying your support tickets, or fine-tune a generative model so it answers in your product's voice. The base model brings general language ability; you supply the task.
One of the most common questions when building with large language models is whether to fine-tune a model on your data or to use retrieval-augmented generation, where you keep your data in a searchable store and feed relevant pieces to the model at query time. They solve different problems. Fine-tuning is good for teaching a model a style, a format, or a skill, the kind of thing that shows up across all your examples. Retrieval is the better choice when you need the model to use specific, changing facts, such as the contents of your latest documents, because you can update the store without retraining anything. In many real systems you use both. This trade-off is worked through in detail in the RAG field manual, which is the natural companion to this chapter once you move from theory into building applications.
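To make the retrieval side concrete, here is a deliberately tiny sketch in plain Python. It is a toy under stated assumptions: word overlap stands in for the embedding similarity a real system would likely use, the store is just a list, and the function names and sample facts are invented for illustration. What it shows is the part that matters for the comparison: new knowledge is added by editing the store, not by retraining.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, store, k=2):
    """Return up to k documents ranked by word overlap with the query.

    Assumption: word overlap is a stand-in for a real similarity search.
    """
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in store]
    # Drop documents with no shared words, then put the best matches first.
    matches = [item for item in scored if item[0] > 0]
    ranked = sorted(matches, key=lambda item: item[0], reverse=True)
    return [doc for _, doc in ranked[:k]]


def build_prompt(query, store, k=2):
    """Place the retrieved passages ahead of the question for the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, store, k))
    return (
        "Use the context to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Made-up sample facts, purely for illustration.
store = [
    "The Pro plan costs 20 dollars per month.",
    "Support is available on weekdays from 9 to 5.",
]

# Updating what the system knows is an edit to the store;
# no model weights are touched.
store.append("The Pro plan now includes priority support.")

prompt = build_prompt("What does the Pro plan include?", store)
print(prompt)
```

The resulting prompt would then be sent to whatever language model you are using. A fine-tuned model fits into the same flow: fine-tuning shapes how it answers, while the store decides which facts it sees.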
Whether you train from scratch, fine-tune, or do feature extraction, you will face a set of choices that no model makes for you: how fast to learn, how big to build, how hard to fight overfitting. These are the hyperparameters, and choosing them well is the subject of the final chapter.