friend link https://bit.ly/3yalTV1
“Do I really need to train the full model?” is something I’ve been wondering every single time I go to fine-tune or train any model. In this post I’ll be training only the last layer and seeing how it affects performance.
I decided that for this project I’m going to do things the “proper” way and use the SCIENTIFIC METHOD! to reach a conclusion. I looked online, and the first graphic that caught my eye was from sciencebuddies.org, so that’s what I’m going to be using.
1.) Ask a Question
This one’s simple: “Do I really need to train the full model?” (Note the “I”, as this is mainly an experiment for me, because there are a lot of applications (secret) that I don’t know if I want to go through the effort for (yes, it is a lot of effort for these specific applications, do not ask, I will not tell you why).)
2.) Do Background Research
Believe me, I tried real hard to search for prior results, but either there are none or I really don’t know what to search for. These are the kinds of results I would get while searching.
None of it is really related to the kind of question I’m asking (for reference: “how much of a model has to be trainable?”). Don’t bash me; I’m 99% sure there’s actual professional research on this, but I couldn’t find any in my … 2 minutes of googling.
3.) Construct a Hypothesis
Provable by this experiment:
- Large models training only one layer can reach the same or similar performance as smaller models training every layer.
Not 100% provable with this experiment, but equally important, I feel, for analyzing the data later on:
- A wider network (more connections per layer) will provide a higher chance of having meaningful connections (higher performance).
- A deeper network (more layers) will provide more complex connections, at the cost of reducing the chance of having meaningful (complex) connections.
At a high level, those both just boil down to “make it big, it will do good.”
4.) Test with an Experiment
I use wandb to log results from my sessions of tweaking parameters and running.
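If you haven’t used wandb before, the logging side is only a couple of lines per run. Here’s a minimal sketch; the project name, config values, and metric names are placeholders, not the ones from my actual runs:

```python
import wandb

# start a run; the project name and config values here are just illustrative
wandb.init(project="last-layer-only", config={"depth": 3, "width": 64, "lr": 1e-3})

for epoch in range(10):
    # ... training and evaluation would happen here ...
    wandb.log({"train_loss": 0.0, "test_acc": 0.0})  # log the real metrics each epoch

wandb.finish()
```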
Stage One
- I create a control run, which is just a 3-layer MLP trained to classify MNIST digits.
- I create several experiment runs where I only tune the last layer of the MLP and tweak settings, logging the results.
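For reference, the setup for those experiment runs looks roughly like this in PyTorch. This is a minimal sketch with made-up layer sizes, not my exact training code:

```python
import torch
import torch.nn as nn

# a 3-layer MLP for MNIST (28*28 inputs, 10 classes), 64 features like the control
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# freeze everything, then un-freeze only the final Linear layer
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# the optimizer only ever sees the trainable (last-layer) parameters
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()
```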
Stage Two
- I create more experiment runs, but this time I optimize the last layer as if it were an ELM (extreme learning machine). I feel this would be intriguing to log data from, as it only takes a single step. The downside is that I cannot create a specialized control for this run, but I will still compare it to Stage One’s control run anyway.
4b.) Procedure Working?
The only thing I can’t get working is stage two: I can’t figure out on my own how to create a multi-class ELM, so instead I’m just going to squash the classes between 0 and 1 (0.1 for 1, 0.2 for 2, and so on). This isn’t really representative of the original experiment anymore, but I still would like to see how it performs.
The procedure for testing the ELM idea goes like this:
- create all the layers except the last as a Sequential object in PyTorch
- forward the training data through them
- take the dot product of the pseudoinverse of that result and the labels; this is our “beta” in ELM terms, but it’s also technically our last layer
- forward again at test time, then compute the dot product of that result and beta to get our predictions
Both stages will use torch.nn.CrossEntropyLoss().
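For the curious, the stage two fit looks roughly like this in PyTorch under my squashed-label workaround. This is a minimal sketch; the helper names, layer sizes, and exact squashing are just for illustration, not pulled from my actual notebook:

```python
import torch
import torch.nn as nn

# every layer except the last, left at its random initialization (ELM-style)
hidden = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

def fit_beta(x_train, y_train):
    """Solve for the 'last layer' (beta) in a single step via the pseudoinverse."""
    with torch.no_grad():
        h = hidden(x_train)                  # forward through the frozen layers
        targets = y_train.float() / 10.0     # squash classes: 0.1 for 1, 0.2 for 2, ...
        beta = torch.linalg.pinv(h) @ targets.unsqueeze(1)  # beta = pinv(H) @ T
    return beta

def predict(x, beta):
    with torch.no_grad():
        out = hidden(x) @ beta               # forward again, then dot with beta
        # un-squash back to a class index
        return (out.squeeze(1) * 10.0).round().clamp(0, 9).long()
```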
5.) Analyze Data / Communicate Results
I feel as though my hypothesis is backed up: a wide enough network will reach the same or similar performance while tuning only the last layer (see the run that was automatically titled sleek-bee-10).
For the results of stage two, however, you’ll just have to trust me that I’m not lying (even though they aren’t much); pushing them to wandb felt useless because it’s just one single step, and wandb is all about graphs.
| Depth (control: 3 layers) | Features per layer (control: 64) | Train accuracy | Test accuracy | Train loss | Test loss |
| --- | --- | --- | --- | --- | --- |
| same | same | 79.13% | 13.4% | 1.667 | 1.656 |
| same | 4x | 87.05% | 14.57% | 1.590 | 1.586 |
| 2x | same | 79.20% | 13.4% | 1.669 | 1.655 |
| 2x | 4x | 87.80% | 14.77% | 1.583 | 1.574 |
| 2x | 8x | 90.55% | 15.09% | 1.555 | 1.555 |
| 2x | 16x | 93.76% | 15.6% | 1.523 | 1.525 |
| 4x | 16x | 87.84% | 1.8% | 1.582 | 2.347 |
As for 4x depth / 64x features per layer:
I crashed Google Colab.
The good: these perform amazingly on the training set (the best one outperforming the control, which was at 92.02% accuracy), so in any circumstance where you want to overfit, these are your guys. They also take seconds (mostly) to fit!
The bad: they’re ELMs, so they overfit; you can see that in the terrible test accuracy. But if that’s what you need for something, then definitely have a go at it. There are also ways to make ELMs generalize better, but I’m not going into that now.
I also think all these results back up the secondary hypotheses (though they don’t necessarily prove them), while giving a resounding “maybe” to the primary hypothesis (stage one is a better candidate for that).
El Fin
Things I learned:
- I’m terrible at following any sort of scientific method
- Oh, also that “YES,” I do need backprop if I want things to work in a reasonable time and perform well, but “no..?” in very select circumstances.