Llama-2, Mo’ Lora

maxine
4 min read · Jul 20, 2023

Meta just released LLaMA-2, a transformer trained on 2 TRILLION tokens of natural language data! Many are itching to be among the first to finetune it, including me. So I did! But in a bit of an unconventional manner; let me (or rather, GPT-4 first) explain.

GPT-4 Written Introduction:

Deep learning models have increasingly relied on fine-tuning to achieve stellar performance on many tasks. The recently introduced LoRA (Low-Rank Adaptation) is a fine example of the efficiency and utility of this fine-tuning trend. By decomposing the weight update into two small matrices A and B, LoRA offers a memory- and speed-friendly fine-tuning approach that also fares well with low example counts. However, our newly proposed method, MoLora (Mixture of LoRAs), takes this a step further by employing multiple LoRA adapters, offering further surprising insights into the power and potential of fine-tuning.
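For readers who haven't seen LoRA's internals, here is a minimal sketch of the idea (the shapes and names below are the standard LoRA convention, not code from this project): the pretrained weight W stays frozen, and only the low-rank factors A and B are trained.

```python
import torch

# Minimal sketch of a LoRA update on a single linear layer.
# W is frozen; only the low-rank factors A (r x d_in) and B (d_out x r) train.
d_in, d_out, r, alpha = 4096, 4096, 2, 4

W = torch.randn(d_out, d_in)        # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01     # trainable, initialized small
B = torch.zeros(d_out, r)           # trainable, initialized to zero

def lora_forward(x):
    # Base projection plus the scaled low-rank correction (alpha / r).
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

x = torch.randn(1, d_in)
y = lora_forward(x)                 # same shape as the frozen layer's output
```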

In essence, MoLora applies the principles of LoRA but leverages multiple adapters, each trained on a different 'cluster' within a dataset. These clusters are computed using KMeans, a popular unsupervised learning algorithm. This lays the groundwork for adapting to each individual prompt.
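A sketch of what that clustering step could look like; the embedding model, dataset id, and column name here are assumptions for illustration, not necessarily what was used to build the released dataset.

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Hypothetical sketch: embed each instruction, then KMeans into k clusters.
dataset = load_dataset("WizardLM/WizardLM_evol_instruct_70k", split="train")
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

embeddings = embedder.encode(dataset["instruction"], show_progress_bar=True)

k = 4                                                # number of experts/clusters
kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)

# Attach the cluster id so each LoRA can be trained on its own slice.
dataset = dataset.add_column("cluster", kmeans.labels_.tolist())
```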

The innovative element in MoLora lies in its method of determining the weights of the LoRA adapters. Instead of using static weights, MoLora measures the similarity between a given model prompt and each of the data clusters. This similarity then becomes the basis for each LoRA adapter's mixture weight.
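Continuing from the clustering sketch above, a hedged sketch of how those mixture weights could be computed at inference time: embed the incoming prompt, take cosine similarity to each KMeans centroid, and normalize. The normalization scheme here is my assumption, not necessarily what the notebook does.

```python
import numpy as np

def adapter_weights(prompt, embedder, centroids):
    """Cosine similarity from the prompt embedding to each KMeans centroid,
    normalized into mixture weights (normalization scheme is an assumption)."""
    q = embedder.encode([prompt])[0]
    sims = centroids @ q / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(q) + 1e-8)
    sims = np.clip(sims, 0, None)     # keep weights non-negative
    return sims / sims.sum()

weights = adapter_weights("Describe how Transformers work.", embedder, kmeans.cluster_centers_)
```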

back to crumb

TL;DR: we cluster the dataset with KMeans, train a LoRA-adapted model on each cluster, then combine the adapters based on the cosine similarity between your new prompt and each cluster.
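Here is a sketch of the combination step for a single layer, assuming each adapter exposes its A/B matrices: scale each adapter's low-rank delta by its mixture weight and sum. This weighted-delta merge is my reading of the TL;DR, not the exact notebook code.

```python
import torch

def mix_lora_deltas(lora_adapters, weights, alpha, r):
    """Weighted sum of per-adapter low-rank updates B @ A for one layer.
    `lora_adapters` is a list of (A, B) pairs; `weights` come from the
    cosine-similarity step above."""
    A0, B0 = lora_adapters[0]
    delta = torch.zeros_like(B0 @ A0)
    for (A, B), w in zip(lora_adapters, weights):
        delta += float(w) * (B @ A) * (alpha / r)
    return delta  # add this to the layer's frozen weight W
```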

For this test run I only trained on 2048 examples from each cluster for each LoRA. Each adapter took 40 minutes to train on the free version of Google Colab, using a version of WizardLM's EvolInstruct that I clustered and made public here: crumb/Wizard-EvolInstruct70k-k4 · Datasets at Hugging Face. The adapters used rank=2, a learning rate of 1e-4, and a warmup of 1/10th the steps.
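For reference, a PEFT-style setup matching the numbers quoted above (rank 2, 1e-4 learning rate, 10% warmup); the base-model id, target modules, alpha, batch size, and epoch count are assumptions, not values from the post.

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

lora_config = LoraConfig(
    r=2,                                  # rank quoted in the post
    lora_alpha=4,                         # assumption; not stated in the post
    target_modules=["q_proj", "v_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

args = TrainingArguments(
    output_dir="molora-expert-0",
    learning_rate=1e-4,                   # quoted in the post
    warmup_ratio=0.1,                     # warmup of 1/10th the steps
    per_device_train_batch_size=4,        # assumption
    num_train_epochs=1,                   # assumption
)
```

One of these runs is repeated per cluster, giving the k expert adapters that get mixed at inference time.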

Here’s a comparison between normal Lora and MoLora for this first proof of concept:

Lora-Llama2-7b

Describe how Transformers (Vaswani et al.) work.

How does a Transformer learn to recognize an image of a cat, for example? How does the Transformer learn to recognize an image of a dog? How does the Transformer learn to recognize an image of a person? How does the Transformer learn to recognize an image of a car? How does the Transformer learn to recognize an image of a house? How does the Transformer learn to recognize an image of a street? How does the Transformer learn to recognize an image of a person in a car? How does the Transformer learn to recognize an image of a person in a house? How does the Transformer learn to recognize an image of a person in a street? How does the Transformer learn to recognize an image of a person in a car in a house in a street? How does the Transformer learn to recognize an image of a person in a car in a house in a street in a person in a car in a house in a street? How does the Transformer learn to recognize an image of a person in a car in a house in a street in a person in a car in a house in a street in a person in a car in a house in a street in a person

MoLora-Llama2-7b (4 experts/clusters)

Describe how Transformers (Vaswani et al.) work.

— Transformers are a type of neural network architecture that uses attention mechanisms to process text inputs.

— The architecture consists of two main components: an encoder and a decoder. The encoder takes in a text input and generates a sequence of vectors representing the input. The decoder then takes these vectors and generates a sequence of tokens that represent the text.

— The attention mechanism is the key component of the Transformer architecture. It allows the network to focus on specific words or phrases in the input text and generate more accurate results.

— The encoder consists of several layers of neural networks that take in the input text and generate a sequence of vectors. The attention mechanism then allows the network to focus on specific words or phrases in the input text.

— The decoder then takes these vectors and generates a sequence of tokens that represent the text. The attention mechanism again allows the network to focus on specific words or phrases in the input text and generate more accurate results.

— The Transformer architecture is able to generate more accurate results than traditional neural network architectures because of its attention mechanism. It allows the network to focus on specific words or phrases in the input text and generate more accurate results.

This is a promising method that, once developed further (and oh boy it will be; I have plans for the 70b model and tons of datasets to cluster for it), could lead to incredible results.

You can try out the code for this proof of concept for free in Google Colab: MoLora_7b_(PROOF_OF_CONCEPT).ipynb — Colaboratory (google.com)

This method is not complete, but I am reporting preliminary results to hopefully build some anticipation for the models I'm working on. This version was trained entirely on the free tier of Google Colab, and I have a workstation coming soon to do a much bigger version of this.
