XPapers — Paying ‘Attention’ to your Dataset Mixture

maxine
2 min read · Oct 30, 2023

let A represent a vector where each A_i is # of texts from each domain i, train small on mix A, find every L_i (loss) for test sets for each domain, A=A+sm(L)*A*r where r is an “augmentation rate”.. uhh.. repeat until happy

TweetPapers is a new series of paper ideas where, instead of writing a full paper, I write the idea within the character limit of a Tweet. The source for this XPaper is https://twitter.com/aicrumb/status/1718796468935418094. An extended (and frankly, more readable) version, automatically generated by GPT-4, is included below.

Additional notes from author:

btdubs, with this method, if you have “code” as one of your subsets it will be prioritized heavily, which is something to look out for. You can also replace the second instance of A with a constant vector like [1k, 1k, 1k, 1k, …], so sm(L) gets multiplied by that constant instead of by A, giving roughly linear growth. Using A instead of the constant works like a sort of momentum, though, which is cool. You might also want to schedule the augmentation rate to decrease over time. I also use an augment temperature t, giving A + A*sm(L*t)*r, where t > 1 increases the ‘sharpness’ of the steps.
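To make those variants concrete, here is a minimal sketch (in numpy) of the temperature-sharpened update and the constant-base variant. The function names and the numpy implementation are my own; the post only gives the update formulas.

```python
# Rough sketch of the variants from the notes above, not a reference implementation.
# Assumed symbols: A = per-domain sample counts, L = per-domain losses (numpy arrays),
# r = augmentation rate, t = augment temperature, base = constant vector like [1000, 1000, ...].
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def update_with_temperature(A, L, r, t=2.0):
    # A + A * sm(L * t) * r; t > 1 sharpens the softmax, making the steps more uneven
    return A + A * softmax(L * t) * r

def update_linear(A, L, r, base):
    # A + base * sm(L) * r; a constant base instead of A, so growth does not compound
    return A + base * softmax(L) * r
```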

Title: A Structured Approach to Dataset Mixture Optimization: A Tutorial

In this tutorial, we present a methodical approach to optimize the mixture of datasets during training by utilizing the respective domain losses to guide the augmentation. This process is aimed at iteratively enhancing the model’s performance across various domains.

Let’s formalize the steps:

1. Initialization:
  • Define a vector A, where each element A[i] represents the number of samples drawn from domain i.
  • Define an augmentation rate r, a hyperparameter that controls how strongly A is adjusted in each iteration.

2. Training Cycle:
  • Model Training: Train the model using the current dataset mixture specified by vector A.
  • Loss Calculation: Compute the loss on each domain’s test set to obtain a loss vector L, where L[i] is the loss on domain i.
  • Softmax Normalization: Apply the softmax function to L to get a normalized vector SM(L), where SM(L)[i] = exp(L[i]) / (sum over j of exp(L[j])) across the n domains.
  • Dataset Mixture Update: Update the dataset mixture as A = A + (SM(L) * A) * r.
  • Repeat Model Training, Loss Calculation, Softmax Normalization, and Dataset Mixture Update until a stopping criterion is met (e.g., a fixed number of iterations, convergence of A, or satisfactory model performance).

Through this iterative procedure, the dataset mixture is dynamically adjusted based on the model’s performance on each domain, with the aim of allocating more samples to the domains where the model experiences higher loss, thereby promoting balanced improvement across all domains.
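As a minimal end-to-end sketch of the procedure, here is what the loop might look like in numpy. The training and evaluation calls (train_small_model, sample_mixture, eval_loss_per_domain) are hypothetical stand-ins for your own code and are only shown as comments; the loss values below are placeholders for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def update_mixture(A, L, r):
    """One dataset-mixture update: A = A + (softmax(L) * A) * r."""
    return A + softmax(L) * A * r

# Three domains (e.g. web, books, code), starting with equal sample counts.
A = np.array([10_000.0, 10_000.0, 10_000.0])
r = 0.1  # augmentation rate

for step in range(5):
    # model = train_small_model(sample_mixture(A))   # hypothetical: train a small model on mix A
    # L = eval_loss_per_domain(model)                # hypothetical: per-domain test losses
    L = np.array([2.1, 2.7, 3.4])                    # placeholder losses for illustration
    A = update_mixture(A, L, r)
    print(step, np.round(A).astype(int))
```

Because softmax weights the highest-loss domains most heavily, the domain with loss 3.4 grows fastest in this toy run, which is exactly the “pay attention to what the model is worst at” behavior described above.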
