Meta’s LLaMA and Llama 2 have been game changers in the LLM space. People thought models couldn’t get smaller and still perform on par with their larger counterparts. Now, given a small Llama, people can tweak it and make it smaller still, so small that others are questioning how it is even possible. The latest, TinyLlama, is perhaps the most ambitious of all, and it may even break the rules of scaling laws.
Peiyuan Zhang, a research assistant at the University of Singapore, has begun training a 1.1 billion parameter model named TinyLlama. Based on Llama 2, the ambitious part of this project is that Peiyuan aims to pre-train it on 3 trillion tokens! The plan is to achieve this in 90 days using only 16 A100-40G GPUs, at a throughput of roughly 24k tokens per second per GPU. For comparison, the estimated cost of running this training on AWS would be around $40,000.
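The schedule can be sanity-checked with back-of-the-envelope arithmetic, using only the figures stated above (3 trillion tokens, 90 days, 16 GPUs):

```python
# Back-of-the-envelope check of the TinyLlama training schedule,
# using the figures quoted in the article.
total_tokens = 3e12          # pre-training target
days = 90
gpus = 16

seconds = days * 24 * 3600   # 7,776,000 seconds in 90 days
tokens_per_sec_total = total_tokens / seconds
tokens_per_sec_per_gpu = tokens_per_sec_total / gpus

print(f"{tokens_per_sec_total:,.0f} tokens/s across the cluster")
print(f"{tokens_per_sec_per_gpu:,.0f} tokens/s per GPU")
# ~385,802 tokens/s in total, ~24,113 tokens/s per A100 —
# consistent with the 24k-per-GPU figure
```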
If it works, the model will set a new benchmark and enable applications on devices with limited computing resources: with only 1.1 billion weights, it could run in roughly 550MB of RAM. But people are a bit skeptical about the project.
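The 550MB figure follows from the parameter count once a precision is chosen; it matches 4-bit quantization, which is an assumption here rather than something the project guarantees:

```python
# Rough memory footprint of a 1.1B-parameter model at common precisions.
# The article's 550MB figure corresponds to 4-bit weights (an assumption).
params = 1.1e9

for name, bytes_per_param in [("fp32", 4), ("fp16", 2),
                              ("int8", 1), ("4-bit", 0.5)]:
    mb = params * bytes_per_param / 1e6
    print(f"{name}: {mb:,.0f} MB")
# fp32: 4,400 MB / fp16: 2,200 MB / int8: 1,100 MB / 4-bit: 550 MB
```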
Chinchilla steps in
The dataset of 3 trillion tokens is a mix of 70% SlimPajama and 30% Starcoderdata. “What would pre-training a 1.1 billion model for such a long time achieve?” asked a user on Hacker News. “Isn’t this against the Chinchilla scaling law?”
The Chinchilla scaling law basically states that to train a transformer-based language model compute-optimally, the number of parameters and the number of training tokens should be scaled in roughly equal proportion.
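In practice, the Chinchilla result is often quoted as a rule of thumb of roughly 20 training tokens per parameter (an approximation, not an exact constant from the paper). Under that assumption, TinyLlama’s plan overshoots the compute-optimal point by a wide margin:

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
# The 20x ratio is an approximation commonly cited from Hoffmann et al. (2022).
TOKENS_PER_PARAM = 20

params = 1.1e9
optimal_tokens = TOKENS_PER_PARAM * params   # ~22 billion tokens
planned_tokens = 3e12                        # TinyLlama's target

print(f"Chinchilla-optimal: ~{optimal_tokens / 1e9:.0f}B tokens")
print(f"Planned: {planned_tokens / 1e12:.0f}T tokens "
      f"({planned_tokens / optimal_tokens:.0f}x the optimal point)")
# ~22B optimal vs 3T planned — about 136x past the compute-optimal point
```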
For models as large as GPT or PaLM, the saturation point may come much later, since they can keep absorbing training data for longer and thus outperform others. According to OpenAI, “We expect that larger models should always perform better than smaller models.” The company believes that any fixed-size model will be capacity-limited.
In other words, since smaller models have fewer parameters, and thus fewer matrix multiplications, they run and train faster. But according to this theory, these models eventually reach the limits of what they can learn, and their progress slows. For example, training a 7 billion parameter model on 2 trillion tokens may still be better than training a 1 billion parameter model on 3 trillion tokens.
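The two options in that example can be compared on total training compute using the common approximation C ≈ 6·N·D FLOPs (N parameters, D tokens), which is a standard estimate rather than anything specific to these projects:

```python
# Compare total training compute with the standard approximation
# C ~ 6 * N * D FLOPs, where N = parameters and D = training tokens.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

big = train_flops(7e9, 2e12)     # 7B model on 2T tokens
tiny = train_flops(1.1e9, 3e12)  # 1.1B model on 3T tokens (TinyLlama)

print(f"7B @ 2T:   {big:.2e} FLOPs")
print(f"1.1B @ 3T: {tiny:.2e} FLOPs")
print(f"ratio: {big / tiny:.1f}x")
# the 7B run still costs about 4.2x more compute than the TinyLlama run
```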
This is the question hanging over TinyLlama. Would it even be reasonable to pre-train the model on 3 trillion tokens if there is a saturation point? To many observers, 3 trillion tokens is far too much for a 1.1 billion parameter model. But that is exactly the point of the experiment.
But Llama does not agree
There is an ongoing debate about whether bigger models are always better, and Meta, with Llama, keeps trying to prove it wrong. According to the Llama 2 paper, “We see that after pretraining on 2 trillion tokens, the models still show no sign of saturation.” This led Peiyuan to suggest that training a model on 3 trillion tokens might still be a reasonable idea.
This begs the question: if Meta believes the Chinchilla scaling law is becoming outdated, why hasn’t the company trained Llama 2 beyond 2 trillion tokens and released further updates to the model? The only reason could be that the expected benefit would be too small for the company to actually get anything out of it.
Or maybe the next Llama will be even smaller and trained the same way, on a larger number of tokens. While Meta lets its open-source community test these limits, it may be doing the same behind closed doors.
Conventional wisdom says small models must reach a limit on how much information they can absorb; this project aims to prove otherwise. While we wait and check progress during training, it will be interesting to see whether TinyLlama actually beats the Chinchilla scaling law. As of the first checkpoint, TinyLlama is already competitive with StableLM-Alpha-3B and Pythia-1B.
If this is achieved, running AI models on single devices will be a major achievement. If not, Chinchilla wins. According to Peiyuan, “I have no idea. This is an open trial that makes no promises or targets. The only target is ‘1.1B at 3T’”.