The Rise of Small Language Models
Large language models work well because they’re so large. The latest models from OpenAI, Meta, and DeepSeek use hundreds of billions of “parameters”—the adjustable knobs that determine connections among data and get tweaked during the training process. With more parameters, the models are better able to identify patterns and connections, which in turn makes them more powerful and accurate.
The Cost of Large Models
But this power comes at a cost. Training a model with hundreds of billions of parameters takes huge computational resources. To train its Gemini 1.0 Ultra model, for example, Google reportedly spent $191 million. Large language models (LLMs) also require considerable computational power each time they answer a request, which makes them notorious energy hogs. A single query to ChatGPT consumes about 10 times as much energy as a single Google search, according to the Electric Power Research Institute.
The Emergence of Small Language Models
In response, some researchers are now thinking small. IBM, Google, Microsoft, and OpenAI have all recently released small language models (SLMs) that use a few billion parameters, a small fraction of what their LLM counterparts use.
The Advantages of Small Models
Small models are not used as general-purpose tools like their larger cousins. But they can excel at specific, more narrowly defined tasks, such as summarizing conversations, answering patient questions as a health care chatbot, and gathering data in smart devices. “For a lot of tasks, an 8 billion–parameter model is actually pretty good,” said Zico Kolter, a computer scientist at Carnegie Mellon University. Small models can also run on a laptop or cell phone instead of in a huge data center.
Optimizing Training
To optimize the training process for these small models, researchers use a few tricks. Large models are typically trained on raw data scraped from the internet, which can be disorganized, messy, and hard to process. But a large model trained this way can then generate a high-quality data set that is used to train a small model. The approach, called knowledge distillation, gets the larger model to effectively pass on its training, like a teacher giving lessons to a student. “The reason [SLMs] get so good with such small models and such little data is that they use high-quality data instead of the messy stuff,” Kolter said.
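The passage above describes the data-generation flavor of distillation, where the large model produces a clean training set for the smaller one. A closely related, classic formulation instead has the student model learn to match the teacher's output distributions directly. The sketch below illustrates that soft-label variant in PyTorch; the tiny teacher and student networks, the temperature value, and the random token batch are all stand-ins for illustration, not real language models or training data.

```python
# Minimal sketch of soft-label knowledge distillation (illustrative only).
# "teacher" and "student" stand in for a large and a small language model;
# they are tiny toy networks here so the example runs end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 100
teacher = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
student = nn.Sequential(nn.Embedding(vocab_size, 8), nn.Linear(8, vocab_size))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's output distribution

tokens = torch.randint(0, vocab_size, (16,))  # stand-in for real token IDs

with torch.no_grad():
    teacher_logits = teacher(tokens)  # the teacher's "lessons"

student_logits = student(tokens)

# KL divergence between the softened teacher and student distributions
optimizer.zero_grad()
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2
loss.backward()
optimizer.step()
```

In practice the student would see many batches of real text, but the core idea is the same: the small model is trained against the large model's outputs rather than against raw internet data.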
Pruning and Fine-Tuning
Researchers have also explored ways to create small models by starting with large ones and trimming them down. One method, known as pruning, entails removing unnecessary or inefficient parts of a neural network, the sprawling web of connected nodes that underlies a large model.
Pruning: A Real-Life Inspiration
Pruning was inspired by a real-life neural network, the human brain, which gains efficiency by snipping away synaptic connections between neurons as a person ages. Today’s pruning approaches trace back to a 1989 paper in which the computer scientist Yann LeCun, now at Meta, argued that up to 90 percent of the parameters in a trained neural network could be removed without sacrificing accuracy. He called the method “optimal brain damage.”
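As a concrete illustration, PyTorch ships utilities that zero out the lowest-magnitude weights in a layer. The toy sketch below prunes 90 percent of a single stand-in linear layer, echoing the figure above; it is a sketch of the general idea, not a recipe for pruning a production language model.

```python
# Minimal sketch of magnitude pruning with PyTorch's built-in utilities.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)  # stand-in for one layer of a larger network

# Zero out the 90% of weights with the smallest absolute values.
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of weights pruned: {sparsity:.0%}")

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")
```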
Fine-Tuning for Specific Tasks
Pruning can help researchers tailor a small language model to a particular task or environment. For example, a model trained on medical records could be pruned down and then fine-tuned to focus on specific diagnoses or treatments, as in the sketch below.
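A minimal sketch of what that fine-tuning step could look like, assuming a generic PyTorch model that has already been shrunk; the toy network, the random token IDs, and the four-way labels are placeholders, not real medical data.

```python
# Minimal sketch of fine-tuning a (pruned) small model on task-specific data.
import torch
import torch.nn as nn

# Toy stand-in for a pruned small model: 8-token inputs, 4 task labels
model = nn.Sequential(nn.Embedding(1000, 16), nn.Flatten(), nn.Linear(16 * 8, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy "domain" batch: sequences of 8 token IDs with one of 4 task labels
inputs = torch.randint(0, 1000, (32, 8))
labels = torch.randint(0, 4, (32,))

for _ in range(3):  # a few passes over the task-specific data
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
```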
Conclusion
The rise of small language models offers a new approach to natural language processing. Large models will continue to be useful for general-purpose applications, but small models can excel at specific, targeted tasks while being cheaper and less energy-intensive to train and run. As researchers continue to develop and refine small models, we can expect to see new applications and innovations in the field.
FAQs
Q: What is the difference between large language models and small language models?
A: Large language models use hundreds of billions of parameters, while small language models use a few billion parameters. Small models are designed for specific, targeted tasks and can be more energy-efficient and cost-effective to train.
Q: What are the advantages of small language models?
A: Small language models can excel at specific tasks, require less computational power and energy, and can run on a laptop or cell phone. They can also be fine-tuned for specific tasks or environments.
Q: How do small language models train?
A: Small language models can be trained on high-quality data sets generated by large models, a process called knowledge distillation. They can also be created by pruning a large model, removing unnecessary or inefficient parts of its network.
Q: What are the limitations of small language models?
A: Small language models are not designed for general-purpose applications and may not be as powerful or accurate as large models. They are best suited for specific, targeted tasks and may require fine-tuning for specific environments or tasks.