The Lifecycle of a Large Language Model
March 27, 2024
Introduction
When most people use the term large language model (LLM) today, they are referring to a specific type of machine learning model called a transformer. Transformers don't have to be applied to human language, but it is their application in chatbots that led to the mainstream popularity of artificial intelligence as we know it today.
This way of interacting with models through human language undoubtedly opens the door to new technological solutions, ranging from programming assistance tools to recipe generators that suggest a recipe based on the ingredients you have on hand.
The thought of creating or customizing an LLM for a novel application can seem daunting, but the good news is that many tools and techniques exist to reduce the barrier to entry for creating custom LLMs. This document will serve as a guide and roadmap for putting an LLM to work in a custom application.
Developing a custom model architecture can be a costly process, and there isn't much reason to go down this path unless you have very specific needs or are an artificial intelligence researcher. This guide therefore focuses on using existing model architectures to serve the needs of a new application.
Starting Point: Do You Need Your Own Model?
While it's tempting to immediately start working towards your own customized model, we do need to ask if it's a worthwhile endeavour. There are a host of high-performance, publicly available models on huggingface that have been pre-trained for specific tasks. Many of these models also feature permissive licenses that allow commercial use, such as the Mistral-7B-Instruct model, which is capable of question answering, text summarization and many other instruction-following tasks. Closed models such as OpenAI's GPT can even be accessed through a simple HTTP API without the need to host the models yourself. If time to market is the main concern, it's hard to beat integrating with a simple API to gain the capabilities of ChatGPT within your product.
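To make the integration route concrete, here is a minimal sketch of calling a hosted model through OpenAI's official Python client. The model name and prompts are placeholders, and you'd need an API key configured in your environment.

```python
# pip install openai
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable by default.
client = OpenAI()

# Ask a hosted, general-purpose model to perform a task for our product.
# The model name below is only an example; pick whichever hosted model fits.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You summarize customer support tickets."},
        {"role": "user", "content": "Summarize: 'My order arrived damaged and ...'"},
    ],
)

print(response.choices[0].message.content)
```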
There are, of course, drawbacks to this approach. When you use an HTTP API for an off-the-shelf model, you are limited in how much customization you can apply; typically the furthest you can go is editing the prompt you provide to the model. Keep in mind that while the barrier to entry of such a solution is low for you, it is equally low for any potential competitors. So if your entire product revolves around the capabilities of an AI model, relying on such a solution offers little competitive advantage. Conversely, if you are simply augmenting an existing product, such as adding AI-assisted question answering on top of an existing documentation space, then a pre-made model behind an HTTP API is an appealing route: your main product has already differentiated itself on its own merits.
Modern large language models are expensive in terms of development time and the hardware needed to train them and host them for inference. While using someone else's HTTP API avoids a large up-front investment, the pay-per-token billing model can lead to high running costs if you expect to process a lot of tokens, for example when your use case involves frequently processing long passages of text. In that case, it is likely more cost effective in the long run to fine-tune and host your own model.
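A rough back-of-envelope calculation can make this trade-off concrete. The prices and traffic figures below are purely hypothetical placeholders rather than any provider's actual rates:

```python
# Hypothetical figures -- substitute your provider's real pricing and your own traffic.
price_per_1k_input_tokens = 0.0005   # USD, hypothetical
price_per_1k_output_tokens = 0.0015  # USD, hypothetical

requests_per_month = 500_000
avg_input_tokens = 3_000   # long passages of text per request
avg_output_tokens = 300

monthly_cost = requests_per_month * (
    avg_input_tokens / 1_000 * price_per_1k_input_tokens
    + avg_output_tokens / 1_000 * price_per_1k_output_tokens
)
print(f"Estimated monthly API cost: ${monthly_cost:,.2f}")
# Compare this figure against the monthly cost of renting a GPU instance to
# host a fine-tuned model of your own before committing to either route.
```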
Fine-tuning a model for your specific use case is almost always superior to picking a generalized off-the-shelf one. If you limit the problem domain of your model, you can usually also get away with a much smaller model that is cheaper to train and host; there is no need to carry the weights of a full conversational chatbot when all you need to do is classify some text. If your application would be well served by an existing model behind an HTTP API, then there is no need to progress any further in this article. If you've elected to customize your own model, read on.
Model Selection
Task
While generalist models exist, they may not offer the best performance for your particular application, and they are typically larger and more expensive to deal with. It is therefore preferable to find a model that is best suited to your task.
For example, if your application involves summarizing documents, start with a model pre-trained for text summarization. This lets you take advantage of the work someone else has put into training a model for that specific task; you then only need to fine-tune it on a smaller set of data specific to your domain. As an example, a model trained specifically for medical text summarization can be found on huggingface. If your application involves summarizing similar technical documents containing medical terms, that model would be a good fit.
If there are no models available for your specific use case, a good place to start is any model capable of arbitrary text-to-text generation. Such models can take any input text prompt and generate the output text you desire, but using one does mean you will need to fine-tune it until it produces responses in the format you expect. A common text-to-text model with reasonable performance across multiple tasks is Google's T5.
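As a minimal sketch of trying such a model out, assuming the huggingface transformers library and the publicly available t5-small checkpoint:

```python
# pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 frames every task as text-to-text; the task is indicated by a prefix.
prompt = ("summarize: Large language models are transformer networks trained "
          "on vast amounts of text to predict the next token in a sequence ...")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```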
Benchmarks
Once you've decided on the specific task you'll be addressing, it can be useful to look at the performance of existing models for the given task. There are common benchmarks that assess the performance of models in particular tasks. These benchmark results are often presented as part of the model description on sites such as huggingface.
Some common benchmarks include:
- MMLU: Massive Multitask Language Understanding is a set of tests designed to measure the accuracy of a model in answering questions across subjects such as history, mathematics and computer science. It evaluates how good a model is at being a generalist with knowledge of a wide variety of problem domains.
- CommonsenseQA: This benchmark tests how well a model can answer multiple-choice questions that rely on everyday commonsense knowledge rather than specialist facts. The intention is to evaluate a model's commonsense reasoning about the world rather than its recall of niche subject matter.
- MATH: As the name suggests, this is a benchmark solely focused on solving mathematical problems. Many of these problems require multi-step solutions, so it evaluates how well a model can break a mathematics problem down and solve it in a series of steps.
There are many other benchmarks readily available. The idea is to find a benchmark that represents the problem you want to solve, and then find a model type that performs well on that benchmark.
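If you want a feel for what a benchmark actually asks of a model, the test sets are generally downloadable from huggingface. A small sketch, assuming the datasets library and the cais/mmlu dataset (the subject name below is just one example, and field names may vary between mirrors):

```python
# pip install datasets
from datasets import load_dataset

# Load one MMLU subject; "computer_security" is just an example subject name.
mmlu = load_dataset("cais/mmlu", "computer_security", split="test")

sample = mmlu[0]
print("Question:", sample["question"])
for letter, choice in zip("ABCD", sample["choices"]):
    print(f"  {letter}. {choice}")
print("Correct answer index:", sample["answer"])
```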
Parameter Size & Hardware Requirements
Another factor to consider when picking a model is the hardware required to train it and run inference on it. Large models in the 70-billion-parameter range can be expected to have the best benchmark scores; however, these models often require expensive multi-GPU setups for training and inference.
It is important to balance the performance of your model with your budget for local hardware or rented cloud GPU instances.
In AI applications, the limiting factor of compute devices tends to be the amount of memory available to store all the weights of the model.
Huggingface provides an estimator tool to help you understand how much memory is required to train a model. For example, training Google's gemma-2b model with the float16 datatype would require at least 18.76GB of memory. This disqualifies the vast majority of consumer-grade GPUs and requires either high-end GPUs or rented datacentre-grade GPU instances with large amounts of memory. In practice, training often needs a few gigabytes more than the estimator reports due to the storage of training data batches and other structures.
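If you only need a quick sanity check before reaching for the estimator, a commonly quoted rule of thumb is that training with an Adam-style optimizer needs roughly four times the memory of the raw weights in your chosen datatype, plus headroom for activations and batches. A sketch of that heuristic, where the multiplier and parameter count are assumptions rather than exact figures:

```python
def rough_training_memory_gb(num_parameters: float, bytes_per_param: int = 2,
                             overhead_multiplier: float = 4.0) -> float:
    """Very rough estimate covering weights, gradients and optimizer states.

    bytes_per_param=2 corresponds to float16; the 4x multiplier is a commonly
    quoted rule of thumb for Adam-style training, not a guarantee.
    """
    return num_parameters * bytes_per_param * overhead_multiplier / 1e9

# Roughly 2.5 billion parameters, an assumed figure for a "2B-class" model.
print(f"~{rough_training_memory_gb(2.5e9):.1f} GB before activation/batch overhead")
```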
Fine Tuning
Full Parameter Fine Tuning
When it comes to customizing a model for the best performance on a downstream task, full parameter fine tuning yields the best results. It is effectively a continuation of the initial training process, but with data that is specific to your use case, and all of the model's trainable weights are updated.
The primary drawback of this approach is that it is time consuming and often requires high-end hardware. Very large models, such as those exceeding 7 billion trainable parameters, often require the training process to be carried out across multiple GPUs. All of these factors add up to higher costs and complexity when it comes to full parameter fine tuning.
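For reference, a minimal sketch of full parameter fine tuning using the transformers Trainer API. The base model, dataset and hyperparameters are placeholders you would swap for your own task:

```python
# pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # example base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Example dataset; substitute your own domain-specific data.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./full-finetune",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,  # every weight in the model receives gradient updates
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()
```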
Parameter Efficient Fine Tuning
A more accessible approach to customizing models is to freeze a portion of the trainable weights and update only a subset of them. This method typically requires far fewer computing resources to perform customization within a reasonable timeframe. It can be a suitable option for hobbyists, or where a high degree of customization simply isn't required.
An example of parameter efficient fine tuning is adapter learning, where a few conventional feed-forward neural network layers with trainable weights are added between the layers of a large language model. During the training process, only these new layers are trained, with the existing weights frozen.
The other, more popular, form of this technique is low-rank adaptation (LoRA), where relatively small trainable weight matrices are inserted into the transformer layers to approximate the weight updates that would occur during full parameter fine tuning. This drastically reduces the compute and memory requirements of training at the expense of some customization and prediction quality.
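A minimal sketch of low-rank adaptation using the peft library. The base model, rank and other settings below are illustrative defaults rather than recommendations:

```python
# pip install peft transformers torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # example base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor applied to the update
    lora_dropout=0.05,
)

# Wrap the base model: the original weights are frozen, and only the small
# low-rank adapter matrices receive gradients during training.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how tiny the trainable fraction is
```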
Reinforcement Learning from Human Feedback
The first round of model fine tuning often yields good performance for a given use case. However, certain subtleties in model behaviour may not have been adequately captured by the dataset, and there are cases where the model produces output that a human would find non-ideal in some way, for example a recipe that is too difficult to follow in a normal person's kitchen.
We therefore perform a second round of training to address these issues. This process is known as reinforcement learning from human feedback (RLHF), and it is usually carried out after the model has been tested by numerous users and data on that usage has been gathered.
During this process, we need 3 inputs:
- Input prompts that were provided to the model during user testing
- Responses that the model produced in response to those prompts
- A reward score, typically a number, that a user provided as a rating of the model's response
The reward scores can also be produced by another model that has been trained on user scoring; this technique is typically used to accelerate the reinforcement learning process. Alternatively, a large dataset of human-provided feedback can be used directly.
During the training process, two instances of your model are used: one that is trainable, and another that isn't. The non-trainable reference model is important to ensure that the trainable model's behaviour doesn't take an unexpected turn by optimizing solely for the reward function. It is often the case in reinforcement learning that models learn to exploit strange behaviours to maximize rewards rather than solve the problem at hand; using two models in this way mitigates that issue.
Input-output pairs are then used to train the model and update its weights based on the provided rewards, where the rewards themselves may be stored in the dataset or computed by another model.
We can see that up to three models may be running at the same time during this process, which makes RLHF extremely memory and compute intensive. It has also traditionally been complex to implement, but thankfully libraries such as TRL have been created to greatly simplify the process of setting up RLHF.
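A highly simplified sketch of a single RLHF step with TRL's PPOTrainer is shown below. The PPOTrainer interface has changed between TRL releases, so treat this as an outline of the moving parts (a trainable policy, a frozen reference model, and a reward) rather than copy-paste code:

```python
# pip install trl transformers torch
# Note: follows the older quickstart-style PPOTrainer API; details vary by TRL version.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder; in practice this is your fine-tuned model

# Trainable policy model plus a frozen reference copy that keeps the policy
# from drifting too far while it chases the reward signal.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)

# One toy RLHF step: a logged user prompt, a generated response, and a reward
# score (hard-coded here, standing in for human or reward-model feedback).
query_tensor = tokenizer.encode("Suggest a quick weeknight dinner recipe.",
                                return_tensors="pt")[0]
response_tensor = ppo_trainer.generate(query_tensor, return_prompt=False,
                                       max_new_tokens=40,
                                       pad_token_id=tokenizer.eos_token_id)[0]
reward = torch.tensor(1.0)

stats = ppo_trainer.step([query_tensor], [response_tensor], [reward])
```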
It's also worth mentioning Nvidia's SteerLM, a technique introduced to mitigate some of the issues with RLHF. It uses multi-dimensional feedback to steer a model's responses towards optimizing specific dimensions of that feedback. It still requires significant computational power, however, and comes with a degree of vendor lock-in to Nvidia's tools and model formats.
Deployment
Once you're ready to put your model in front of your target audience, it's time to pick a hosting environment.
If your use case demands on-premises hardware or a highly customized application stack, then a managed model hosting solution is unlikely to be useful to you.
However, for the vast majority of use cases, managed model hosting is going to be the most appealing option due to its ease of use. One such option is the hosted inference API offered by huggingface. Many other cloud providers, such as RunPod, AWS and Azure, offer similar services where they manage the security and infrastructure needed to host your model behind a simple HTTP API. These services come at a fee, of course.
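To illustrate how lightweight the integration becomes once a model is hosted this way, here is a sketch using the huggingface_hub client. The model identifier is a placeholder for wherever your own model ends up being served:

```python
# pip install huggingface_hub
from huggingface_hub import InferenceClient

# Point the client at your hosted model; the identifier below is a placeholder.
client = InferenceClient(model="your-org/your-fine-tuned-model")

summary = client.text_generation(
    "Summarize the following support ticket: ...",
    max_new_tokens=100,
)
print(summary)
```

If you deploy with a different provider, the shape of the call is much the same: send a prompt over HTTP and receive generated text back.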
There are also serverless inference API options, such as those offered by AWS SageMaker, where you pay only for the compute time that was consumed by your model. This option is well suited to infrequent or unpredictable workloads. For constant or predictable traffic, however, the cost profile of serverless inference is usually much worse than regular monthly provisioned resources.
Finally, you may also quantize your model after training to reduce the compute resources required to run inference on it. This can lead to improved performance and reduced cost. Quantization typically involves reducing the precision of the model's weights: many models use 32-bit or 16-bit floating point numbers as their weights, and quantizing these to a lower precision such as 8-bit floating point, or even 8-bit integer, can lead to drastic speed increases. It does come at a cost of model accuracy and prediction quality, but the drop in these metrics is almost never proportional to the performance gained. For example, you might quantize a 16-bit model to 8-bit, halving its memory footprint, while the quality of its output is only reduced by around 10% in common use cases. Quantization is especially effective on very large models.
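As a sketch of one common way to do this at load time, assuming the transformers and bitsandbytes libraries, a CUDA-capable GPU, and a placeholder model name:

```python
# pip install transformers accelerate bitsandbytes torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "your-org/your-fine-tuned-model"  # placeholder

# Load the trained weights directly in 8-bit integer precision, roughly
# halving memory use compared to float16 at a small cost in output quality.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```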