Should you fine-tune your LLM?

August 25, 2024

Fine-tuning is when you take a general LLM and train it further on domain-specific datasets so that its performance in this domain improves. Fine-tuning is a supervised technique: you give the model pairs of inputs and desired outputs.
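
For illustration, here is a minimal sketch of what such training pairs can look like; the customer-support domain and the wording of the examples are made up.

```python
# A minimal sketch of supervised fine-tuning data: each example pairs an
# input with the desired output. Domain and wording here are hypothetical.
training_pairs = [
    {
        "input": "Customer: My order hasn't arrived yet. What can I do?",
        "output": "I'm sorry to hear that. Could you share your order number "
                  "so I can check the shipping status?",
    },
    {
        "input": "Customer: How do I reset my password?",
        "output": "You can reset it under Settings > Account > Reset password.",
    },
]
```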

The term "fine-tuning" became popular with LLMs, but in machine learning the underlying technique has traditionally been called "transfer learning". The idea is that rather than building a new model from scratch, you reduce training effort by "standing on the shoulders" of an existing model.

So, should you fine-tune your LLM if you want it to do well on a specific task?

Not necessarily.

Fine-tuning pitfalls

Fine-tuning can increase hallucination

Fine-tuning can backfire and make LLM hallucinations worse.

Riley Goodside from Scale AI explains this in a video from the Cognitive Revolution series with Nathan Labenz. The tl;dr (loosely paraphrasing Riley's argument):

"Finetuning on new factual information will make the model more likely to hallucinate—not just for the new information, but generally. This is because you're telling the model, 'new things, or things that look weird to you, given your pretraining, can be OK sometimes.'"

No good training data available

You absolutely need good training data to fine-tune an LLM. Otherwise you will end up with a garbage-in, garbage-out situation.

The most dangerous situation is probably not when you lack good training data and know it. It's when you think you have good training data, but you don't.

It ain't what you don't know...

Vendor lock-in (if the LLM is not open source)

When you fine-tune a model to which you don't have full access (= my simplified definition of "open source" here), you are not in control of your model. What happens if the vendor changes the underlying model? Increases prices? Changes the terms of use?

Not enough machine learning expertise

Even though some people seem to give the impression that fine-tuning is easy, it does require substantial machine learning expertise. Fine-tuning involves many moving parts, such as:

  • Do you fine-tune the whole model, or only specific parts? For example, layer-wise fine-tuning seems to be a promising approach (see the sketch after this list).
  • Which fine-tuning strategy do you use? Lakera has a good guide on this.
  • Which training hyperparameters do you use?
  • What are your LLM performance benchmarks for success? To be fair though, this question always applies, not just to fine-tuning.
  • How do you set up the compute infrastructure cost-effectively?
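
To make the first point concrete, here is a minimal sketch of layer-wise fine-tuning with the Hugging Face transformers library: freeze everything, then unfreeze only the last transformer blocks. The model choice and the number of unfrozen blocks are illustrative assumptions, not recommendations.

```python
# A minimal sketch of layer-wise (partial) fine-tuning: freeze all weights,
# then unfreeze only the last transformer blocks. Model choice ("gpt2") and
# the number of unfrozen blocks (2) are illustrative assumptions.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two transformer blocks.
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")
```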

Alternatives to fine-tuning

Because of the effort that's required to fine-tune an LLM, it's typically best to first exhaust all other options. For example:

Few-shot prompting

Few-shot prompting is a technique where you put examples or instructions directly in the prompt. This is not just orders of magnitude easier than fine-tuning; it is also more flexible, which matters if your task or domain changes.
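
As a sketch, a few-shot prompt can be as simple as a string template; the sentiment task and the example reviews below are made up.

```python
# A minimal few-shot prompting sketch: the "training examples" live in the
# prompt itself. The sentiment task and reviews are hypothetical.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day, great purchase."
Sentiment: positive

Review: "Stopped working after a week."
Sentiment: negative

Review: "{review}"
Sentiment:"""

prompt = few_shot_prompt.format(review="Setup was quick and painless.")
# Send `prompt` to any LLM API. Changing the task means editing a string,
# not retraining a model.
```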

RAG (retrieval-augmented generation) and tool calling

If the goal is to give an LLM access to knowledge, RAG and tool calling typically make more sense than fine-tuning. They are easier to control and modify, and they can access real-time data. At least currently, it is not possible to fine-tune an LLM with real-time data.
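
Here is a deliberately tiny RAG sketch; the word-overlap "retriever" is a stand-in for a real embedding model plus vector index, and the documents and question are made up.

```python
# A toy RAG sketch: retrieve the most relevant snippets, then put them into
# the prompt. Word-overlap scoring stands in for embeddings plus a vector
# index; documents and question are hypothetical.
documents = [
    "Return policy: items can be returned within 30 days of delivery.",
    "Shipping: standard delivery takes 3 to 5 business days.",
    "Warranty: hardware is covered for 24 months.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

question = "How long do I have to return an item?"
context = "\n".join(retrieve(question, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# Send `prompt` to the LLM. Updating knowledge means updating `documents`,
# not retraining the model.
```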

Divide and conquer

Can you divide your task into simpler subtasks, ideally such that not every subtask even requires an LLM? One advantage of this is that you get a system architecture that is much easier to debug, enhance, and scale.
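
As a sketch, a hypothetical support-ticket pipeline might look like this; `call_llm` is a placeholder for whatever model API you use, and only one of the two subtasks needs an LLM at all.

```python
# A sketch of dividing a task into subtasks: extracting an email address is
# deterministic code, only the summary needs an LLM. Pipeline and helper
# names are hypothetical.
import re

def call_llm(prompt: str) -> str:
    # Placeholder: wire this up to your LLM provider of choice.
    raise NotImplementedError

def extract_email(ticket: str) -> str | None:
    # Deterministic subtask: a regex, trivially testable, no LLM involved.
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", ticket)
    return match.group(0) if match else None

def process_ticket(ticket: str) -> dict:
    # Each step can be debugged, swapped, and scaled independently.
    summary = call_llm(f"Summarize this support ticket in one sentence:\n{ticket}")
    return {"email": extract_email(ticket), "summary": summary}
```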

Somebody else has done it for you

Perhaps the fine-tuned LLM you are looking for has already been built? Check out Hugging Face: there are currently over 800,000 models available, and one of them might be just what you are looking for.
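
You can also search programmatically with the huggingface_hub client; the search term below is just an example.

```python
# A quick check whether a suitable fine-tuned model already exists on the
# Hugging Face Hub. The search term is just an example.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(search="finance sentiment", sort="downloads",
                             direction=-1, limit=5):
    print(model.id)
```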

Examples where fine-tuning is a good idea

My goal here is not to argue against fine-tuning. That's why I list some examples below where I think fine-tuning makes sense.

You want to build a coding assistant for a proprietary programming language

Many, if not most, software engineers now use some form of AI assistance, for example Codium, Cursor, or v0.

What if you work in a programming language that is proprietary, i.e. only accessible within your company? Such programming languages are not uncommon in industrial control systems, for example.

In this case, it is likely (but not certain) that off-the-shelf tools like Cursor, Codium, or v0 will not work.

You need a small, affordable model

If you have a task that requires very high throughput (lots of tokens), using a smaller LLM can be a great option, or even necessary for cost reasons. In this case, fine-tuning helps you optimize the LLM for your specific task. Fine-tuning can then even give you an LLM that outperforms a larger, non-fine-tuned model.

You need a very fast model

Small(er), fine-tuned, specialized models can be faster than large general-purpose ones. For example, Cursor (the code editor) uses a fine-tuned model for code editing.

You have to be in control of your model

If your domain is highly regulated, or if you work in a classified environment, you'll likely have no choice but to use a model that you control. And not only that: such scenarios often also differ substantially from non-regulated, non-classified environments, so you'll get better performance if you fine-tune. In other words, this can be a similar scenario to the proprietary programming language example above.