Explaining in simple terms the core technologies required to start developing LLM-based applications.
The purpose of this article is to explain in simple terms the key technologies necessary to start developing LLM-based applications. It is intended for software developers, data scientists and AI enthusiasts who have a basic understanding of machine learning concepts and want to dive deeper. The article also provides numerous useful links for further study. It’s going to be interesting!
1. Introduction to Large Language Models (LLMs)
I think you’ve already heard a thousand times about what an LLM is, so I won’t overload you with it. All we need to know is: a Large Language Model (LLM) is a LARGE neural network model that predicts the next token based on the tokens that came before it. That’s all.
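If you want to see this next-token prediction in action, here is a minimal sketch using the HuggingFace transformers library (gpt2 is just a small stand-in model here, not a recommendation; any causal LM works the same way):

# A minimal sketch of next-token prediction with a small causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models predict the next", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probabilities over the vocabulary for the very next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top5 = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top5.values, top5.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob:.3f}")

The model only ever answers one question: given everything so far, which token is most likely next? Everything else, from chat to code completion, is built on top of repeating this step.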
The popularity of LLMs is due to their versatility and effectiveness. They handle tasks such as translation, summarization, and semantic analysis very well.
Some examples of projects using LLMs:
- Notion AI — helps improve writing quality, generate content, correct spelling and grammar, edit voice and intonation, translate, and more.
- GitHub Copilot — improves your code by offering autocomplete-style suggestions.
- Dropbox Dash — provides a natural-language search functionality, and also specifically cites which files the answer is derived from.
If you want a detailed understanding of how LLMs work, I highly recommend reading the excellent article “A Very Gentle Introduction to Large Language Models without the Hype” by Mark Riedl.
2. Open Source vs Closed Source Models
While there are quite a few differences, I highlight the following as the main ones:
- Privacy — one of the most important reasons why large companies choose self-hosted solutions.
- Fast prototyping — great for small startups to quickly test their ideas without excessive expenditure.
- Quality of generation — either you fine-tune the model for your specific task or use a paid API.
There is no definitive answer to what is better or worse, so I’ve highlighted the points above as the main considerations.
If you are interested in delving deeper into the details, I suggest you read my article “You don’t need hosted LLMs, do you?”.
Popular Open Source models
Popular Closed Source models
Explore the LLM Collection to view all models.
3. The Art of Prompt Engineering
I know, I know, many consider it a pseudo-science or just a temporary hype. But the truth is, we still don’t fully understand how LLMs work. Why do they sometimes provide high-quality responses and other times fabricate facts (hallucinate)? Or why does adding “let’s think step-by-step” to a prompt suddenly improve the quality?
Due to all this, scientists and enthusiasts can only experiment with different prompts, trying to make models perform better.
I won’t bore you with complex prompt chains; instead, I’ll just give a few examples that will instantly improve performance:
- “Let’s think step by step” — works great for reasoning or logical tasks.
- “Take a deep breath and work on this problem step-by-step” — an improved version of the previous point. It can add a few more percentage points of quality.
- “This is very important to my career” — just add it to the end of your prompt and you’ll notice a 5–20% improvement in quality.
Also, I’ll share a useful prompt template right away:
Let’s combine our X command and clear thinking to quickly and accurately decipher the answer in the step-by-step approach. Provide details and include sources in the answer. This is very important to my career.
Where X is the industry of the task you are solving, for example, programming.
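As a quick illustration, here is a minimal sketch of plugging this template into a chat completion call (it assumes the openai>=1.0 Python client, an OPENAI_API_KEY in your environment, and a made-up question):

# A minimal sketch of using the prompt template above with the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

industry = "programming"  # the "X" from the template
question = "Why does my Python loop never terminate?"  # illustrative question

prompt = (
    f"Let's combine our {industry} command and clear thinking to quickly and "
    "accurately decipher the answer in the step-by-step approach. "
    "Provide details and include sources in the answer. "
    "This is very important to my career.\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)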
I highly recommend spending a few evenings exploring prompt engineering techniques. This will not only allow you to better control the model’s behavior but will also help improve quality and reduce hallucinations. For this, I recommend reading the Prompt Engineering Guide.
Useful Links:
- prompttools — prompt testing and experimentation, with support for LLMs (e.g. OpenAI, LLaMA).
- promptfoo — testing and evaluating LLM output quality.
- Awesome ChatGPT Prompts — A collection of prompt examples to be used with the ChatGPT model.
4. Incorporating New Data: Retrieval Augmented Generation (RAG)
RAG is a technique that combines an LLM with external knowledge bases. This gives the model access to relevant information or domain-specific data that was not included in its original training set.
Despite the intimidating name (sometimes we add the word “reranker” to it), it’s actually a pretty old and surprisingly simple technique:
- You convert documents into numerical vectors, which we call embeddings.
- Then, you convert the user’s search query into an embedding using the same model.
- Find the top-K closest documents, usually based on cosine similarity.
- Ask the LLM to generate a response based on these documents (a minimal sketch of these steps follows this list).
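To make these four steps concrete, here is a minimal sketch that uses sentence-transformers for the embeddings and plain cosine similarity for retrieval (the model name, documents, and final prompt are all illustrative, not part of any particular framework):

# A minimal sketch of the retrieval part of RAG.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our office is closed on public holidays.",
    "The boss's birthday is on March 3rd.",
    "Expense reports are due by the 5th of each month.",
]

# 1-2. Embed the documents and the query with the same model.
doc_vectors = model.encode(documents, normalize_embeddings=True)
query_vector = model.encode("When is my boss's birthday?", normalize_embeddings=True)

# 3. Cosine similarity (a dot product, since the vectors are normalized) -> top-K documents.
scores = doc_vectors @ query_vector
top_k = np.argsort(scores)[::-1][:2]
context = "\n".join(documents[i] for i in top_k)

# 4. Ask the LLM to answer based only on the retrieved context.
prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: When is my boss's birthday?"
print(prompt)  # pass this prompt to the LLM of your choice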
When to Use
- Need for Current Information: When the application requires information that is constantly updating, like news articles.
- Domain-Specific Applications: For applications that require specialized knowledge outside the LLM’s training data. For example, internal company documents.
When NOT to Use
- General Conversational Applications: Where the information needs to be general and doesn’t require additional data.
- Limited Resource Scenarios: The retrieval component of RAG involves searching through large knowledge bases, which can be computationally expensive and slow — though still faster and less expensive than fine-tuning.
Building an Application with RAG
A great starting point is using the LlamaIndex library. It allows you to quickly connect your data to LLMs. For this you only need a few lines of code:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# 1. Load your documents:
documents = SimpleDirectoryReader("YOUR_DATA").load_data()

# 2. Convert them to vectors:
index = VectorStoreIndex.from_documents(documents)

# 3. Ask the question:
query_engine = index.as_query_engine()
response = query_engine.query("When's my boss's birthday?")
print(response)
In real-world applications, things are noticeably more complex. As in any development work, you’ll encounter many nuances: the retrieved documents might not always be relevant to the question, or retrieval might be too slow. However, even at this stage, you can significantly improve the quality of your search system.
What to Read & Useful Links
- Building RAG-based LLM Applications for Production — an excellent detailed article about the main components of RAG.
- Why Your RAG Is Not Reliable in a Production Environment — a great article by Ahmed Besbes that explains in clear language the difficulties that can arise when using RAG.
- 7 Query Strategies for Navigating Knowledge Graphs With LlamaIndex — an informative article from Wenqi Glantz that takes a detailed and nuanced look at building a RAG pipeline using LlamaIndex.
- OpenAI Retrieval tool — use OpenAI’s RAG if you want minimal effort.
5. Fine-Tuning Your LLM
Fine-tuning is the process of continuing the training of a pre-trained LLM on a specific dataset. You might ask why we need to train the model further if we can already add data using RAG. The simple answer is that only fine-tuning can tailor your model to understand a specific domain or define its style. For instance, I created a copy of myself by fine-tuning on my personal correspondence.
Okay, if I’ve convinced you of its importance, let’s see how it works (spoiler — it’s not so difficult):
- Take a pre-trained LLM, sometimes called a base LLM. You can download one from HuggingFace.
- Prepare your training data. You only need to compile instructions and responses. Here’s an example of such a dataset, and a minimal sketch of the record format follows this list. You can also generate synthetic data using GPT-4.
- Choose a suitable fine-tuning method. LoRA and QLoRA are currently popular.
- Fine-tune the model on the new data.
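For the data-preparation step, here is a minimal sketch of what an Alpaca-style instruction/response record can look like (the field names follow the Alpaca dataset; the content itself is made up):

# A minimal sketch of an instruction-tuning dataset in the Alpaca format.
import json

records = [
    {
        "instruction": "Summarize the following meeting notes in two sentences.",
        "input": "We discussed the Q3 roadmap and agreed to ship the search feature first...",
        "output": "The team agreed to prioritize the search feature for Q3...",
    },
    # ...add a few thousand of these, written by hand or generated with GPT-4.
]

with open("my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)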
When to Use
- Niche Applications: When the application deals with specialized or unconventional topics. For example, legal document applications that need to understand and handle legal jargon.
- Custom Language Styles: For applications requiring a specific tone or style. For example, creating an AI character whether it’s a celebrity or a character from a book.
When NOT to Use
- Broad Applications: Where the scope of the application is general and doesn’t require specialized knowledge.
- Limited Data: Fine-tuning requires a significant amount of relevant data. However, you can always generate it with another LLM. For example, the Alpaca dataset of 52K LLM-generated instruction-response pairs was used to create the first fine-tuned Llama v1 model (Alpaca) earlier this year.
Fine-tune your LLM
You can find a vast number of articles dedicated to model fine-tuning. Just on Medium alone, there are thousands. Therefore, I don’t want to delve too deeply into this topic and will show you a high-level library, Lit-GPT, which hides all the magic inside. Yes, it doesn’t allow for much customization of the training process, but you can quickly conduct experiments and get initial results. You’ll need just a few lines of code:
# 1. Download the model:
python scripts/download.py --repo_id meta-llama/Llama-2-7b

# 2. Convert the checkpoint to the lit-gpt format:
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/llama

# 3. Generate an instruction tuning dataset:
python scripts/prepare_alpaca.py  # it should be your dataset

# 4. Run the finetuning script:
python finetune/lora.py --checkpoint_dir checkpoints/llama/ --data_dir your_data_folder/ --out_dir my_finetuned_model/
And that’s it! Your training process will start.
Be aware that the process can take a long time: fine-tuning Falcon-7B on a single A100 GPU takes approximately 10 hours and 30 GB of GPU memory.
Of course, I’ve slightly oversimplified, and we’ve only scratched the surface. In reality, the fine-tuning process is much more complex and to get better results, you’ll need to understand various adapters, their parameters, and much more. However, even after such a simple iteration, you will have a new model that follows your instructions.
What to Read & Useful Links
- Create a Clone of Yourself With a Fine-tuned LLM — my article where I wrote about collecting datasets, used parameters, and gave useful tips on fine-tuning.
- Understanding Parameter-Efficient Fine-tuning of Large Language Models — an excellent tutorial if you want to get into the details of the concept of fine-tuning and popular parameter-efficient alternatives.
- Fine-tuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments — one of my favorite articles for understanding the capabilities of LoRA.
- OpenAI Fine-tuning — if you want to fine-tune GPT-3.5 with minimal effort.
6. Deploying Your LLM Application
Sometimes, all we want is to simply push a “deploy” button…
Fortunately, this is quite feasible. There are a huge number of frameworks that specialize in deploying large language models. What makes them so good?
- Lots of pre-built wrappers and integrations.
- A vast selection of available models.
- A multitude of internal optimizations.
- Rapid prototyping.
Choosing the Right Framework
The choice of framework for deploying an LLM application depends on various factors, including the size of the model, the scalability requirements of the application, and the deployment environment. Currently, there isn’t a vast diversity of frameworks, so it shouldn’t be too difficult to understand their differences. Below, I’ve prepared a cheat sheet for you that will help you quickly get started:
Also, in my article “7 Frameworks for Serving LLMs” I provide a more detailed overview of the existing solutions. I recommend checking it out if you’re planning to deploy your model.
Example Code for Deployment
Let’s move from theory to practice and try to deploy LLaMA-2 using Text Generation Inference. And, as you might have guessed, you’ll need just a few lines of code:
# 1. Create a folder where your model will be stored:
mkdir data
volume=$PWD/data

# 2. Run the Docker container (launch the RestAPI service):
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id meta-llama/Llama-2-7b

# 3. And now you can make requests:
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"Tell me a joke!","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
That’s it! You’ve set up a RestAPI service with built-in logging, Prometheus endpoint for monitoring, token streaming, and your model is fully optimized. Isn’t this magical?
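If you prefer to call the service from Python rather than curl, here is a minimal sketch that mirrors the request above (it only assumes the requests library):

# A minimal sketch of querying the deployed /generate endpoint from Python.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "Tell me a joke!", "parameters": {"max_new_tokens": 20}},
    timeout=60,
)
print(response.json()["generated_text"])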
What to Read & Useful Links
- 7 Frameworks for Serving LLMs — comprehensive guide into LLMs inference and serving with detailed comparison.
- Inference Endpoints — a product from HuggingFace that will allow you to deploy any LLMs in a few clicks. A good choice when you need rapid prototyping.
7. What Remains Behind the Scenes
Even though we’ve covered the main concepts needed for developing LLM-based applications, there are still some aspects you’ll likely encounter in the future. So, I’d like to leave a few useful links:
Optimization
When you launch your first model, you’ll inevitably find that it’s not as fast as you’d like and that it consumes a lot of resources. If that’s the case for you, you need to understand how it can be optimized.
- 7 Ways To Speed Up Inference of Your Hosted LLMs — techniques to speed up inference of LLMs to increase token generation speed and reduce memory consumption.
- Optimizing Memory Usage for Training LLMs in PyTorch — an article that provides a series of techniques that can reduce memory consumption in PyTorch by approximately 20x without sacrificing modeling performance and prediction accuracy.
Evaluating
Suppose you have a fine-tuned model. But how can you be sure that its quality has improved? What metrics should we use?
- All about evaluating Large language models — a good overview article about benchmarks and metrics.
- evals — the most popular framework for evaluating LLMs and LLM systems.
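If you just want a feel for what the simplest possible evaluation looks like before adopting a framework, here is a minimal sketch of an exact-match check over a tiny test set (the questions, answers, and ask_llm stub are all placeholders for your own data and model call):

# A minimal sketch of an exact-match evaluation loop.
test_set = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]

def ask_llm(question: str) -> str:
    # Placeholder: replace with a call to your fine-tuned model.
    return "Paris" if "France" in question else "7"

correct = 0
for example in test_set:
    prediction = ask_llm(example["question"])
    if prediction.strip().lower() == example["answer"].lower():
        correct += 1

print(f"Exact-match accuracy: {correct / len(test_set):.2%}")

Real benchmarks are, of course, fuzzier than exact match, which is exactly what the links above cover.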
Vector Databases
If you work with RAG, at some point you’ll move from storing vectors in memory to a database. For this, it’s important to understand what’s currently on the market and the limitations of each option.
- All You Need to Know about Vector Databases — a step-by-step guide by Dominik Polzer to discover and harness the power of vector databases.
- Picking a vector database: a comparison and guide for 2023 — comparison of Pinecone, Weaviate, Milvus, Qdrant, Chroma, Elasticsearch and pgvector databases.
LLM Agents
In my opinion, the most promising development in LLMs. If you want multiple models to work together, I recommend exploring the following links.
- A Survey on LLM-based Autonomous Agents — this is probably the most comprehensive overview of LLM based agents.
- autogen — is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.
- OpenAgents — an open platform for using and hosting language agents in the wild.
Reinforcement Learning from Human Feedback (RLHF)
As soon as you give users access to your model, you take on responsibility. What if it responds rudely? Or reveals bomb-making ingredients? To avoid this, check out these articles:
- Illustrating Reinforcement Learning from Human Feedback (RLHF) — an overview article that details the RLHF technology.
- RL4LMs — a modular RL library to fine-tune language models to human preferences.
- TRL — a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step.
Conclusion
Despite the hype, which we’re all a bit tired of, LLMs will be with us for a long time, and the ability to understand their stack and write simple applications can give you a significant boost. I hope I’ve managed to immerse you a bit in this area and show you that there is nothing complicated or scary about it.
Thank you for your attention, stay tuned for new articles!
Disclaimer: The information in the article is current as of November 2023, but please be aware that changes may occur thereafter.
Unless otherwise noted, all images are by the author.
If you have any questions or suggestions, feel free to connect on LinkedIn.