
So You Think You Need An LLM?

by Dennis

There is little question that Large Language Models (LLMs) have dominated the AI conversation in recent months. Most companies, regardless of size, have made at least some effort to develop their own LLMs, or at the very least to understand how these models can help them in their work. In this article, we will look at some of the criteria you should consider when evaluating an LLM project at your organization.

Here’s one of my favorite sayings that I use when someone approaches me about training to become a data scientist:

Everyone wants to look like a bodybuilder, but no one wants to lift the heavy weights!

I want you to keep this in mind as we go through this article. I want you to realize that the finished models and polished demos you've been pitched are like a row of bodybuilders standing on the show stage in the final round: you are only seeing the end result of a lot of work! Likewise, it's going to require a lot of work on your part to build your own LLM for your company. Is it possible? Yes. Will it be easy? Absolutely not.

The Art of the Possible

The first thing we need to consider is if you need an LLM to begin with or not. Keep the following in mind:

Are you using public data or internal data? If the answer is public (or documents that can otherwise be obtained outside of your organization), then you should probably not proceed. Whether you realize it or not, there is a massive race among start-ups right now to gather up as many documents as possible. Any public repository you can think of is currently being (or has already been) raided by people looking to put the data into an LLM. Hold off on your project and monitor your email for vendor demo invitations; I assure you someone will come along (very) soon with the tool you have in mind.

Do you have enough documents? Make no mistake about it, it takes a lot of documents to properly train an LLM. You need to determine (1) whether you have enough documents, and (2) whether those documents are structured in a way that lets you extract and use the data appropriately.

In some cases you will find that you don't have anywhere near the number of documents you need. In others you will find that the documents are a nightmare to work with and the data is difficult to extract. Maybe the model is viable but the ROI is not, because the preparation phase will kill your budget.

Custom LLM Meta Steps

So you think you have enough data and resources to proceed with your own model? That's good. The overall approach I want you to keep in mind is this: you are taking unstructured data (the natural state of most text data), adding some structure back to it, and using that structured data to train another model so that it can provide answers to unstructured questions.

Did I lose you? If I did, that's OK; it's just how I want you to think about the overall process. After all, you still need a training set, and training sets for LLMs still tend to have some structure to them.

Note that in this section I'm relying on some of the guidance put out by OpenAI. Their website contains an excellent example of creating an LLM that can answer questions about the Tokyo Olympics, and walking through that example may also help you.

Here are the meta steps that I recommend:

  • Break up the text data into some semblance of a Title / Section / Content structure. Each one of those should be a column in your (structured) training data.
  • Ensure that each Content element is an appropriate length. Following OpenAI's guidance, Content that is less than 40 words is unlikely to be useful in training a model.
  • Employ a model to help you come up with possible Questions (or Prompts) from this data. You can choose to add a human validation step after this one, to confirm that the Prompts make sense.
  • Employ a model to help you come up with possible Answers from the Questions and the data. At this point you will absolutely need a human validation stage to make sure that your Question and Answer pairings make sense. We haven't done any LLM training yet; this is all setup for the training cycles, which is why human validation matters at this stage. If this training data is not good, that will be reflected in your resulting model. (A minimal code sketch of these preparation steps follows this list.)
  • In the last meta step, you will use your chosen LLM tool to create the model that your users will ultimately use. There are several ways this happens, most commonly by fine-tuning a pre-trained base model on your validated Question and Answer pairs, so I will leave it to you to research the specifics further. (A sketch of this step follows the compute note below.)
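To make the first four steps more concrete, here is a minimal sketch in Python. Everything specific in it is an illustrative assumption rather than part of OpenAI's guidance: the column names, the 40-word cutoff, the gpt-4o-mini model choice, the prompt wording, and the file names. It simply shows the shape of the work: a structured table of sections, filtered for length, with a model drafting candidate Questions and Answers that a human reviewer then validates.

```python
# Minimal sketch of meta steps 1-4: structure the text, filter short Content,
# and draft candidate Question/Answer pairs for later human validation.
# Column names, word-count cutoff, model name, and prompts are illustrative assumptions.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: Title / Section / Content as columns of a structured training table
docs = pd.DataFrame([
    {"Title": "Employee Handbook", "Section": "Vacation Policy",
     "Content": "Full-time employees accrue 1.5 vacation days per month ..."},
    # ... one row per extracted section of your documents
])

# Step 2: drop Content that is too short to be useful (40 words used here as the cutoff)
docs = docs[docs["Content"].str.split().str.len() >= 40].reset_index(drop=True)

def ask_model(prompt: str) -> str:
    """Send a single prompt to the model and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Steps 3 and 4: draft a candidate Question and Answer for each Content row.
# Every pair still needs human review before it goes anywhere near training.
qa_rows = []
for _, row in docs.iterrows():
    question = ask_model(
        f"Write one question a reader might ask about this text:\n\n{row['Content']}"
    )
    answer = ask_model(
        f"Answer the question using only this text.\n\nText: {row['Content']}\n\nQuestion: {question}"
    )
    qa_rows.append({"Title": row["Title"], "Section": row["Section"],
                    "Question": question, "Answer": answer})

pd.DataFrame(qa_rows).to_csv("candidate_qa_pairs.csv", index=False)  # hand this to reviewers
```

The point is not the specific calls but the shape of the output: a table of Question and Answer pairs tied back to their source sections, ready for a human to approve or reject before any training begins.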

Please note that the last three steps above are very processing intensive. You are asking some very large neural networks to do a lot of work for you. Make sure you take the time to scale out your compute resources, and to budget for the cost, before proceeding with these steps. It can get expensive very quickly!
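If your chosen LLM tool happens to be OpenAI's hosted fine-tuning service, the final meta step might look roughly like the sketch below: convert the validated Question/Answer pairs into the JSONL chat format, upload the file, and start a fine-tuning job. The file names, the base model, and the system message are illustrative assumptions; other tools (or self-hosted training) will look quite different.

```python
# Minimal sketch of the final meta step, assuming OpenAI's hosted fine-tuning service.
# File names, the base model, and the system message are illustrative assumptions.
import json
import pandas as pd
from openai import OpenAI

client = OpenAI()

# Validated pairs coming out of the human-review stage
qa = pd.read_csv("validated_qa_pairs.csv")

# Convert to the JSONL chat format expected by the fine-tuning endpoint
with open("training_data.jsonl", "w") as f:
    for _, row in qa.iterrows():
        record = {"messages": [
            {"role": "system", "content": "You answer questions about our internal documents."},
            {"role": "user", "content": row["Question"]},
            {"role": "assistant", "content": row["Answer"]},
        ]}
        f.write(json.dumps(record) + "\n")

# Upload the file and kick off the fine-tuning job (this is the expensive part)
training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative; check which base models can currently be fine-tuned
)
print(client.fine_tuning.jobs.retrieve(job.id).status)  # e.g. "validating_files" or "running"
```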

From here, all that remains is your standard data science process of developing and testing new iterations of your model. The trick is that these models don't usually have a quantitative score for you to rely on; the output just has to feel "right" to your users. That creates an ambiguous finish line, and where that line sits depends on your particular project and data.
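One way to put a rough number on that "feels right" judgment is to have reviewers score the same set of prompts against each model version and compare the results. The sketch below assumes a simple 1-to-5 rating collected in a spreadsheet; the file name, column names, and scale are all illustrative assumptions.

```python
# Minimal sketch of comparing model versions with human ratings (1-5 scale assumed).
# The ratings file and its column names are illustrative assumptions.
import pandas as pd

# One row per (model_version, prompt, reviewer) with the reviewer's rating
ratings = pd.read_csv("review_ratings.csv")  # columns: model_version, prompt, reviewer, rating

# Average rating per model version, plus how many prompts scored poorly
summary = ratings.groupby("model_version")["rating"].agg(["mean", "count"])
low_scores = (ratings[ratings["rating"] <= 2]
              .groupby("model_version")["prompt"].nunique()
              .rename("prompts_rated_2_or_lower"))

print(summary.join(low_scores).sort_values("mean", ascending=False))
```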

Closing Notes

I am not trying to be intentionally discouraging in this article. If you have been discouraged, I’m OK with that. These models require a lot of data and preparation of that data. And then the generative side requires a lot of processing power that doesn’t come cheap. Please make sure that you evaluate your projects carefully and proceed accordingly.
