Introduction to RAG

Retrieval Augmented Generation, commonly referred to as RAG, is one of the most important application patterns for AI large language models today.

RAG addresses the problem of incomplete and outdated knowledge in large language models. Once a model's training is complete, the knowledge it contains no longer changes. If the training data lacks information on a certain topic, the model may generate plausible-sounding but confused content and easily answer questions incorrectly. For internal enterprise applications this problem is almost guaranteed, because the base model cannot see private enterprise data during training. Likewise, data created after training cannot be used by the model.

One solution to this problem is to fine-tune the base model on the internal data. But fine-tuning is not simple: it requires specialized expertise, substantial resources, and time. Nor does it solve the problem of using new data. Some applications update their data very frequently, and you cannot fine-tune a large model every time data is created or modified.

A second solution is function calling: you describe custom functions to the large language model, and the model can ask your application to invoke them, connecting it to external systems through their APIs.
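
As a rough illustration of function calling, the sketch below uses the OpenAI Python SDK; the `get_order_status` function and its parameters are hypothetical stand-ins for a real enterprise API:

```python
# A minimal function-calling sketch using an OpenAI-style chat API.
# get_order_status is a hypothetical stand-in for a real enterprise API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "description": "The order ID."}
            },
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is order A1234?"}],
    tools=tools,
)

# If the model decides to call the function, it returns the name and
# JSON-encoded arguments; your code runs the real API call and sends
# the result back to the model in a follow-up message.
print(response.choices[0].message.tool_calls)
```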

The last solution is Retrieval Augmented Generation. The principle behind it is actually very simple: since the large language model is missing some information, providing that information as part of the input naturally solves the problem. Because models have context windows of limited size, the supplied context cannot be too long, so you need to find the content most relevant to the input and use it as the context in order to get accurate results.
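
To make this concrete, here is a minimal sketch of folding retrieved documents into a prompt; the template wording is an illustrative assumption, not a standard format:

```python
# A minimal sketch of assembling retrieved documents into a prompt.
def build_prompt(question: str, context_docs: list[str]) -> str:
    # Join the retrieved documents into a single context block.
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```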

[Figure: RAG]

In a RAG implementation, a vector database stores all reference documents. Given an input, documents semantically similar to the input text are first retrieved from the vector database, then combined with the original input text to assemble the final prompt sent to the model. Since the prompt now contains enough contextual information, the model can produce a reasonable output based on it.
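
Putting these steps together, here is a minimal end-to-end sketch. It assumes Chroma as the vector database and an OpenAI-style chat model, both purely illustrative choices, and reuses the `build_prompt` helper from the earlier sketch:

```python
# A minimal end-to-end RAG flow, assuming Chroma as the vector database
# and an OpenAI-style chat model; both choices are illustrative.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
collection = chroma.get_or_create_collection("reference_docs")
llm = OpenAI()

def answer(question: str) -> str:
    # 1. Retrieve the documents most semantically similar to the question.
    results = collection.query(query_texts=[question], n_results=3)
    retrieved_docs = results["documents"][0]

    # 2. Assemble the final prompt from the retrieved context and the
    #    original question (build_prompt is defined in the earlier sketch).
    prompt = build_prompt(question, retrieved_docs)

    # 3. The model answers using the context supplied in the prompt.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Any vector database and model with comparable APIs would fit the same three-step structure.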

Finding documents with similar content requires text embedding, which converts a piece of text into a vector; this is also the reason for using vector databases. After all reference documents are embedded, the resulting vectors are stored in the vector database.

The input query text goes through the same embedding process to produce a vector, so the similarity between the query text and a reference document becomes the similarity between their vectors, which can be measured by the distance between them: the shorter the distance, the more similar the corresponding texts. You don't need to worry about the distance calculation details when implementing an application; you can simply use the API provided by the vector database.
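
The sketch below makes the distance idea concrete by embedding a query and two candidate documents and comparing cosine distances; it assumes the OpenAI embeddings API, but any embedding model behaves the same way:

```python
# A minimal sketch of measuring text similarity via embedding vectors.
# Assumes the OpenAI embeddings API; the model name is illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return np.array(response.data[0].embedding)

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Smaller distance means more similar texts.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("How do I reset my password?")
doc_a = embed("Steps to recover a forgotten account password.")
doc_b = embed("Quarterly revenue report for 2023.")

print(cosine_distance(query, doc_a))  # expected to be smaller
print(cosine_distance(query, doc_b))  # expected to be larger
```

In a real application this comparison happens inside the vector database's query call, as in the end-to-end sketch above.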

The RAG process above does not include document ingestion. In real applications, data from enterprise systems usually has to be imported into the vector database, and when that data changes, the changes must be synchronized to the vector database as well.
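
As a rough sketch of ingestion, continuing the Chroma example above (the fixed-size chunking and ID scheme are simplifying assumptions; production pipelines typically split on semantic boundaries and track source-system keys):

```python
# A minimal ingestion sketch for the Chroma collection defined earlier.
# Fixed-size chunking and the ID scheme are simplifying assumptions.
def ingest(doc_id: str, text: str, chunk_size: int = 500) -> None:
    # Split the document into fixed-size chunks so each chunk stays
    # well within the embedding model's input limit.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    collection.upsert(
        ids=[f"{doc_id}-{n}" for n in range(len(chunks))],
        documents=chunks,
    )
```

Reusing stable IDs means that when a source record changes, re-ingesting it overwrites the stale vectors instead of duplicating them.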