is the preferred pattern for our teams to improve the quality of responses generated by a large language model (LLM). We’ve successfully used it in many projects, including the . With RAG, information about relevant and trustworthy documents is stored in a database. For a given prompt, the database is queried, relevant documents are retrieved and the prompt augmented with the content of the documents, thus providing richer context to the LLM. This results in higher quality output and greatly reduced hallucinations. The context window — which determines the maximum size of the LLM input — has grown significantly with newer models, but selecting the most relevant documents is still a crucial step. Our experience indicates that a carefully constructed smaller context can yield better results than a broad and large context. Using a large context is also slower and more expensive. We used to rely solely on embeddings stored in a vector database to identify additional context. Now, we're seeing reranking and hybrid search: search tools such as Elasticsearch Relevance Engine as well as approaches like that utilize knowledge graphs created with the help of an LLM. A graph-based approach has worked particularly well in our work on understanding legacy codebases with GenAI.
is the preferred pattern for our teams to improve the quality of responses generated by a large language model (LLM). We’ve successfully used it in several projects, including the popular . With RAG, information about relevant and trustworthy documents — in formats like HTML and PDF — are stored in databases that supports a vector data type or efficient document search, such as pgvector, Qdrant or Elasticsearch Relevance Engine. For a given prompt, the database is queried to retrieve relevant documents, which are then combined with the prompt to provide richer context to the LLM. This results in higher quality output and greatly reduced hallucinations. The context window — which determines the maximum size of the LLM input — is limited, which means that selecting the most relevant documents is crucial. We improve the relevancy of the content that is added to the prompt by reranking. Similarly, the documents are usually too large to calculate an embedding, which means they must be split into smaller chunks. This is often a difficult problem, and one approach is to have the chunks overlap to a certain extent.
is a technique to combine pretrained parametric and nonparametric memory for language generation. It enables you to augment the existing knowledge of pretrained LLMs with the private and contextual knowledge of your domain or industry. With RAG, you first retrieve a set of relevant documents from the nonparametric memory (usually via a similarity search from a vector data store) and then use the parametric memory of LLMs to generate output that is consistent with the retrieved documents. We find RAG to be an effective technique for a variety of knowledge intensive NLP tasks — including question answering, summarization and story generation.