- Updated: October 1, 2024
Enhancing LLMs with External Knowledge: A Framework by Microsoft Researchers
Enhancing LLMs with External Knowledge: A Framework for Building Effective Data-Augmented Applications
Large language models (LLMs) have changed the way we interact with artificial intelligence, but much of their practical value depends on their ability to use external knowledge beyond their training data. This is especially important for enterprise applications, where domain-specific and customer-specific knowledge is essential. OpenAI’s ChatGPT integration has demonstrated the power of retrieval-augmented generation (RAG), a technique that allows LLMs to incorporate external data into their responses. However, as Microsoft researchers have recently highlighted, simple RAG techniques are not sufficient for many real-world scenarios.
Categorizing RAG Tasks
In a recent paper, Microsoft researchers propose a framework for categorizing RAG tasks by the type of external data a query requires and the kind of reasoning involved. The framework defines four levels of queries:
- Explicit facts: Queries that require retrieving explicitly stated facts from the data.
- Implicit facts: Queries that require inferring information not explicitly stated in the data, often involving basic reasoning or common sense.
- Interpretable rationales: Queries that require understanding and applying domain-specific rationales or rules that are explicitly provided in external resources.
- Hidden rationales: Queries that require uncovering and leveraging implicit domain-specific reasoning methods or strategies that are not explicitly described in the data.
Each level presents unique challenges and requires specific solutions to effectively address them.
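One way to keep the four levels straight when routing queries in an application is to encode them as an enumeration. The sketch below is only an illustration: the names `QueryLevel`, `TECHNIQUES`, and `techniques_for` are assumptions made for this example, and the technique lists simply summarize the approaches this article mentions for each level.

```python
from enum import IntEnum


class QueryLevel(IntEnum):
    """The four query levels, ordered from simplest to hardest."""
    EXPLICIT_FACT = 1
    IMPLICIT_FACT = 2
    INTERPRETABLE_RATIONALE = 3
    HIDDEN_RATIONALE = 4


# Approaches named in this article for each level (a summary, not an exhaustive list).
TECHNIQUES = {
    QueryLevel.EXPLICIT_FACT: ["basic RAG"],
    QueryLevel.IMPLICIT_FACT: ["IRCoT", "RAT", "knowledge graphs"],
    QueryLevel.INTERPRETABLE_RATIONALE: ["prompt tuning", "OPRO", "Automate-CoT"],
    QueryLevel.HIDDEN_RATIONALE: ["in-context learning", "few-shot rationales", "fine-tuning"],
}


def techniques_for(level):
    """Return the list of approaches associated with a query level."""
    return TECHNIQUES[level]


print(techniques_for(QueryLevel.IMPLICIT_FACT))
```

Deciding which level a given query belongs to is the hard part and is not shown here.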
Addressing Explicit Fact Queries
Explicit fact queries are the simplest type, focusing on retrieving factual information directly stated in the provided data. The most common approach for addressing these queries is using basic RAG, where the LLM retrieves relevant information from a knowledge base and uses it to generate a response. However, even with explicit fact queries, RAG pipelines face several challenges at each stage.
At the indexing stage, where the RAG system creates a store of data chunks that can be later retrieved as context, it might have to deal with large and unstructured datasets, potentially containing multi-modal elements like images and tables. This can be addressed with multi-modal document parsing and multi-modal embedding models that can map the semantic context of both textual and non-textual elements into a shared embedding space.
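For the text-only part of indexing, a minimal sketch of chunking might look like the following. It assumes plain text split on whitespace and a fixed word window with overlap; the function name `chunk_text` and its parameters are hypothetical, and multi-modal parsing and embedding are not shown.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into windows of `chunk_size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.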
At the information retrieval stage, the system must make sure that the retrieved data is relevant to the user’s query. Here, developers can use techniques that improve the alignment of queries with document stores. For example, an LLM can generate synthetic answers for the user’s query. The answers per se might not be accurate, but their embeddings can be used to retrieve documents that contain relevant information.
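The synthetic-answer idea can be illustrated with a toy example. Everything here is an assumption for demonstration purposes: `embed` is a bag-of-words stand-in for a real embedding model, `draft_answer` is a hard-coded placeholder for an LLM call, and the documents are invented. Note that the draft answer states the wrong number of days, yet its wording still matches the right document.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query_text, documents, k=1):
    """Return the k documents most similar to query_text."""
    query_vec = embed(query_text)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def draft_answer(question):
    """Hypothetical placeholder for an LLM call; the draft may be factually wrong."""
    return "The refund window is 14 days after the delivery date"


documents = [
    "Customers may request a refund within 30 days after the delivery date",
    "How long do I have to wait for a reply from support",
    "Our offices are closed on public holidays",
]

question = "How long do I have to send something back"
direct = retrieve(question, documents)                    # matches on question wording
via_draft = retrieve(draft_answer(question), documents)   # matches on answer wording
print(direct[0])
print(via_draft[0])
```

Searching with the raw question picks the support-reply document because it shares surface wording, while searching with the draft answer picks the refund policy.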
During the answer generation stage, the model must determine whether the retrieved information is sufficient to answer the question and find the right balance between the given context and its own internal knowledge. Specialized fine-tuning techniques can help the LLM learn to ignore irrelevant information retrieved from the knowledge base. Joint training of the retriever and response generator can also lead to more consistent performance.
Handling Implicit Fact Queries
Implicit fact queries require the LLM to go beyond simply retrieving explicitly stated information and perform some level of reasoning or deduction to answer the question. For example, a user might ask “How many products did company X sell in the last quarter?” or “What are the main differences between the strategies of company X and company Y?” Answering these queries requires combining information from multiple sources within the knowledge base, a process sometimes referred to as “multi-hop question answering.”
Implicit fact queries introduce additional challenges, including the need to coordinate multiple context retrievals and to integrate reasoning and retrieval effectively. These queries call for advanced RAG techniques, such as Interleaving Retrieval with Chain-of-Thought (IRCoT) and Retrieval Augmented Thoughts (RAT), which use chain-of-thought prompting to guide the retrieval process based on previously retrieved information. Another promising approach involves combining knowledge graphs with LLMs, which helps with complex reasoning and with linking different concepts. On the UBOS platform, the Chroma DB integration provides a vector store that developers can use for the retrieval side of such pipelines.
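The interleaving of reasoning and retrieval can be sketched as a simple loop: retrieve, produce one reasoning step, use that step as the next retrieval query, and repeat. This is a simplified illustration in the spirit of IRCoT, not the published algorithm. The word-overlap retriever, the corpus, and `scripted_thought` (a stand-in for an LLM that produces the next reasoning step) are all assumptions made for this example.

```python
def retrieve(query, corpus):
    """Toy retriever: return the passage that shares the most words with the query."""
    query_words = set(query.lower().split())
    return max(corpus, key=lambda passage: len(query_words & set(passage.lower().split())))


def interleaved_qa(question, corpus, next_thought, max_steps=4):
    """Alternate between one reasoning step and one retrieval until an answer appears."""
    context = [retrieve(question, corpus)]
    thoughts = []
    for _ in range(max_steps):
        thought = next_thought(question, context, thoughts)
        thoughts.append(thought)
        if thought.startswith("Answer:"):
            break
        # The latest reasoning step becomes the next retrieval query.
        passage = retrieve(thought, corpus)
        if passage not in context:
            context.append(passage)
    return thoughts, context


def scripted_thought(question, context, thoughts):
    """Hypothetical stand-in for an LLM reasoning step, scripted for this corpus."""
    if "France" in " ".join(context):
        return "Answer: France"
    return "Lyon is located in which country"


corpus = [
    "Acme Corp is headquartered in the city of Lyon",
    "Lyon is located in France",
    "Bananas are rich in potassium",
]

thoughts, context = interleaved_qa(
    "Which country is Acme Corp headquartered in", corpus, scripted_thought
)
print(thoughts)
print(context)
```

The first retrieval only finds the headquarters passage. The second passage is reached because the intermediate reasoning step, not the original question, is used as the query.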
Applying Interpretable Rationales
Interpretable rationale queries require LLMs to not only understand factual content but also apply domain-specific rules. These rationales might not be present in the LLM’s pre-training data, but they are often explicitly provided in the knowledge corpus. For example, a customer service chatbot might need to integrate documented guidelines on handling returns or refunds with the context provided by a customer’s complaint.
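At its simplest, supplying the rationale means placing the documented rules in the prompt next to the customer's message. The sketch below shows one possible layout; the function name `build_prompt`, the wording of the instructions, and the sample guidelines are assumptions for illustration, not a recommended template.

```python
def build_prompt(guidelines, complaint):
    """Place numbered guidelines ahead of the customer message so the model can cite them."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(guidelines, start=1))
    return (
        "Follow the guidelines below and name the guideline number you apply.\n"
        f"Guidelines:\n{numbered}\n\n"
        f"Customer message:\n{complaint}\n\n"
        "Response:"
    )


guidelines = [
    "Refunds are available within 30 days of delivery.",
    "Opened software is eligible for exchange only.",
]
prompt = build_prompt(guidelines, "I opened the software box last week and want my money back.")
print(prompt)
```

Numbering the rules makes it easier to check afterwards whether the response applied the rule it claims to have applied.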
One of the key challenges in handling these queries is integrating the provided rationales into the LLM and ensuring that it follows them accurately. Prompt tuning techniques, such as those that use reinforcement learning and reward models, can enhance the LLM’s ability to adhere to specific rationales. LLMs can also be used to optimize prompts, as demonstrated by DeepMind’s OPRO technique, in which one model proposes candidate prompts and their measured performance is fed back to guide the next round of proposals.
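A propose-and-score loop of this kind can be sketched in a few lines. This is a simplified illustration rather than the OPRO implementation: `toy_propose` stands in for the model that writes new prompts, `toy_score` stands in for an evaluation on real tasks, and both, along with the phrases being scored, are assumptions made for this example.

```python
def optimize_prompt(seed, propose, score, rounds=3):
    """Repeatedly propose a new prompt from the scored history and keep the best one."""
    history = [(seed, score(seed))]
    for _ in range(rounds):
        # Show the proposer all past prompts with their scores, lowest score first.
        trajectory = sorted(history, key=lambda pair: pair[1])
        candidate = propose(trajectory)
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])


REQUIRED = ("refund policy", "step by step", "cite the rule")
ADDITIONS = ["Think step by step.", "Always cite the rule you apply.", "Be polite."]


def toy_score(prompt):
    """Hypothetical scorer: counts how many required phrases the prompt contains."""
    return sum(phrase in prompt for phrase in REQUIRED)


def toy_propose(trajectory):
    """Hypothetical proposer: extends the best prompt so far with one more instruction."""
    best_prompt, _ = trajectory[-1]
    return best_prompt + " " + ADDITIONS[(len(trajectory) - 1) % len(ADDITIONS)]


best_prompt, best_score = optimize_prompt(
    "Answer using the refund policy.", toy_propose, toy_score
)
print(best_score, best_prompt)
```

In a real setup the scorer would run each candidate prompt against a set of labeled tasks, which is where most of the cost lies.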
Developers can also use the chain-of-thought reasoning capabilities of LLMs to handle complex rationales. However, manually designing chain-of-thought prompts for interpretable rationales can be time-consuming. Techniques such as Automate-CoT can help automate this process by using the LLM itself to create chain-of-thought examples from a small labeled dataset.
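The core idea of building chain-of-thought examples from a small labeled dataset can be sketched as generate-then-filter: sample candidate reasoning chains for each labeled question and keep one whose final answer matches the label. The code below shows only that step and is not the full Automate-CoT method. `scripted_rationales` is a hypothetical stand-in for sampling from an LLM, and the dataset is invented.

```python
def build_cot_examples(labeled, generate_rationales):
    """Keep, for each labeled question, the first generated rationale that reaches the gold answer."""
    kept = []
    for question, gold in labeled:
        for rationale, answer in generate_rationales(question):
            if answer == gold:
                kept.append({"question": question, "rationale": rationale, "answer": answer})
                break
    return kept


def scripted_rationales(question):
    """Hypothetical stand-in for an LLM sampling several (rationale, answer) pairs."""
    samples = {
        "Item bought 10 days ago, unopened. Refund?": [
            ("The item is older than a week, so no refund applies.", "no"),
            ("The purchase is within 30 days and the item is unopened, so a refund applies.", "yes"),
        ],
        "Item bought 45 days ago. Refund?": [
            ("The customer asked politely, so a refund applies.", "yes"),
        ],
    }
    return samples.get(question, [])


labeled = [
    ("Item bought 10 days ago, unopened. Refund?", "yes"),
    ("Item bought 45 days ago. Refund?", "no"),
]

examples = build_cot_examples(labeled, scripted_rationales)
for example in examples:
    print(example["question"], "->", example["rationale"])
```

The second question yields no example because none of its sampled chains reaches the labeled answer, which is the point of the filter.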
Uncovering Hidden Rationales
Hidden rationale queries present the most significant challenge. These queries involve domain-specific reasoning methods that are not explicitly stated in the data. The LLM must uncover these hidden rationales and apply them to answer the question. For instance, the model might have access to historical data that implicitly contains the knowledge required to solve a problem. The model needs to analyze this data, extract relevant patterns, and apply them to the current situation.
The challenges of hidden rationale queries include retrieving information that is logically or thematically related to the query, even when it is not semantically similar. Also, the knowledge required to answer the query often needs to be consolidated from multiple sources. Some methods use the in-context learning capabilities of LLMs to teach them how to select and extract relevant information from multiple sources and form logical rationales.
Other approaches focus on generating logical rationale examples for few-shot and many-shot prompts. However, addressing hidden rationale queries effectively often requires some form of fine-tuning, particularly in complex domains. This fine-tuning is usually domain-specific and involves training the LLM on examples that enable it to reason over the query and determine what kind of external information it needs.
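A few-shot prompt built from past cases might be assembled as sketched below. Cases are selected by shared topic tags rather than by similar wording, as a rough stand-in for retrieving thematically related material. The tags, the sample cases, and the function names `select_cases` and `few_shot_prompt` are all assumptions for this illustration.

```python
def select_cases(history, tags, k=2):
    """Pick the k past cases that share the most tags with the new case."""
    wanted = set(tags)
    ranked = sorted(history, key=lambda case: len(wanted & set(case["tags"])), reverse=True)
    return ranked[:k]


def few_shot_prompt(cases, new_case):
    """Format past cases as demonstrations, then leave the reasoning for the new case open."""
    demos = "\n\n".join(
        f"Case: {c['case']}\nReasoning: {c['rationale']}\nDecision: {c['decision']}"
        for c in cases
    )
    return f"{demos}\n\nCase: {new_case}\nReasoning:"


# Invented historical records, for illustration only.
history = [
    {"case": "Supplier A delivered late twice before a holiday.",
     "tags": ["supplier", "delay"],
     "rationale": "Repeated delays before peak periods suggest a capacity problem.",
     "decision": "Add a backup supplier."},
    {"case": "Customer B cancelled after a price increase.",
     "tags": ["pricing", "churn"],
     "rationale": "The cancellation followed the increase directly.",
     "decision": "Offer a loyalty discount."},
    {"case": "Supplier C shipped late after cutting staff.",
     "tags": ["supplier", "delay", "staffing"],
     "rationale": "Lower staffing reduced throughput.",
     "decision": "Renegotiate delivery terms."},
]

chosen = select_cases(history, ["supplier", "delay"])
prompt = few_shot_prompt(chosen, "Supplier D missed two deadlines this quarter.")
print(prompt)
```

In practice the tags would have to come from somewhere, for example from metadata or from a model that labels each case, and that step is not shown.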
Implications for Building LLM Applications
The survey and framework compiled by the Microsoft Research team show how far LLMs have come in using external data for practical applications. However, it is also a reminder that many challenges have yet to be addressed. Enterprises can use this framework to make more informed decisions about the best techniques for integrating external knowledge into their LLMs.
RAG techniques can go a long way toward overcoming many of the shortcomings of vanilla LLMs. However, developers must also be aware of the limitations of the techniques they use and know when to upgrade to more complex systems or avoid using LLMs altogether. By understanding the different levels of RAG tasks and the associated challenges, developers can build more effective and reliable data-augmented LLM applications.
For those interested in exploring this topic further, the original research paper by Microsoft can be accessed at https://arxiv.org/abs/2409.14924.