Exploring RAG and its components
RAG
As the usage of Large Language Models (LLMs) continues to grow, it becomes increasingly important to consider how we retrieve and generate answers to complex queries. RAG (Retrieval-Augmented Generation) is a technique that combines document retrieval with model generation: it takes a user query, finds relevant documents, and uses those documents as context to generate a more informed and accurate answer. Let’s explore RAG and its various components today!
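As a rough sketch, the whole loop fits in a few lines. Here, `embed`, `vector_store`, and `llm` are hypothetical stand-ins for your embedding model, vector database, and language model; the same stand-ins are reused in all the sketches that follow.

```python
def rag_answer(question: str) -> str:
    # Retrieve: find the documents closest to the query in embedding space.
    docs = vector_store.search(embed(question), k=5)
    # Augment: assemble the retrieved text into the prompt as context.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Generate: the LLM answers grounded in the retrieved context.
    return llm(prompt)
```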
Query translation
Multi Query
- From a single user input, multiple similar queries (for example, paraphrases) are generated, and all of these queries are used to search the embedding space.
- This is particularly useful when the original query is poorly suited to the embedding space and would retrieve no similar documents on its own. The results from all the generated queries can then be combined into a single, comprehensive output, as in the sketch below.
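A minimal sketch, reusing the hypothetical `llm`, `embed`, and `vector_store` helpers from above:

```python
def multi_query_retrieve(question: str, n: int = 3) -> list:
    # Ask the LLM for paraphrases of the original question.
    prompt = f"Write {n} different paraphrases of this question:\n{question}"
    paraphrases = llm(prompt).splitlines()
    # Search with the original question plus every paraphrase,
    # merging results and deduplicating across the lists.
    seen, merged = set(), []
    for q in [question, *paraphrases]:
        for doc in vector_store.search(embed(q), k=5):
            if doc.id not in seen:
                seen.add(doc.id)
                merged.append(doc)
    return merged
```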
RAG Fusion
- Starts the same way as Multi Query: each LLM-generated query returns its own list of documents.
- All of these documents are then ranked together, and the top X ranked documents are passed to the LLM along with the original user input to produce one final answer.
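A common choice for this ranking step is Reciprocal Rank Fusion, which rewards documents that appear near the top of many result lists. A minimal sketch:

```python
def reciprocal_rank_fusion(result_lists: list, k: int = 60) -> list:
    # Each document scores 1 / (k + rank) in every list it appears in;
    # documents ranked highly across many lists accumulate the most.
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)  # best doc ids first
```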
Decomposition
- The user input is broken down into sub-questions that will be solved individually.
- These are then answered sequentially, and the answers to earlier sub-questions are used as context when answering later ones.
- The answer to the last question is then provided as the final output (the sketch after this list shows this sequential variant).
- An alternate version exists where past answers are not fed into later questions. Instead, all individual question-answer pairs are combined at the end to generate a final answer to the original user question.
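A minimal sketch of the sequential variant, assuming the same hypothetical `llm` helper:

```python
def decompose_and_answer(question: str) -> str:
    subs = llm(f"Break this into sub-questions, one per line:\n{question}").splitlines()
    qa_pairs = []
    for sub in subs:
        # Earlier question-answer pairs become context for the next sub-question.
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
        answer = llm(f"{history}\n\nUsing the answers above as context, answer: {sub}")
        qa_pairs.append((sub, answer))
    # The answer to the last sub-question is returned as the final output.
    return qa_pairs[-1][1]
```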
Step Back
- Instead of answering the user question directly, a more abstract (“step-back”) question is generated and answered first, and its response is then used as context to answer the user’s original question, as sketched below.
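A minimal sketch with the hypothetical `llm` helper:

```python
def step_back_answer(question: str) -> str:
    # First generate and answer a more generic version of the question.
    abstract_question = llm(f"Rewrite this as a more general question:\n{question}")
    background = llm(abstract_question)
    # Then use that broader answer as context for the original question.
    return llm(f"Background:\n{background}\n\nNow answer: {question}")
```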
HyDe
- To judge how close in meaning one piece of text is to another, cosine similarity is often used once both pieces are represented as vectors in the same embedding space.
- To find a document containing a close enough answer to an input question, the two cannot be compared directly, since they are two different kinds of entity (a document vs. a question).
- To solve this, a hypothetical document is generated from the input query, and this document is then used to find similar documents from the existing set of actual documents, with the expectation that those retrieved documents will contain the answer to the initial user query. A sketch follows this list.
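A minimal sketch, with a runnable cosine similarity plus the hypothetical `llm`, `embed`, and `vector_store` helpers:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hyde_retrieve(question: str) -> list:
    # Generate a hypothetical document that *answers* the question...
    fake_doc = llm(f"Write a short passage answering:\n{question}")
    # ...then search with its embedding, so the comparison is
    # document-to-document rather than question-to-document.
    return vector_store.search(embed(fake_doc), k=5)
```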
Routing
Directing the user input to the appropriate data source. There are two ways to do so:
Logical routing
- Uses a model to choose the data source whose domain matches the query.
Semantic routing
- Uses internal, pre-defined prompts: the user input is compared against each of them with a similarity search.
- The prompt with the highest similarity decides how the final answer will be generated (for example, by specifying the domain of the query). Both routing styles are sketched below.
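A minimal sketch of both routing styles, reusing the hypothetical `llm` and `embed` helpers:

```python
import numpy as np

def logical_route(question: str, sources: list) -> str:
    # Ask the model directly which data source matches the query's domain.
    return llm(f"Pick exactly one of {sources} for this question:\n{question}")

def semantic_route(question: str, routes: dict) -> str:
    # `routes` maps a route name to its pre-defined prompt.
    q = embed(question)
    def sim(text: str) -> float:
        v = embed(text)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    # The highest-similarity prompt wins and decides how to answer.
    return max(routes, key=lambda name: sim(routes[name]))
```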
Query construction
Involves going from an unstructured, natural-language input to a structured query object tailored to the details of the target database. It can be used, for example, to generate metadata search queries corresponding to an input text command.
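As a rough sketch, this can be done by asking the model to emit JSON constrained to the database’s metadata fields; the schema below is a made-up example:

```python
import json

def construct_query(text: str) -> dict:
    # Constrain the model to the metadata fields the database actually has
    # (this schema is illustrative only).
    schema = '{"genre": "str or null", "year_min": "int or null", "year_max": "int or null"}'
    raw = llm(f"Convert this request into a JSON object matching {schema}:\n{text}")
    return json.loads(raw)  # assumes the model returned valid JSON

# e.g. construct_query("sci-fi movies released after 2010")
# -> {"genre": "sci-fi", "year_min": 2011, "year_max": None}
```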
Indexing
It involves generating numerical representations of text documents that can be efficiently searched for relevance to user inputs.
Multi representation Indexing
- Each document in the database is converted into a summary that captures the document’s information and is optimized for retrieval. Each summary stays linked to its source document.
- Any input question is matched against the summaries, and the output is the full document originally linked to the best-matching summary, as sketched below.
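A minimal sketch, assuming hypothetical `documents` and `doc_store` collections alongside the usual `llm` and `embed` helpers:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build time: embed one retrieval-optimized summary per document, keyed by id.
summary_vecs = {doc.id: embed(llm(f"Summarize for retrieval:\n{doc.text}"))
                for doc in documents}

def retrieve_full_document(question: str):
    q = embed(question)
    best_id = max(summary_vecs, key=lambda i: cos(q, summary_vecs[i]))
    return doc_store[best_id]  # return the linked full document, not its summary
```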
RAPTOR Indexing
- All documents in the database are clustered, and each cluster is summarized. The summaries are then clustered and summarized again, and the process repeats until one summary remains or a fixed number of levels has been built.
- The user input is now matched against these higher-level summaries as well, which allows many more documents to be considered at once while generating an output.
- This shows improvement over simple document similarity search, where only the top Y documents can be considered at once. A sketch of the tree-building step follows.
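A minimal sketch of building the summary tree; `cluster` is a hypothetical grouping step (for example, k-means over embeddings):

```python
def raptor_build(docs: list, levels: int = 3) -> list:
    # The index holds the leaf documents *and* every level of summaries above them.
    index, layer = list(docs), list(docs)
    for _ in range(levels):
        if len(layer) == 1:
            break  # stop once a single root summary remains
        clusters = cluster(layer)  # hypothetical: group similar texts together
        layer = [llm("Summarize these passages:\n" + "\n".join(c)) for c in clusters]
        index.extend(layer)
    return index
```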
ColBERT Indexing
- So far, all techniques embed a whole document into a single vector, which might lead to information loss. ColBERT instead breaks a document into tokens, each of which is embedded separately.
- Any input query is also broken into tokens and embedded in the same space.
- For each query token, its similarity is computed against all document tokens, and the maximum similarity is kept.
- The document’s similarity score is the sum, over all query tokens, of these per-token maximum similarities (the “MaxSim” operation, sketched below).
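The scoring step is fully runnable with numpy, assuming token embeddings are L2-normalized so dot products act as cosine similarities:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (n_query_tokens, dim); doc_vecs: (n_doc_tokens, dim)
    sims = query_vecs @ doc_vecs.T  # similarity of every query/document token pair
    # Max over document tokens for each query token, summed over query tokens.
    return float(sims.max(axis=1).sum())
```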
Recent advancements
CRAG: Corrective-RAG
This is one of the strategies of Active RAG, wherein the user query is used to retrieve documents, which are then graded for relevance to the query.
- Any relevant documents are passed as is to the context, whereas irrelevant documents trigger a query transformation followed by an external information search using the transformed query (for example, a web search).
- The result of this external search is added to the context along with the other documents, and a final output is then generated, as in the sketch below.
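A minimal sketch, where `web_search` is a hypothetical external search helper alongside the usual `llm`, `embed`, and `vector_store` stand-ins:

```python
def crag_answer(question: str) -> str:
    docs = vector_store.search(embed(question), k=5)
    # Grade each retrieved document for relevance with the LLM.
    verdicts = [(d, llm(f"Is this relevant to '{question}'? Answer yes or no:\n{d.text}"))
                for d in docs]
    context = [d.text for d, v in verdicts if "yes" in v.lower()]
    if len(context) < len(docs):
        # Some documents were irrelevant: transform the query and search externally.
        transformed = llm(f"Rewrite this as a web search query:\n{question}")
        context += web_search(transformed)  # hypothetical external search
    joined = "\n\n".join(context)
    return llm(f"Context:\n{joined}\n\nQuestion: {question}")
```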