What is Retrieval Augmented Generation (RAG)?

Large Language Models (LLMs) like GPT-4 or Mistral 7B are extraordinarily capable in many ways, yet they come with challenges.

For now, let's focus on one specific limitation: the timeliness of their data. Because these models are trained on data up to a particular cut-off date, they aren't well-suited for real-time or organization-specific information.

Imagine you're a developer architecting an LLM-enabled app for Amazon. You're aiming to support shoppers as they comb through Amazon for the latest deals on sneakers. Naturally, you want to furnish them with the most current offers available. After all, nobody wants to rely on outdated information, and the same holds for data queried from your LLM.

This is where Retrieval-Augmented Generation, commonly known as RAG, significantly improves the capabilities of LLMs.

Think of RAG as a resourceful friend in an exam setting or during a speech who (figuratively speaking, of course) swiftly passes you the most relevant "cue card" from a huge stack of notes, so you know what to write or say next.

Perhaps that wasn't the perfect example, but you get the point. 😄

With RAG, efficiently retrieving the most relevant data for your use case ensures the generated text is both current and substantiated.

RAG, as its name indicates, operates through a three-fold process:

  • Retrieval: the system fetches information pertinent to the user's query from an external source.
  • Augmentation: this retrieved information is added to the model's input prompt.
  • Generation: finally, the LLM uses the augmented prompt to produce an informed output.

Simply put, RAG empowers LLMs to include real-time, reliable data from external databases in their generated text.
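
To make the three steps concrete, here is a minimal, self-contained sketch of the retrieve-augment-generate loop. The `embed` and `generate` callables are hypothetical stand-ins for whatever embedding model and LLM you plug in, and the brute-force cosine-similarity retrieval is deliberately naive; in practice a vector index would do this ranking.

```python
from typing import Callable

def rag_answer(
    question: str,
    documents: list[str],
    embed: Callable[[str], list[float]],   # hypothetical: text -> embedding vector
    generate: Callable[[str], str],        # hypothetical: prompt -> completion
    top_k: int = 3,
) -> str:
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    # 1. Retrieval: rank documents by similarity to the question.
    q_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    context = "\n".join(ranked[:top_k])

    # 2. Augmentation: add the retrieved context to the model's input prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generation: the LLM produces an answer grounded in the retrieved data.
    return generate(prompt)
```

The key design point is that the LLM itself is unchanged: freshness comes entirely from what you retrieve and place into the prompt at query time.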

For a deeper explanation, check out this video by Mudit, where he walks through the challenges with LLMs that Retrieval-Augmented Generation resolves and shows how RAG works in action.

In this video, you'll also learn about:

  • The User Interface Component, through which users pose their questions.
  • The Storage Layer, which utilizes vector indexes or vector DBs (e.g. Pinecone, Weaviate, Pathway In-memory Vector Store, ChromaDB, etc.); see the storage-layer sketch after this list.
  • The Service, Chain, or Pipeline Layer, which is central to how the model operates (with a brief mention of the Chain Library used for chaining prompts).
  • A summary of our learnings around LLM architecture components.
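
As a small illustration of the Storage Layer, here is a sketch using ChromaDB, one of the vector stores mentioned above (any of the others would play the same role). The collection name and documents are made up for the sneaker-deals example.

```python
import chromadb

client = chromadb.Client()  # in-memory client; suitable for experimentation
collection = client.create_collection(name="sneaker_deals")

# Index a few documents; Chroma embeds them with its default embedding model.
collection.add(
    ids=["deal-1", "deal-2"],
    documents=[
        "Nike Air Zoom on sale for $79 this week.",
        "Adidas running shoes discounted 30% through Sunday.",
    ],
)

# Retrieve the documents most relevant to a shopper's question.
results = collection.query(query_texts=["latest sneaker discounts"], n_results=2)
print(results["documents"])
```

At query time, only the top-ranked documents are handed to the LLM as context, which keeps the prompt compact while keeping the answers current.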

Next, let's look at a cleaner architecture diagram, walk through the various steps of the pipeline, and summarize the advantages of RAG based on what we've understood so far.