Simple RAG: A Foundational Approach to Retrieval-Augmented Generation
Introduction to Simple RAG
Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with text generation. By retrieving relevant external knowledge and supplying it to the language model alongside the query, it improves the accuracy and factual grounding of the model's answers.
In a Simple RAG setup, we follow these steps:
- Data Ingestion: Load and preprocess the text data.
- Chunking: Break the data into smaller chunks to improve retrieval performance.
- Embedding Creation: Convert the text chunks into numerical representations using an embedding model.
- Semantic Search: Retrieve relevant chunks based on a user query.
- Response Generation: Use a language model to generate a response based on retrieved text.
This post implements a Simple RAG approach, evaluates the model's response, and explores various improvements.
Setting Up the Environment
We begin by importing the necessary libraries.
import fitz  # PyMuPDF
import os
import numpy as np
import json
from openai import OpenAI
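If any of these libraries are missing, they can typically be installed with pip, for example: pip install pymupdf numpy openai (the fitz module is provided by the PyMuPDF package).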
Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """
    mypdf = fitz.open(pdf_path)
    all_text = ""

    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]
        text = page.get_text("text")
        all_text += text

    return all_text
Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
        text (str): The text to be chunked.
        n (int): The number of characters in each chunk.
        overlap (int): The number of overlapping characters between chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    chunks = []
    # Step forward by n - overlap characters so consecutive chunks overlap
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])
    return chunks
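To see how the overlap behaves, here is a small illustrative call (the toy string and parameters are ours, not part of the tutorial data):

# Step size is n - overlap = 2, so consecutive chunks share two characters.
chunk_text("abcdefghij", n=4, overlap=2)
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']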
Setting Up the OpenAI API Client
We initialize an OpenAI-compatible client (pointed here at the Nebius AI Studio endpoint) to generate embeddings and responses.
client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.getenv("OPENAI_API_KEY")
)
Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.
pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
text_chunks = chunk_text(extracted_text, 1000, 200)
print(f"Number of text chunks: {len(text_chunks)}")
print("\nFirst text chunk:")
print(text_chunks[0])
Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.
def create_embeddings(text, model="BAAI/bge-en-icl"):
    """
    Creates embeddings for the given text (a single string or a list of
    strings) using the specified embedding model.
    """
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response
response = create_embeddings(text_chunks)
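Note that this sends the entire chunk list in a single request. If your provider limits the number of inputs per request, the chunks can be embedded in batches instead; the helper below is a sketch under that assumption (the name create_embeddings_batched and the batch size of 100 are ours):

def create_embeddings_batched(texts, model="BAAI/bge-en-icl", batch_size=100):
    """Embeds texts in batches and returns a flat list of embedding objects."""
    all_items = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_response = client.embeddings.create(model=model, input=batch)
        all_items.extend(batch_response.data)
    return all_items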
Performing Semantic Search
We implement cosine similarity (the dot product of two vectors divided by the product of their magnitudes) to find the text chunks most relevant to a user query.
def cosine_similarity(vec1, vec2):
    """Computes the cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def semantic_search(query, text_chunks, embeddings, k=5):
    """Returns the k text chunks most similar to the query."""
    # Embed the query with the same model used for the chunks
    query_embedding = create_embeddings(query).data[0].embedding

    # Score every chunk against the query
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))

    # Sort by similarity (highest first) and keep the indices of the top k
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [text_chunks[index] for index in top_indices]
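Looping over chunks is clear but slow for large corpora. As an optional alternative, the similarities can be computed in one vectorized step with NumPy; this sketch assumes the same inputs as semantic_search above, and the function name is ours:

def semantic_search_vectorized(query, text_chunks, embeddings, k=5):
    """Same retrieval as semantic_search, but scores all chunks at once."""
    query_vec = np.array(create_embeddings(query).data[0].embedding)
    chunk_matrix = np.array([item.embedding for item in embeddings])
    # Cosine similarity of the query against every chunk in a single pass
    scores = chunk_matrix @ query_vec / (
        np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # Indices of the k highest-scoring chunks, best first
    top_indices = np.argsort(scores)[::-1][:k]
    return [text_chunks[i] for i in top_indices]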
Running a Query on Extracted Chunks
We load the first question from the validation set and retrieve the two most relevant chunks for it.
with open('data/val.json') as f:
    data = json.load(f)

query = data[0]['question']
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

print(f"Query: {query}")
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Generating a Response Based on Retrieved Chunks
We define a system prompt that instructs the model to answer strictly from the retrieved context, then build the user prompt from the top chunks and the question.
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}\n"
ai_response = generate_response(system_prompt, user_prompt)
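The generated answer can be inspected by printing the message content from the returned completion object:

print(ai_response.choices[0].message.content)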
Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
print(evaluation_response.choices[0].message.content)
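Since the evaluator replies in free-form text, a small helper can pull out the numeric score for aggregation across multiple queries; this is a sketch of one approach (the regex and the function name extract_score are ours):

import re

def extract_score(evaluation_text):
    """Finds the first 0, 0.5, or 1 in the evaluator's reply, if any."""
    match = re.search(r"\b(0\.5|[01])\b", evaluation_text)
    return float(match.group(1)) if match else None

print(f"Evaluation score: {extract_score(evaluation_response.choices[0].message.content)}")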