Oreoluwa
Semantic Chunking for RAG
June 3, 2025
12 min read

RAG
NLP
AI
Semantic Search

Introduction to Semantic Chunking

Text chunking is an essential step in Retrieval-Augmented Generation (RAG): large bodies of text are divided into meaningful segments so that retrieval returns focused, relevant context. Unlike fixed-length chunking, semantic chunking splits text where the semantic similarity between consecutive sentences drops.

Breakpoint Methods:

  • Percentile: Finds the Xth percentile of the similarity drops and splits wherever the drop between consecutive sentences exceeds this value.
  • Standard Deviation: Splits where similarity falls more than X standard deviations below the mean.
  • Interquartile Range (IQR): Splits where similarity falls below the standard lower outlier fence, Q1 - 1.5 * (Q3 - Q1).

This notebook implements semantic chunking using the percentile method and evaluates its performance on a sample text.
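To make the three methods concrete before wiring them into the pipeline, here is a toy sketch (with made-up similarity scores, for illustration only) of how each one turns the same array into a cutoff:

import numpy as np

# Made-up similarities between consecutive sentences, for illustration only
sims = np.array([0.91, 0.88, 0.35, 0.90, 0.87, 0.40, 0.89])

# Percentile: with threshold=90, splits fall in the lowest 10% of similarities
percentile_cutoff = np.percentile(sims, 100 - 90)

# Standard deviation: split more than 1 standard deviation below the mean
std_cutoff = np.mean(sims) - 1 * np.std(sims)

# IQR: split below the lower outlier fence, Q1 - 1.5 * (Q3 - Q1)
q1, q3 = np.percentile(sims, [25, 75])
iqr_cutoff = q1 - 1.5 * (q3 - q1)

print(percentile_cutoff, std_cutoff, iqr_cutoff)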

Setting Up the Environment

We begin by importing the necessary libraries.

import fitz  # PyMuPDF, used for PDF text extraction
import os
import numpy as np
import json
from openai import OpenAI

Extracting Text from a PDF File

To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.
    """
    mypdf = fitz.open(pdf_path)
    all_text = ""
    for page in mypdf:
        all_text += page.get_text("text") + " "
    return all_text.strip()

pdf_path = "data/AI_Information.pdf"
extracted_text = extract_text_from_pdf(pdf_path)
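
As a quick sanity check, we can preview the start of the extracted text (the preview length is arbitrary):

print(extracted_text[:500])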

Setting Up the OpenAI API Client

We initialize an OpenAI-compatible client, pointed at the Nebius AI Studio endpoint, to generate embeddings and chat responses.

client = OpenAI(
    base_url="https://api.studio.nebius.com/v1/",
    api_key=os.getenv("OPENAI_API_KEY")
)

Creating Sentence-Level Embeddings

We split the text into sentences and generate an embedding for each one.

def get_embedding(text, model="BAAI/bge-en-icl"):
    """
    Returns the embedding vector for a single piece of text as a NumPy array.
    """
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)

# Naive sentence split on ". "; a dedicated sentence tokenizer would be more robust
sentences = extracted_text.split(". ")
embeddings = [get_embedding(sentence) for sentence in sentences]
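
Embedding one sentence per request is slow for long documents. The embeddings endpoint also accepts a list of inputs, so a batched variant is possible; the sketch below assumes the Nebius endpoint accepts list inputs like the standard OpenAI API, and the batch size is an arbitrary choice:

def get_embeddings_batched(texts, model="BAAI/bge-en-icl", batch_size=64):
    # Send sentences in batches instead of one request per sentence
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        all_embeddings.extend(np.array(item.embedding) for item in response.data)
    return all_embeddings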

Calculating Similarity Differences

We compute cosine similarity between consecutive sentences.

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarities = [cosine_similarity(embeddings[i], embeddings[i + 1]) for i in range(len(embeddings) - 1)]
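
Equivalently, all consecutive similarities can be computed in one vectorized pass; this sketch assumes every embedding has the same dimensionality:

emb_matrix = np.vstack(embeddings)  # shape: (num_sentences, embedding_dim)
normed = emb_matrix / np.linalg.norm(emb_matrix, axis=1, keepdims=True)
similarities = list((normed[:-1] * normed[1:]).sum(axis=1))  # dot products of consecutive rows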

Implementing Semantic Chunking

We implement three different methods for finding breakpoints.

def compute_breakpoints(similarities, method="percentile", threshold=90):
    """
    Computes breakpoint indices. The meaning of `threshold` depends on the method:
    a percentile rank for "percentile", a number of standard deviations for
    "standard_deviation"; it is ignored for "interquartile".
    """
    if method == "percentile":
        # With threshold=90, only the lowest 10% of similarities become breakpoints
        threshold_value = np.percentile(similarities, 100 - threshold)
    elif method == "standard_deviation":
        mean = np.mean(similarities)
        std_dev = np.std(similarities)
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        q1, q3 = np.percentile(similarities, [25, 75])
        threshold_value = q1 - 1.5 * (q3 - q1)  # lower outlier fence
    else:
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")
    # A breakpoint is any position where similarity falls below the threshold
    return [i for i, sim in enumerate(similarities) if sim < threshold_value]

breakpoints = compute_breakpoints(similarities, method="percentile", threshold=90)

Splitting Text into Semantic Chunks

We split the text based on computed breakpoints.

def split_into_chunks(sentences, breakpoints):
    """
    Joins sentences back together, starting a new chunk after each breakpoint.
    """
    chunks = []
    start = 0
    for bp in breakpoints:
        chunks.append(". ".join(sentences[start:bp + 1]) + ".")
        start = bp + 1
    # Remaining sentences form the final chunk
    chunks.append(". ".join(sentences[start:]))
    return chunks

text_chunks = split_into_chunks(sentences, breakpoints)
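
Before embedding the chunks, it is worth a quick look at what the splitter produced:

print(f"Number of semantic chunks: {len(text_chunks)}")
print(f"First chunk:\n{text_chunks[0]}")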

Creating Embeddings for Semantic Chunks

We create embeddings for each chunk for later retrieval.

def create_embeddings(text_chunks):
    return [get_embedding(chunk) for chunk in text_chunks]

chunk_embeddings = create_embeddings(text_chunks)

Performing Semantic Search

We compare the query embedding against each chunk embedding using cosine similarity and return the k most relevant chunks.

def semantic_search(query, text_chunks, chunk_embeddings, k=5):
    query_embedding = get_embedding(query)
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    top_indices = np.argsort(similarities)[-k:][::-1]
    return [text_chunks[i] for i in top_indices]

with open('data/val.json') as f:
    data = json.load(f)
query = data[0]['question']
top_chunks = semantic_search(query, text_chunks, chunk_embeddings, k=2)
print(f"Query: {query}")
for i, chunk in enumerate(top_chunks):
    print(f"Context {i+1}:\n{chunk}\n{'='*40}")

Generating a Response Based on Retrieved Chunks

We define a system prompt that instructs the model to answer strictly from the retrieved context, then generate an answer with the chat model.

system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}\n"
ai_response = generate_response(system_prompt, user_prompt)
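
generate_response returns the full completion object, so the answer text lives at choices[0].message.content:

print(ai_response.choices[0].message.content)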

Evaluating the AI Response

We compare the AI response with the expected answer and assign a score.

evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
print(evaluation_response.choices[0].message.content)