Retrieval Augmented Generation (RAG) with Lucene and OpenAI

Nov 16, 2024

Introduction

This is a guide on how to implement retrieval augmented generation (RAG) with Lucene as the vector search engine. The goal is to find relevant data in Lucene and pass it to the AI as additional context when answering a query.

This is especially useful if you have more additional data than fits into the context window of the AI model. Instead, you can search for the most relevant documents via Lucene and pass only those as context.

The basic algorithm is:

  • Create a Lucene index with all your additional data
    • We split each document into smaller parts, as there is a maximum text length that can be used to create an embedding.
    • For each document part, we generate an embedding (= vector) using OpenAI
    • We store this embedding and its corresponding document part in the Lucene index
  • When processing a query, we first create an embedding for the query in the same way
  • We then use this query vector to search Lucene for the documents most similar to the query
  • As a last step, we pass the relevant context document parts to the OpenAI chat API to generate our final answer.

Gradle config

We will be using Lucene for indexing and searching and the OpenAI API for generating embeddings and chat AI.
Add the following dependencies to your build.gradle.kts file:

implementation("org.apache.lucene:lucene-core:10.0.0")
implementation("org.apache.httpcomponents.client5:httpclient5-fluent:5.4.1")
implementation("com.google.code.gson:gson:2.11.0")

Indexing documents

There are three steps to index documents:

  • Split each document into smaller chunks
  • Generate an embedding for each chunk
  • Store the chunk and its embedding in Lucene

Splitting documents into chunks

Before indexing the documents directly, we split them into smaller parts. This has two advantages:

  • There is a size limit on the input to the OpenAI embeddings API, so we can’t send long documents as a single input
  • Since we will use a similarity search to find related documents, we will get better results if we split each document into smaller parts that contain more focused content.

How to split depends on the type of input data. For example, when using Markdown files, each document is split by its headers and subheaders to create smaller chunks. To not completely lose the context of each chunk, we prepend all parent headers to it. The goal is that each chunk stays at or below 8192 characters, which keeps it comfortably within the OpenAI embedding input limit (the actual limit is measured in tokens, around 8000 of them).

Here is a very simple version of such a splitting algorithm:

List<String> splitMarkdown(String document) {
  var headerTokens = List.of("# ", "## ", "### ", "#### ");
  var lines = document.split("\n");
  var headers = new ArrayList<String>();
  var currentChunk = new StringBuilder();
  var result = new ArrayList<String>();
  for (var line : lines) {
    // Check whether this line starts a new (sub)section
    var header = headerTokens.stream()
      .filter(line::startsWith)
      .findFirst();
    if (header.isPresent()) {
      // Close the previous chunk, prefixed with its parent headers
      if (!headers.isEmpty() || currentChunk.length() > 0) {
        result.add(String.join("\n", headers) + currentChunk.toString());
      }
      currentChunk.setLength(0);
      var level = headerTokens.indexOf(header.get());
      // Drop headers that are at the same level or deeper than the new header
      while (level < headers.size()) {
        headers.removeLast();
      }
      headers.add(line + "\n");
    } else {
      currentChunk.append(line).append("\n");
    }
  }
  // Close the final chunk
  if (!headers.isEmpty() || currentChunk.length() > 0) {
    result.add(String.join("\n", headers) + currentChunk.toString());
  }
  return result;
}
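
As a quick sanity check, splitting a small Markdown document with the method above yields one chunk per section, each prefixed with its parent headers:

var chunks = splitMarkdown("""
    # Guide
    Intro text.
    ## Setup
    Install the tool.
    """);
// chunks now contains two entries:
//   "# Guide\nIntro text.\n"
//   "# Guide\n\n## Setup\nInstall the tool.\n"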

Generate embeddings

To create embeddings, we will use the OpenAI Embeddings API.

The function below takes any text and returns a vector embedding for it. It makes a simple HTTP call to the OpenAI API.

public static float[] computeEmbedding(String input) throws IOException {
  var requestBody = GSON.toJson(new EmbeddingRequest(input, "text-embedding-3-small"));
  var response = Request.post("https://api.openai.com/v1/embeddings")
    .addHeader("Content-Type", "application/json")
    .addHeader("Authorization", "Bearer " + API_KEY)
    .bodyString(requestBody, ContentType.APPLICATION_JSON)
    .execute();
  var embeddingResponse = GSON.fromJson(response.returnContent().asString(),
    EmbeddingResponse.class);
  return embeddingResponse.data().getFirst().embedding();
}

record EmbeddingRequest(String input, String model) {}

record EmbeddingResponse(String model, List<EmbeddingData> data) {}

record EmbeddingData(int index, float[] embedding) {}
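
As a quick check: for the text-embedding-3-small model, the returned vector has 1536 dimensions.

float[] vector = computeEmbedding("What is retrieval augmented generation?");
System.out.println(vector.length); // prints 1536 for text-embedding-3-small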

Building the index

Now we can combine these functions to build our index. The inputDocuments parameter is assumed to be a list of Markdown documents.

Our Lucene index will have two fields per document:

  • A text field contents for the original text
  • A vector field contents-vector for the embedding

void createIndex(List<String> inputDocuments) throws IOException {
  // Prepare documents by splitting them into smaller chunks
  var chunks = inputDocuments.stream()
    .flatMap(doc -> splitMarkdown(doc).stream())
    .collect(Collectors.toList());

  // Where the index will be stored
  var indexDirectory = FSDirectory.open(Paths.get("./index/"));
  var analyzer = new StandardAnalyzer();
  var config = new IndexWriterConfig(analyzer);

  // Set up the config with a custom codec that allows vectors of up to 2048
  // dimensions. This is necessary if the embeddings have more than 1024
  // dimensions, since the Lucene default maximum is 1024.
  // See below for the implementation of CustomCodec.
  config.setCodec(new CustomCodec());

  try (var indexWriter = new IndexWriter(indexDirectory, config)) {
    for (String input : chunks) {
      // Index document with both the original text and its embedding
      var document = new Document();
      document.add(new TextField("contents", input, Field.Store.YES));
      var vector = OpenAI.computeEmbedding(input);
      // DOT_PRODUCT assumes unit-length vectors; OpenAI embeddings are
      // already normalized to length 1, so this is safe here.
      document.add(new KnnFloatVectorField("contents-vector", vector,
        VectorSimilarityFunction.DOT_PRODUCT));
      indexWriter.addDocument(document);
    }
  }
}

Extending Lucene to allow for vector fields with more than 1024 dimensions

By default, Lucene allows a maximum of 1024 dimensions for a vector field. However, OpenAI embeddings have more dimensions than that (1536 for text-embedding-3-small). One workaround is to truncate or shorten the OpenAI embeddings to 1024 values.
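
The simplest way to do that is to request shorter vectors directly from the API: the text-embedding-3 models accept a dimensions parameter that shortens the returned embedding. Here is a minimal sketch of such a variation (the names TruncatedEmbeddingRequest and computeTruncatedEmbedding are made up for this example):

// Variation of computeEmbedding that requests 1024-dimensional vectors via the
// "dimensions" parameter, so Lucene's default limit is never exceeded.
record TruncatedEmbeddingRequest(String input, String model, int dimensions) {}

public static float[] computeTruncatedEmbedding(String input) throws IOException {
  var requestBody = GSON.toJson(
    new TruncatedEmbeddingRequest(input, "text-embedding-3-small", 1024));
  var response = Request.post("https://api.openai.com/v1/embeddings")
    .addHeader("Content-Type", "application/json")
    .addHeader("Authorization", "Bearer " + API_KEY)
    .bodyString(requestBody, ContentType.APPLICATION_JSON)
    .execute();
  var embeddingResponse = GSON.fromJson(response.returnContent().asString(),
    EmbeddingResponse.class);
  return embeddingResponse.data().getFirst().embedding();
}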

Alternatively, it is possible to extend Lucene with a custom codec that allows more than 1024 dimensions. We wrap the Lucene default codec with a custom vector format that delegates to the default format but overrides the getMaxDimensions method.

/**
 * Custom codec that wraps the default Lucene100Codec and allows for vectors of
 * up to 2048 dimensions. Everything else is delegated to the default
 * Lucene99HnswVectorsFormat.
 */
public class CustomCodec extends FilterCodec {
  public CustomCodec() {
    super("CustomCodec", new Lucene100Codec());
  }

  @Override
  public KnnVectorsFormat knnVectorsFormat() {
    return new KnnVectorsFormat("CustomVectorsFormat") {
      private final KnnVectorsFormat delegate = new Lucene99HnswVectorsFormat();

      @Override
      public int getMaxDimensions(String fieldName) {
        return 2048;
      }

      @Override
      public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state);
      }

      @Override
      public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state);
      }
    };
  }
}

This codec must also be discoverable when the index is opened for searching. Lucene looks up codecs by name using the standard META-INF/services mechanism, so we register ours by creating a service file that contains the fully qualified name of the codec class.

File: resources/META-INF/services/org.apache.lucene.codecs.Codec
dev.giger.CustomCodec

Searching the index

The index can now be searched like any other Lucene index. We use KnnFloatVectorQuery to build a query for an embedding vector; the text query is first transformed into an embedding via the OpenAI API, just like the documents were during indexing.

The number of results depends on the amount of context that can be given to the final AI call. In this example, we will use the top 10 results.

List<SearchResult> searchIndex(String queryString) throws IOException {
  var indexDirectory = FSDirectory.open(Paths.get("./index/"));
  try (var indexReader = DirectoryReader.open(indexDirectory)) {
    var searcher = new IndexSearcher(indexReader);

    int numResults = 10;
    // Embed the query text and search for the nearest stored vectors
    var query = new KnnFloatVectorQuery(
      "contents-vector",
      computeEmbedding(queryString),
      numResults
    );

    var topDocs = searcher.search(query, numResults);
    var storedFields = searcher.storedFields();

    var result = new ArrayList<SearchResult>();
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
      var doc = storedFields.document(scoreDoc.doc);
      result.add(new SearchResult(
        scoreDoc.score,
        // Return the stored original text of the chunk
        doc.get("contents")
      ));
    }
    return result;
  }
}

record SearchResult(float score, String content) {}
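
For example, the retrieved chunks and their similarity scores can be inspected like this (just a quick way to see what will later be passed to the AI as context):

// Print the score and the first line of each retrieved chunk
for (var hit : searchIndex("What are the characteristics of the GIDO123XYZ microcontroller?")) {
  System.out.printf("%.3f  %s%n", hit.score(), hit.content().lines().findFirst().orElse(""));
}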

Adding the results to an AI query

We have now queried our Lucene index and received a list of the 10 most relevant document chunks. Using this context is as simple as passing the additional information along when querying the AI. Here is the code to send such a request to the OpenAI chat completions API:

String completion(String query, List<String> context) throws IOException {
  // The retrieved chunks are passed as knowledge in the system prompt
  var prompt = "You are an assistant that answers questions.\n\n"
    + "Answer the question based on the following knowledge:\n\n"
    + String.join("\n\n", context);

  var messages = List.of(
    new CompletionMessage("system", prompt),
    new CompletionMessage("user", query)
  );

  var requestBody = GSON.toJson(new CompletionRequest("gpt-4o-mini", messages));
  var response = Request.post("https://api.openai.com/v1/chat/completions")
    .addHeader("Content-Type", "application/json")
    .addHeader("Authorization", "Bearer " + API_KEY)
    .bodyString(requestBody, ContentType.APPLICATION_JSON)
    .execute();
  var completionResponse = GSON.fromJson(response.returnContent().asString(),
    CompletionResponse.class);
  // Return the text of the first (and only) choice
  return completionResponse.choices().getFirst().message().content();
}

record CompletionRequest(String model, List<CompletionMessage> messages) {}

record CompletionMessage(String role, String content) {}

record CompletionResponse(String id, String model, List<CompletionChoice> choices) {}

// @SerializedName maps the snake_case JSON field (import com.google.gson.annotations.SerializedName)
record CompletionChoice(int index, CompletionMessage message, @SerializedName("finish_reason") String finishReason) {}

Trying it out

All the previous code can be combined for a query like this:

String queryAI(String query) throws IOException {
  var context = searchIndex(query);
  return completion(query, context.stream().map(SearchResult::content).toList());
}
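
For completeness, a small entry point that ties indexing and querying together might look like this (loading the Markdown files from a ./docs/ directory is just an assumption for this example):

public static void main(String[] args) throws IOException {
  // Read all Markdown documents from a folder (hypothetical location)
  var documents = new ArrayList<String>();
  try (var files = Files.list(Paths.get("./docs/"))) {
    for (var path : files.toList()) {
      documents.add(Files.readString(path));
    }
  }

  // Build the index once, then answer questions against it
  createIndex(documents);
  System.out.println(queryAI("What are the characteristics of the GIDO123XYZ microcontroller?"));
}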

Since the OpenAI model already has quite extensive knowledge, it will be interesting to see how much the additional context from Lucene will improve the answers.

As an example, I will ask it about a made-up microcontroller that it can’t know about:

> What are the characteristics of the GIDO123XYZ microcontroller?

As of my last update in October 2023, there is no widely recognized microcontroller specifically named "GIDO123XYZ."
It's possible that it may be a new or niche product that was released after my training data was compiled, or it could
be a fictional or hypothetical example.

Now we can add some information about this specific microcontroller to our database. We just add another document with the following content:

# GIDO123XYZ Microcontroller

Advanced AI capabilities, 123 teraflops of processing power, 5G connectivity, 10-year battery life, supports quantum
computing, built-in security features. Perfectly suited for AI-powered IoT devices.

Let’s ask the AI again with this added context:

> What are the characteristics of the GIDO123XYZ microcontroller?

The GIDO123XYZ microcontroller has the following characteristics:

- Advanced AI capabilities
- 123 teraflops of processing power
- 5G connectivity
- 10-year battery life
- Support for quantum computing
- Built-in security features

It is perfectly suited for AI-powered IoT devices.

Of course this is a very simple example, but it shows that the AI can easily use the added context to generate a more accurate answer.