The work on the Large Language Model (LLM) bot so far has covered running an LLM locally using Ollama, switching models (from tinyllama to gemma) whilst introducing LangChain, and then moving to LangChain templates.

Note: If you skipped the previous blog posts, I’m following along with Real Python’s “Build an LLM RAG Chatbot….” tutorial. My blog posts are there to help with my understanding, keep track of any side paths I head down, and note any adjustments I make (e.g. using Ollama instead of OpenAI, different Python libraries, etc.).

LangChain templates use a context to give the bot extra information to draw on when it generates an answer. Previously I used this to feed in the current weather conditions, but it comes into play a lot more when building a bot with Retrieval Augmented Generation (RAG) abilities.
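As a quick reminder of the idea, here is a minimal sketch of a template with a {context} placeholder (the weather string and question are made up purely for illustration):

from langchain.prompts import PromptTemplate

# A template with a {context} placeholder that extra information can be slotted into
template = PromptTemplate(
    input_variables=['context', 'question'],
    template='Use the following context when answering.\n\nContext: {context}\n\nQuestion: {question}',
)

# The weather string and the question are example values only
print(template.format(context='Currently 12C with light rain', question='Do I need a coat today?'))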

Retrieval Augmented Generation (RAG)

During the prompt phase the prompt context can be used to pass documents to the bot, so that the LLM can draw on those documents when generating an answer. This gives the bot the LLM’s training (currently gemma running via Ollama in my case) alongside potentially confidential or proprietary information supplied as documents in the context. With RAG enabled it becomes a bot that can run offline against a local LLM, with access to confidential / proprietary information that never needs to be sent to a service provider.

RAG is split into two phases: document retrieval and answer formulation.

Document retrieval can use a database (e.g. a vector database or a keyword table index) built from sources such as comma separated values (CSV) files.

The human question (i.e. whatever a human is asking the bot) is encoded, generally after some preprocessing such as stemming, and a similar process is applied to the documents. The bot then compares the resulting vectors to see which are closest in Euclidean distance. These vectors are stored within a vector based database.
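A toy sketch of the “closest in Euclidean distance” idea (the three-dimensional vectors below are made up; real embeddings have hundreds of dimensions):

import math

def euclidean_distance(a, b):
    # Straight-line distance between two vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

question_vector = [0.9, 0.1, 0.3]        # made-up embedding of the question
document_vectors = {
    'review_1': [0.8, 0.2, 0.25],        # close to the question
    'review_2': [0.1, 0.9, 0.7],         # far from the question
}

# The closest document (smallest distance) is the one that gets retrieved
closest = min(document_vectors,
              key=lambda name: euclidean_distance(question_vector, document_vectors[name]))
print(closest)  # review_1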

RAG via ChromaDB – Retriever

A retriever is needed to retrieve the document(s), vectorise the word values, and store them in a vector based database. The Real Python guide uses ChromaDB for the vector based database, and their tutorial includes a CSV full of customer reviews at a hospital.

I had to make some changes here as I’m not using OpenAI. I also stored REVIEWS_CSV_PATH and REVIEWS_CHROMA_PATH in my environment variables (.env file).
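For reference, the relevant entries in my .env file look something like this (the paths, model name and URL below are just example values):

REVIEWS_CSV_PATH=data/reviews.csv
REVIEWS_CHROMA_PATH=chroma_data/
LLM_MODEL=gemma
LLM_URL=http://localhost:11434

With those in place, the retriever script looks like this: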

import dotenv
import os
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import CSVLoader
from langchain_ollama import OllamaEmbeddings

dotenv.load_dotenv()

REVIEWS_CSV_PATH = os.getenv('REVIEWS_CSV_PATH')
REVIEWS_CHROMA_PATH = os.getenv('REVIEWS_CHROMA_PATH')

# Load the hospital reviews CSV, with the 'review' column recorded as each document's source
loader = CSVLoader(REVIEWS_CSV_PATH, source_column='review')
reviews = loader.load()

# Embed each review with Ollama and persist the vectors to a Chroma database on disk
reviews_vector_db = Chroma.from_documents(
    reviews,
    OllamaEmbeddings(model=os.getenv('LLM_MODEL'),
                     base_url=os.getenv('LLM_URL')),
    persist_directory=REVIEWS_CHROMA_PATH,
)

The Real Python guide imported CSVLoader from langchain.document_loaders.csv_loader, however I got a deprecation warning when doing this so I switched to importing it from langchain_community.document_loaders.

Running the above creates the vector database on disk so that it can then be used by the bot.
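As a quick sanity check (just a sketch, with a made-up question), the persisted database can be loaded back and queried directly with a similarity search:

import dotenv
import os
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

dotenv.load_dotenv()

# Re-open the persisted Chroma database with the same embeddings used to build it
reviews_vector_db = Chroma(
    persist_directory=os.getenv('REVIEWS_CHROMA_PATH'),
    embedding_function=OllamaEmbeddings(model=os.getenv('LLM_MODEL'),
                                        base_url=os.getenv('LLM_URL')),
)

# Return the 3 reviews whose embeddings are closest to the question's embedding
relevant_docs = reviews_vector_db.similarity_search('Did anyone complain about waiting times?', k=3)
print(relevant_docs[0].page_content)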

Bot With RAG Abilities

As with the retriever I made a few changes here: the bot uses my locally running Ollama instance, OllamaEmbeddings are used instead of OpenAI embeddings, and Chroma is imported from langchain_community.

import dotenv
import os
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain.prompts import (
    PromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    ChatPromptTemplate,
)
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import Chroma
from langchain.schema.runnable import RunnablePassthrough

dotenv.load_dotenv()

chat_model = OllamaLLM(model=os.getenv('LLM_MODEL'),
                       base_url=os.getenv('LLM_URL'))

review_template_str = """Your job is to use patient reviews 
    to answer questions about their experience at a hospital. 
    Use the following context to answer questions. 
    Be as detailed as possible, but don't make up any information 
    that's not from the context. 
    If you don't know an answer, say you don't know.

{context} """

review_system_prompt = SystemMessagePromptTemplate(
    prompt=PromptTemplate(
        input_variables = ["context"],
        template=review_template_str,
    )
)

review_human_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        input_variables = ["question"],
        template="{question}",
    )
)

messages = [review_system_prompt, review_human_prompt]

review_prompt_template = ChatPromptTemplate(
    input_variables = ["context", "question"],
    messages=messages,
)

# Re-open the persisted Chroma database, using the same Ollama embeddings as the retriever script
reviews_vector_db = Chroma(
    persist_directory=os.getenv('REVIEWS_CHROMA_PATH'),
    embedding_function=OllamaEmbeddings(
        model=os.getenv('LLM_MODEL'),
        base_url=os.getenv('LLM_URL'),
    )
)

# Return the 10 reviews whose vectors are closest to the question's vector
reviews_retriever = reviews_vector_db.as_retriever(search_kwargs={'k': 10})

# Chain the retrieved reviews (context) and the user's question through the prompt, the model, and a string parser
review_chain = (
    {"context": reviews_retriever, "question": RunnablePassthrough()}
    | review_prompt_template
    | chat_model
    | StrOutputParser()
)

The above introduces a reviews_retriever that retrieves reviews from the vector database. The k=10 (passed via search_kwargs) is important and is tied into vector similarity. The user’s question is vectorised, and the 10 reviews closest (in vector terms) to the user’s question are retrieved. If k=5 was used then it would be the 5 closest reviews instead.
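The retriever can also be invoked on its own to see what the chain will receive as context (the question here is just an example):

relevant_reviews = reviews_retriever.invoke('Were the staff friendly?')
print(len(relevant_reviews))             # up to 10 Document objects
print(relevant_reviews[0].page_content)  # the closest matching review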

The review_chain has been edited so that it now passes in the reviews_retriever as the context. This means the bot is now using the vector database (containing the vectorised CSV document) as the context for any questions it is asked, allowing it to answer with detail from the CSV document.
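Asking the bot a question now runs the whole chain: the retriever pulls in the closest reviews as context, the prompt template combines them with the question, and the model generates the answer (the question below is made up):

question = 'Did any patients mention the cleanliness of the hospital?'
print(review_chain.invoke(question))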