Experimenting with Retrieval-Augmented Generation (RAG) for Thai Content


Welcome back to my blog! In my latest Generative AI experiment, I delved into the exciting world of Retrieval-Augmented Generation (RAG). This time, I aimed to test how well an LLM could process and understand Thai content extracted from a PDF document. Here's a step-by-step breakdown of my experiment and how you can replicate it.

Objectives

  1. Load and process a PDF document containing Thai content.
  2. Vectorize the document and store the embeddings in Qdrant.
  3. Query the document and interact with the LLM to receive responses in Thai.

Tools and Libraries

For this experiment, I used the following tools and libraries:

  • LangChain Community Document Loaders: To load and split the PDF document.
  • HuggingFace Embeddings: For generating embeddings from the document content.
  • Qdrant: As the vector database to store the document embeddings.
  • ChatOpenAI: To interact with the LLM and generate responses.
  • Gradio: For creating a user-friendly interface to upload PDFs and chat with the LLM.

Code Breakdown

Step 1: Setting Up the Environment

First, ensure you have the necessary libraries installed. Here are the imports we'll use:


from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.messages import HumanMessage, AIMessage
import gradio as gr

Step 2: Loading and Splitting the PDF Document

I had to choose between RecursiveCharacterTextSplitter and SentenceTransformersTokenTextSplitter. Since my objective is to process Thai content, I opted to explore SentenceTransformersTokenTextSplitter with the model paraphrase-multilingual-mpnet-base-v2, which supports multilingual text.


def gradio_vectorize_and_store(pdf_file):
    # gr.File is configured with type="filepath", so pdf_file arrives as a path string
    text_splitter = SentenceTransformersTokenTextSplitter(
        tokens_per_chunk=128,
        chunk_overlap=20,
        model_name="paraphrase-multilingual-mpnet-base-v2",
    )
    loader = PyMuPDFLoader(pdf_file)
    docs = loader.load()
    splits = text_splitter.split_documents(docs)

    # Embed the chunks with the same multilingual model and store them in Qdrant
    embeddings = HuggingFaceEmbeddings(model_name="paraphrase-multilingual-mpnet-base-v2")
    url = "http://localhost:6333"
    qdrant = Qdrant.from_documents(
        splits,
        embeddings,
        url=url,
        prefer_grpc=True,
        collection_name="my_documents",
        force_recreate=True,
    )
    return "Processed and stored the document in the vector database successfully!"

Step 3: Querying the Document

Again, I chose paraphrase-multilingual-mpnet-base-v2 to vectorize the document splits due to its robust support for semantic search. This means even if the content stored in the vector database is in Thai, you can search in English and still get semantically relevant results.
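
To see this cross-lingual retrieval directly, you can run a plain similarity search against the stored collection before involving the LLM at all. Here is a minimal sketch, assuming Qdrant is running on localhost:6333 and the my_documents collection has already been populated; the English question is just an example:

client = QdrantClient("http://localhost:6333")
embeddings = HuggingFaceEmbeddings(model_name="paraphrase-multilingual-mpnet-base-v2")
qdrant = Qdrant(client, "my_documents", embeddings)

# English query against Thai content; the multilingual embeddings bridge the two languages
for doc in qdrant.similarity_search("What were the company's revenues in 2023?", k=3):
    print(doc.page_content[:200])
    print("---")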

RAG here is a two-step process. First, the user's latest question and the past conversation are passed to the LLM to reformulate a standalone question, which is then used to query the vector database for relevant documents. Then, the retrieved documents are fed into the LLM to generate the final answer. The LLM I used is TheBloke/Mistral-7B-Instruct-v0.2-GGUF, served locally through LM Studio. I also tested lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF, but Mistral provided more meaningful responses without "garbage." There might be ways to fine-tune this, but that is not our main objective at the moment.


def gradio_query(query, history):
    # Reconnect to the Qdrant collection created in the vectorize step
    url = "http://localhost:6333"
    client = QdrantClient(url)
    embeddings = HuggingFaceEmbeddings(model_name="paraphrase-multilingual-mpnet-base-v2")
    collection_name = "my_documents"
    qdrant = Qdrant(client, collection_name, embeddings)

    # MMR retrieval: fetch 25 candidates, return the 5 most relevant yet diverse chunks
    retriever = qdrant.as_retriever(search_type="mmr", search_kwargs=dict(k=5, fetch_k=25))

    # Local LLM served through LM Studio's OpenAI-compatible endpoint
    llm = ChatOpenAI(
        base_url="http://localhost:1234/v1",
        api_key="lm-studio",
        model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
        temperature=0.7,
        streaming=True,
    )

    contextualize_q_system_prompt = """Reformulate the latest user question into a standalone question that does not require previous context. Do NOT answer the question; only rephrase it if needed."""
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    history_aware_retriever = create_history_aware_retriever(
        llm, retriever, contextualize_q_prompt
    )

    qa_system_prompt = """You are an assistant for question-answering tasks. Use the provided context to answer the question. If you don't know the answer, say so. Keep the answer concise, using a maximum of three sentences.
    {context}"""
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

    # Rebuild the chat history from Gradio's (user, bot) tuples as LangChain messages
    chat_history = []
    for (question, answer) in history:
        chat_history.extend([HumanMessage(content=question), AIMessage(content=answer)])
    
    ai_msg = rag_chain.invoke({"input": query, "chat_history": chat_history})

    history.append((query, ai_msg["answer"]))
    return "", history

Step 4: Building the Gradio Interface

The interface has two tabs: one for uploading the PDF and one for chatting. You upload a file, then chat with the document you just uploaded; upload a different document, and the chat draws on that new knowledge.


with gr.Blocks() as app:
    with gr.Tab("Vectorize and Store"):
        gr.Markdown("## Upload PDF File to Vectorize and Store")
        pdf_file_input = gr.File(label="Upload PDF File", type="filepath")
        vectorize_button = gr.Button("Vectorize and Store")
        vectorize_output = gr.Textbox(label="Status")

        vectorize_button.click(gradio_vectorize_and_store, inputs=pdf_file_input, outputs=vectorize_output)

    with gr.Tab("Chat with Me"):
        gr.Markdown("## Message to Chat with Me")
        chatbot = gr.Chatbot()
        msg = gr.Textbox()
        clear = gr.ClearButton([msg, chatbot])

        msg.submit(gradio_query, [msg, chatbot], [msg, chatbot])

app.launch()

Conclusion

First, I downloaded MFEC's 2023 annual report, which is written in Thai. Then I uploaded the file, vectorized it, and stored its content in the vector database. After that, I switched to the chat tab and asked questions. That's it! When I directed the LLM to always answer in Thai, it could respond in Thai but not very well, so I let it answer in English instead. The responses from the LLM were relevant both to the question and to the content in the vector database. Thanks to paraphrase-multilingual-mpnet-base-v2, it handled Thai quite well in semantic Q&A. If we had an LLM that could converse fluently in Thai, it would be perfect.
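
If you want to try forcing Thai answers yourself, the only change needed is the QA system prompt inside gradio_query. A minimal variation, with wording that is my own illustration, would be:

# Illustrative variation of qa_system_prompt that asks the model to answer in Thai
qa_system_prompt = """You are an assistant for question-answering tasks. Use the provided context to answer the question. If you don't know the answer, say so. Always answer in Thai and keep the answer concise, using a maximum of three sentences.
{context}"""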

Happy experimenting, and stay tuned for more AI adventures!
