Building a Production RAG Chatbot with FastAPI and LangChain | Axom Infotech

Retrieval-Augmented Generation (RAG) has gone from research paper to production pattern faster than almost any other AI technique. But most tutorials stop at "it works on my machine." This guide covers what it actually takes to ship a RAG chatbot to production.

What We're Building

A FastAPI backend with:

Document ingestion pipeline (PDF, web pages, plain text)
Qdrant vector store for semantic search
LangChain RAG chain with Groq LLM for fast inference
Streaming responses via Server-Sent Events
Conversation memory with Redis

Step 1: Document Ingestion

The quality of your RAG system lives or dies by how you chunk your documents. Don't use the default 1000-character splitter and call it a day.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " ", ""],
)

The chunk_overlap is critical: without it, answers that span chunk boundaries get cut off.
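To see the effect of overlap in isolation, here is a minimal sketch in plain Python. It uses simple fixed-size windows, which is not how RecursiveCharacterTextSplitter works internally (that splitter also honors the separator list). The function name chunk_with_overlap, the sample sentence, and the tiny sizes are assumptions chosen for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "The refund window is 30 days. Contact support to start a return."

# Without overlap, the phrase "30 days" is split across two chunks.
no_overlap = chunk_with_overlap(text, chunk_size=24, chunk_overlap=0)

# With overlap, the second chunk repeats the tail of the first,
# so "30 days" survives intact in at least one chunk.
with_overlap = chunk_with_overlap(text, chunk_size=24, chunk_overlap=8)
```

In this toy case a retriever searching for the refund period could only match the complete phrase in the overlapping version.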

Step 2: Embedding and Storage

We use text-embedding-3-small from OpenAI (or a local model if you want zero API costs):

from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import Qdrant

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

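# chunks: the list of Document objects produced by the splitter in Step 1,
# e.g. via splitter.split_documents(...)
vectorstore = Qdrant.from_documents(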
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base",
)

Step 3: The RAG Chain

from langchain_groq import ChatGroq
from langchain.chains import ConversationalRetrievalChain

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0.1)

retriever = vectorstore.as_retriever(
    search_type="mmr",  # Max marginal relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20},
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
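The "mmr" search type trades a little relevance for diversity, so the k results are not five near-identical chunks. The sketch below shows the usual MMR scoring rule in plain Python on toy 2-D vectors. It is an illustration of the idea, not the code LangChain or Qdrant runs; the names cosine and mmr_select, the lambda_mult weight of 0.5, and the vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of up to k candidates, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: most relevant
    [1.0, 0.12],  # 1: near-duplicate of 0
    [1.0, -1.0],  # 2: less relevant, but different
]

picked = mmr_select(query, candidates, k=2)

# Plain top-k by relevance, for comparison.
top_k = sorted(range(len(candidates)),
               key=lambda i: cosine(query, candidates[i]),
               reverse=True)[:2]
```

Here plain similarity ranking returns the two near-duplicates, while MMR keeps the best match and swaps the duplicate for the distinct chunk. In the retriever above, fetch_k sets how many candidates are pulled before this kind of re-ranking narrows them to k.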

Step 4: Streaming with FastAPI

Users hate waiting. Stream the response token by token:

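from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # ConversationalRetrievalChain expects chat_history alongside the question;
        # the empty list is a placeholder until Redis-backed memory is wired in.
        inputs = {"question": request.message, "chat_history": []}
        # Depending on the LangChain version, this chain may emit the answer
        # as one chunk rather than token by token.
        async for chunk in chain.astream(inputs):
            if "answer" in chunk:
                yield f"data: {chunk['answer']}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")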
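One detail worth guarding against: the handler above writes each answer into a single data: line, and a Server-Sent Events frame ends at the first blank line. An answer containing newlines would therefore break the framing. The sketch below shows two common ways around that. The helper names sse_event and sse_json and the {"token": ...} payload shape are assumptions, not part of FastAPI or LangChain.

```python
import json
from typing import Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one SSE frame, emitting one 'data:' line per line of text."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def sse_json(payload: dict) -> str:
    """Send a JSON payload; json.dumps escapes newlines, so it fits on one line."""
    return sse_event(json.dumps(payload))


frame = sse_event("line one\nline two")
token_frame = sse_json({"token": "Hello\nworld"})
done_frame = sse_event("[DONE]", event="end")
```

With the JSON variant, the client parses each event's data with JSON.parse and gets the original text back, newlines included.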

Production Checklist

Before you deploy:

Rate limiting: Add Redis-based rate limiting per user/IP
Prompt injection: Sanitize inputs, use a system prompt that defines scope
Hallucination guard: Add source citations and instruct the model to say "I don't know" when the retrieved context doesn't contain the answer
Monitoring: Log every query/response pair for fine-tuning and debugging
Caching: Cache embeddings for repeated queries with Redis
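As a rough sketch of the rate limiting and caching items, the code below uses in-memory structures as a stand-in for Redis so it runs without a server. The class SlidingWindowRateLimiter, the function embedding_cache_key, the "emb:" key prefix, and the limits are all hypothetical. In production the same logic would sit on top of Redis so that every API worker shares one set of counters.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowRateLimiter:
    """Allow at most max_requests per window_seconds for each key (user ID or IP)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def embedding_cache_key(text: str, model: str) -> str:
    """Build a stable cache key from the embedding model and the normalized query."""
    # Assumption: case and extra whitespace are treated as insignificant.
    normalized = " ".join(text.lower().split())
    digest = hashlib.sha256(f"{model}:{normalized}".encode("utf-8")).hexdigest()
    return f"emb:{digest}"


limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60)
# Timestamps are passed explicitly here only to make the example deterministic.
results = [limiter.allow("user-1", now=t) for t in (0.0, 1.0, 2.0, 61.0)]

key_a = embedding_cache_key("What is  RAG?", "text-embedding-3-small")
key_b = embedding_cache_key("what is rag?", "text-embedding-3-small")
```

Including the model name in the cache key means that switching embedding models cannot return stale vectors from the old one.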

Results We've Seen

In our HealthSync project, this exact architecture reduced support tickets by 45% within 60 days of launch. The key was aggressive prompt engineering to keep the chatbot on-topic and citation of source documents to build user trust.
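For illustration only, a scoped prompt with numbered citations might be assembled along these lines. This is not the prompt used in the HealthSync project. The SourceChunk type, the build_grounded_prompt function, the wording of the instructions, and the sample data are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SourceChunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    text: str
    source: str


def build_grounded_prompt(question: str, chunks: List[SourceChunk]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{i}] ({chunk.source})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered context below. "
        "Cite the sources you used, like [1]. "
        "If the context does not contain the answer, say \"I don't know\".\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long is the refund window?",
    [
        SourceChunk("Refunds are accepted within 30 days.", "faq.md"),
        SourceChunk("Contact support to start a return.", "returns.md"),
    ],
)
```

Because the numbers in the prompt map back to source names, the API can return those sources alongside the answer for the UI to display.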

Have questions about building your own RAG system? Talk to our AI team.

Written by Hariom Patil

Lead Frontend Engineer at Axom Infotech
