Retrieval-Augmented Generation (RAG) has gone from research paper to production pattern faster than almost any other AI technique. But most tutorials stop at "it works on my machine." This guide covers what it actually takes to ship a RAG chatbot to production.
What We're Building
A FastAPI backend with:
- Document ingestion pipeline (PDF, web pages, plain text)
- Qdrant vector store for semantic search
- LangChain RAG chain with Groq LLM for fast inference
- Streaming responses via Server-Sent Events
- Conversation memory with Redis
Step 1: Document Ingestion
The quality of your RAG system lives or dies by how you chunk your documents. Don't use the default 1000-character splitter and call it a day.
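from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default
    chunk_overlap=64,   # characters shared between neighboring chunks
    # Try paragraph breaks first, then lines, sentences, words, characters
    separators=["\n\n", "\n", ". ", " ", ""],
)
# `docs` is assumed to be the list of Documents produced by your loaders
chunks = splitter.split_documents(docs)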
The chunk_overlap is critical: without it, answers that span chunk boundaries get cut off.
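To make the boundary problem concrete, here is a minimal sketch using a plain fixed-size sliding window. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter also respects separators). The chunk_text helper and the sample sentence are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample text; the fact we want to retrieve is "30 days".
text = "The refund window is 30 days. Contact support to start a return."
fact = "30 days"

no_overlap = chunk_text(text, chunk_size=24, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=24, chunk_overlap=8)

# Without overlap the fact is split across two chunks and appears in neither.
fact_survives_without_overlap = any(fact in c for c in no_overlap)
# With overlap, one chunk contains the fact in full.
fact_survives_with_overlap = any(fact in c for c in with_overlap)
```

The sizes here are tiny so the effect is visible in one sentence. With real documents the same thing happens at 512 characters, just less obviously.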
Step 2: Embedding and Storage
We use text-embedding-3-small from OpenAI (or a local model if you want zero API costs):
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import Qdrant

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base",
)
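If it helps to see what the vector store is doing at query time, the sketch below ranks stored chunks by cosine similarity to a query vector. It assumes cosine distance and uses made-up 3-dimensional vectors. Real embeddings have far more dimensions, and Qdrant uses an approximate index rather than this brute-force loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical chunks with hand-written vectors standing in for embeddings.
indexed = [
    ("Refunds are processed within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Returns require the original receipt.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded refund question

results = top_k(query_vec, indexed, k=2)
```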
Step 3: The RAG Chain
from langchain_groq import ChatGroq
from langchain.chains import ConversationalRetrievalChain

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0.1)

retriever = vectorstore.as_retriever(
    search_type="mmr",  # Max marginal relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20},
)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
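The mmr search type above fetches fetch_k candidates and then picks k of them, trading relevance against redundancy. The sketch below is one common formulation of that greedy selection, written in plain Python. It is not LangChain's or Qdrant's implementation, and the vectors and texts are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Greedy max marginal relevance over (text, vector) candidates.

    Each round picks the candidate maximizing
    lambda_mult * relevance - (1 - lambda_mult) * redundancy,
    where redundancy is the highest similarity to anything already picked.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for text, vec in remaining:
            relevance = cosine(query_vec, vec)
            redundancy = max((cosine(vec, s_vec) for _, s_vec in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = (text, vec), score
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


# Two near-duplicate chunks and one chunk covering a different aspect.
candidates = [
    ("Refunds take 30 days.", [1.0, 0.12]),
    ("Refunds take thirty days.", [1.0, 0.10]),
    ("Refunds need a receipt.", [0.05, 1.0]),
]
query_vec = [1.0, 1.0]

# Plain similarity ranking returns both near-duplicates.
by_similarity = [
    text
    for text, vec in sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)
][:2]
# MMR keeps the best match, then prefers the chunk that adds new information.
by_mmr = mmr(query_vec, candidates, k=2)
```

With k=5 and fetch_k=20, the same idea applies at a larger scale: the context window gets five chunks that cover different ground instead of five paraphrases of the same paragraph.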
Step 4: Streaming with FastAPI
Users hate waiting. Stream the response token by token:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # The chain expects chat_history as well; pass stored turns here
        inputs = {"question": request.message, "chat_history": []}
        async for chunk in chain.astream(inputs):
            if "answer" in chunk:
                yield f"data: {chunk['answer']}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
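One detail worth guarding against: in Server-Sent Events a blank line ends an event, so a chunk that itself contains newlines would break the framing if it were written after a single data: prefix. A minimal sketch of two ways to handle this follows. The helper names sse_event and sse_json_event are hypothetical, and the sketch assumes payloads use "\n" as the line separator.

```python
import json


def sse_event(data: str) -> str:
    """Frame a payload as one SSE event.

    Each line of the payload gets its own data: field; the client joins
    them back together with newlines. The trailing blank line ends the event.
    """
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def sse_json_event(payload: dict) -> str:
    """Alternative: JSON-encode the payload so newlines are escaped."""
    return f"data: {json.dumps(payload)}\n\n"


single_line = sse_event("Hello")
multi_line = sse_event("Step 1\nStep 2")
as_json = sse_json_event({"token": "Step 1\nStep 2"})
```

Inside generate(), this would mean yielding sse_event(chunk["answer"]) instead of formatting the string by hand. The JSON variant is handy if the client also needs metadata such as source citations alongside each piece of text.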
Production Checklist
Before you deploy:
- Rate limiting: Add Redis-based rate limiting per user/IP
- Prompt injection: Sanitize inputs, use a system prompt that defines scope
- Hallucination guard: Add source citations and instruct the model to say "I don't know" when the retrieved context doesn't contain the answer
- Monitoring: Log every query/response pair for fine-tuning and debugging
- Caching: Cache embeddings for repeated queries with Redis
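As an illustration of the rate limiting item, here is a minimal fixed-window limiter. It keeps counters in an in-memory dict as a stand-in for Redis, where a counter per client and window would typically be maintained with INCR and EXPIRE. The class name, limits, and injectable clock are assumptions made for this sketch, and an in-memory dict only works for a single process.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._counts: dict[tuple[str, int], int] = {}

    def allow(self, client_id: str) -> bool:
        window_index = int(self.clock() // self.window_seconds)
        # Drop counters from earlier windows so the dict does not grow forever.
        self._counts = {
            key: count for key, count in self._counts.items() if key[1] == window_index
        }
        key = (client_id, window_index)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.limit


# Usage with a fake clock: 2 requests per 60-second window per client.
now = [0.0]
limiter = FixedWindowRateLimiter(limit=2, window_seconds=60, clock=lambda: now[0])

first = limiter.allow("user-1")
second = limiter.allow("user-1")
third = limiter.allow("user-1")        # over the limit in this window
other_user = limiter.allow("user-2")   # separate counter per client
now[0] = 61.0                          # move into the next window
after_reset = limiter.allow("user-1")
```

In the /chat endpoint, a rejected request would typically be answered with HTTP 429 before the chain is ever invoked, so over-limit clients cost neither embedding nor LLM calls.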
Results We've Seen
In our HealthSync project, this exact architecture reduced support tickets by 45% within 60 days of launch. The key was aggressive prompt engineering to keep the chatbot on-topic and citation of source documents to build user trust.
Have questions about building your own RAG system? Talk to our AI team.

Written by Hariom Patil
Lead Frontend Engineer at Axom Infotech

