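RAG Best Practices
This document summarizes best practices for building effective RAG (retrieval-augmented generation) systems.
Document Processing Best Practices
1. Document Chunking Strategy
Smart chunking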
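from langchain.text_splitter import RecursiveCharacterTextSplitter

# Choose a chunking strategy based on the document type
def get_text_splitter(doc_type: str):
    if doc_type == "code":
        # Smaller chunks for code; split on blank lines first, then lines, then words
        return RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
    elif doc_type == "markdown":
        # Larger chunks for Markdown; prefer splitting at level-2 and level-3 headings
        return RecursiveCharacterTextSplitter(
            chunk_size=2000,
            chunk_overlap=300,
            separators=["\n## ", "\n### ", "\n", " "]
        )
    else:
        # Any other document type: library default separators
        return RecursiveCharacterTextSplitter(
            chunk_size=1500,
            chunk_overlap=200
        )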
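To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a deliberate simplification and not how RecursiveCharacterTextSplitter works internally (that class also tries the separators in order); the function name sliding_window_chunks is hypothetical and used only for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: no separator awareness.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

With chunk_size=4 and chunk_overlap=2, each chunk repeats the last two characters of the previous one, so context that straddles a boundary appears in both neighbors. A larger overlap lowers the risk of cutting a sentence away from its context, at the cost of more chunks to embed and store.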
2. Metadata Management
Rich metadata
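from datetime import datetime

def create_chunk_with_metadata(text: str, source: str, page: int,
                               chunk_index: int = 0, doc_type: str = "pdf"):
    # Attach provenance metadata so a retrieved chunk can be traced and filtered.
    # chunk_index and doc_type default to the values hard-coded in the original.
    return {
        "text": text,
        "metadata": {
            "source": source,
            "page": page,
            "chunk_index": chunk_index,
            "timestamp": datetime.now().isoformat(),
            "doc_type": doc_type
        }
    }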
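When a page is split into several chunks, each chunk usually needs its own position in the metadata rather than a constant. The sketch below shows one way to do that with enumerate; the helper name build_chunk_records and its parameters are assumptions made for illustration, not part of any library.

```python
from datetime import datetime
from typing import Any, Dict, List


def build_chunk_records(chunks: List[str], source: str, page: int,
                        doc_type: str = "pdf") -> List[Dict[str, Any]]:
    """Wrap each chunk of one page in a record with per-chunk metadata.

    Hypothetical helper: the record layout mirrors the dictionary
    used in this section.
    """
    # One timestamp for the whole batch so chunks from the same run match
    timestamp = datetime.now().isoformat()
    records = []
    for index, text in enumerate(chunks):
        records.append({
            "text": text,
            "metadata": {
                "source": source,
                "page": page,
                "chunk_index": index,  # position of this chunk on the page
                "timestamp": timestamp,
                "doc_type": doc_type,
            },
        })
    return records


if __name__ == "__main__":
    for record in build_chunk_records(["first part", "second part"], "manual.pdf", page=3):
        print(record["metadata"]["chunk_index"], record["text"])
```

Keeping chunk_index accurate makes it possible to fetch the neighbors of a retrieved chunk, or to restore the original order when several chunks from the same page are returned.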
Retrieval Best Practices
1. Hybrid Retrieval Strategy
Combining multiple retrieval methods
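# Import paths follow the original; newer LangChain releases expose
# BM25Retriever and FAISS from langchain_community instead.
# BM25Retriever also requires the rank_bm25 package.
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

def create_hybrid_retriever(vectorstore, documents):
    # Vector (semantic) retrieval
    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

    # Keyword retrieval
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = 5

    # Hybrid retrieval: weight vector results at 0.7 and keyword results at 0.3
    ensemble_retriever = EnsembleRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        weights=[0.7, 0.3]
    )
    return ensemble_retriever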
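To give an intuition for what the weights do, the following is a simplified sketch of weighted Reciprocal Rank Fusion, the general technique that, to my understanding, EnsembleRetriever is based on. It is not the library's code: the function name weighted_rrf, the use of plain string IDs, and the constant c=60 are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[Sequence[str]], weights: Sequence[float],
                 c: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores weight / (c + rank) in every list that contains it
    (rank starts at 1); the scores are summed across lists.
    """
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")

    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
    keyword_hits = ["c", "a", "d"]  # hypothetical BM25 ranking
    print(weighted_rrf([vector_hits, keyword_hits], weights=[0.7, 0.3]))
```

In this example a document found by both retrievers ("a", "c") outranks one found only by the higher-weighted retriever ("b"), which is the main reason to combine semantic and keyword search. The weights only decide how much each retriever's rank position contributes, so they are worth tuning against a small set of representative queries.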