From Zero to One: Embedding-Driven Semantic Search
This tutorial builds a local semantic search service from scratch using BGE embeddings and Chroma.
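Project Overview
Features
- Document import and chunking
- Vectorization with BGE
- Storage and retrieval with Chroma
- Results returned ranked by similarity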
Tech Stack
- Embedding: BGE-large-zh-v1.5
- Vector store: Chroma
- Chunking: LangChain RecursiveCharacterTextSplitter
Step 1: Environment Setup
mkdir semantic-search && cd semantic-search
python -m venv venv
source venv/bin/activate
pip install sentence-transformers chromadb langchain langchain-community langchain-text-splitters
Step 2: Document Loading and Chunking
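# Newer LangChain releases ship loaders and splitters in separate packages:
# pip install langchain-community langchain-text-splitters
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Read the file as UTF-8 so Chinese text decodes correctly on every platform
loader = TextLoader("docs.txt", encoding="utf-8")
docs = loader.load()
# chunk_size and chunk_overlap are measured in characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)
splits = splitter.split_documents(docs)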
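To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the algorithm RecursiveCharacterTextSplitter uses (that splitter first tries to break on separators such as blank lines and newlines). The function name chunk_text is hypothetical and exists only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it ignores paragraph and sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(ord("a") + i % 26) for i in range(600))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.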
Step 3: Vectorization and Storage
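from sentence_transformers import SentenceTransformer
import chromadb
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
# chromadb.Client() keeps data in memory only; it is lost when the process exits
client = chromadb.Client()
# Use cosine distance for the HNSW index instead of the default L2
collection = client.get_or_create_collection("docs", metadata={"hnsw:space": "cosine"})
texts = [s.page_content for s in splits]
# Passages are encoded as-is; only queries get the retrieval instruction prefix
embeddings = model.encode(texts, normalize_embeddings=True).tolist()
ids = [f"doc_{i}" for i in range(len(splits))]
collection.add(ids=ids, embeddings=embeddings, documents=texts)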
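The "hnsw:space": "cosine" setting tells Chroma to compare vectors by cosine distance, which is 1 minus cosine similarity. The following pure-Python sketch shows that calculation on toy vectors. The function name cosine_distance is an illustrative assumption, and the vectors are made up, not real BGE embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal
    print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Because cosine distance ignores vector length, only the direction of an embedding affects the ranking.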
Step 4: Retrieval
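query = "什么是 RAG"  # "What is RAG?"
# BGE zh models expect this Chinese instruction prefix on queries (not on passages)
instruction = "为这个句子生成表示以用于检索相关文章："
query_emb = model.encode(instruction + query, normalize_embeddings=True).tolist()
results = collection.query(query_embeddings=[query_emb], n_results=5)
# results["documents"] holds one list of hits per query, so take index 0
print(results["documents"][0])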
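Chroma returns query results as parallel lists of lists, with one inner list per query. A small helper can flatten them into ranked hits with a similarity score. This is a sketch: the helper name top_hits and the sample dictionary are hypothetical, and the sample values are invented to mimic the result shape rather than taken from a real run.

```python
from typing import Any, Dict, List, Tuple


def top_hits(results: Dict[str, Any], query_index: int = 0) -> List[Tuple[str, str, float]]:
    """Flatten a Chroma-style query result into (id, document, similarity) tuples.

    Assumes the collection uses cosine distance, so similarity = 1 - distance.
    """
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    hits = [(doc_id, doc, 1.0 - dist) for doc_id, doc, dist in zip(ids, docs, dists)]
    # Highest similarity first; Chroma already sorts by ascending distance,
    # so this sort only makes the ordering explicit.
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits


if __name__ == "__main__":
    # Invented sample that mimics the result shape for a single query
    sample = {
        "ids": [["doc_3", "doc_0"]],
        "documents": [["first chunk text", "second chunk text"]],
        "distances": [[0.25, 0.5]],
    }
    for doc_id, doc, score in top_hits(sample):
        print(f"{score:.2f}  {doc_id}  {doc}")
```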
Step 5: Wrap as an API
Use FastAPI to expose a /search endpoint that accepts a text query and returns similar documents.