Local Inference Quick Start

This guide uses Ollama as an example to get a large language model running locally.

Install Ollama

macOS / Linux

# Official install script
curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com and follow the setup wizard.

Pull and run a model

# Pull Llama 3 (about 4.7 GB)
ollama pull llama3

# Run interactively
ollama run llama3

Type your question directly at the interactive prompt.
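Before scripting against a model, it can help to confirm that the pull actually succeeded. The sketch below is one way to do that from Python, under a few assumptions: the Ollama server is running at its default address, its native `GET /api/tags` endpoint returns a JSON object with a `models` list, and a name without a tag means `:latest`. The helper names (`model_names`, `is_pulled`, `fetch_tags`) are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama address


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from the JSON body of GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def is_pulled(model: str, names: list[str]) -> bool:
    """Return True if `model` is in `names`; a bare name is treated as ':latest'."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running Ollama server; fails with URLError otherwise.
    names = model_names(fetch_tags())
    print("llama3 pulled:", is_pulled("llama3", names))
```

The parsing helpers are kept separate from the network call so they can be checked without a server.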

API calls

By default, Ollama serves an OpenAI-compatible API at http://localhost:11434:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }'
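The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client: it assumes the server is running locally with `llama3` already pulled, and that the response follows the OpenAI chat-completion shape (`choices[0].message.content`). The function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # endpoint from the curl example


def build_chat_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON response
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))


if __name__ == "__main__":
    # Requires a running Ollama server with the model available.
    print(chat("Hello"))
```

Because the endpoint is OpenAI-compatible, an existing OpenAI client library pointed at this base URL should also work; the plain-HTTP version is shown here to avoid extra dependencies.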

Integrating with LangChain

pip install langchain langchain-community

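# Note: newer LangChain releases move this integration to the separate
# langchain-ollama package (class OllamaLLM); this import may emit a deprecation warning.
from langchain_community.llms import Ollama

# Connects to the local Ollama server at its default address
llm = Ollama(model="llama3")
print(llm.invoke("Give a brief introduction to RAG"))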

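Integrating with LlamaIndex

pip install llama-index-llms-ollama

from llama_index.llms.ollama import Ollama

# request_timeout is in seconds; the first call can be slow while the model loads
llm = Ollama(model="llama3", request_timeout=60.0)
response = llm.complete("Hello")
print(response.text)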

Recommended models

Model	Size	Notes
llama3	4.7 GB	General-purpose chat
qwen2.5	4.4 GB	Strong Chinese support
mistral	4.1 GB	Balanced performance
phi3	2.3 GB	Fits small VRAM
gemma2	5.4 GB	Multilingual
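As a rough illustration of how the table might guide a choice, the sketch below filters the listed models by a memory budget. The sizes are copied from the table; the 1.2 headroom factor is an arbitrary assumption for this example, since actual memory use depends on context length and runtime overhead and is not the same as download size.

```python
# Download sizes in GB, copied from the table above.
MODEL_SIZES_GB = {
    "llama3": 4.7,
    "qwen2.5": 4.4,
    "mistral": 4.1,
    "phi3": 2.3,
    "gemma2": 5.4,
}


def models_within(budget_gb: float, headroom: float = 1.2) -> list[str]:
    """Return models whose size times `headroom` fits the budget, largest first.

    `headroom` is a hypothetical safety factor, not a measured value.
    """
    fits = [
        (size, name)
        for name, size in MODEL_SIZES_GB.items()
        if size * headroom <= budget_gb
    ]
    return [name for size, name in sorted(fits, reverse=True)]


if __name__ == "__main__":
    print(models_within(4.0))
    print(models_within(6.0))
```

With these numbers, a 4 GB budget leaves only phi3, which matches the table's note that it suits small VRAM.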

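Next steps

Advanced development: vLLM deployment, quantization, multi-GPU
Combining with RAG: build a local knowledge-base Q&A system
Combining with Agents: use a local model as the agent backend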
