跳到主要内容

从零到一:构建大模型评测系统

本教程将带您从零开始,构建一个完整的大模型评测系统。我们将创建一个多模型对比评测平台,能够评估不同模型在多个任务上的表现。

项目概述

功能特性

  • ✅ 多模型支持
  • ✅ 多数据集评测
  • ✅ 多种评测指标
  • ✅ 自动评测流程
  • ✅ 结果可视化
  • ✅ 报告生成

技术栈

  • 后端:Python 3.11+ + FastAPI
  • 评测框架:自定义评测框架
  • LLM:OpenAI / Anthropic / 本地模型
  • 数据库:SQLite / PostgreSQL
  • 可视化:Matplotlib / Plotly

第一步:环境准备

1.1 创建项目目录

mkdir model-evaluation-system
cd model-evaluation-system

1.2 安装依赖

创建 requirements.txt

fastapi>=0.104.0
uvicorn>=0.24.0
openai>=1.0.0
anthropic>=0.7.0
pydantic>=2.5.0
sqlalchemy>=2.0.0
pandas>=2.0.0
matplotlib>=3.7.0
plotly>=5.17.0
python-dotenv>=1.0.0

第二步:实现核心功能

2.1 模型接口 (src/models/base.py)

"""模型基类"""
from abc import ABC, abstractmethod
from typing import List, Dict

class BaseModel(ABC):
"""模型基类"""

def __init__(self, name: str, api_key: str = None):
self.name = name
self.api_key = api_key

@abstractmethod
def generate(self, prompt: str, **kwargs) -> str:
"""生成文本"""
pass

@abstractmethod
def batch_generate(self, prompts: List[str], **kwargs) -> List[str]:
"""批量生成"""
pass

2.2 OpenAI模型实现 (src/models/openai_model.py)

"""OpenAI模型实现"""
from openai import OpenAI
from src.models.base import BaseModel

class OpenAIModel(BaseModel):
"""OpenAI模型"""

def __init__(self, name: str, api_key: str, model_name: str = "gpt-4"):
super().__init__(name, api_key)
self.client = OpenAI(api_key=api_key)
self.model_name = model_name

def generate(self, prompt: str, **kwargs) -> str:
"""生成文本"""
response = self.client.chat.completions.create(
model=self.model_name,
messages=[{"role": "user", "content": prompt}],
**kwargs
)
return response.choices[0].message.content

def batch_generate(self, prompts: List[str], **kwargs) -> List[str]:
"""批量生成"""
results = []
for prompt in prompts:
results.append(self.generate(prompt, **kwargs))
return results

2.3 评测指标 (src/metrics/__init__.py)

"""评测指标"""
from typing import List, Dict

def accuracy_score(predictions: List[str], references: List[str]) -> float:
"""准确率"""
correct = sum(1 for p, r in zip(predictions, references) if p == r)
return correct / len(predictions) if predictions else 0.0

def bleu_score(predictions: List[str], references: List[str]) -> float:
"""BLEU分数"""
from nltk.translate.bleu_score import sentence_bleu
scores = []
for pred, ref in zip(predictions, references):
score = sentence_bleu([ref.split()], pred.split())
scores.append(score)
return sum(scores) / len(scores) if scores else 0.0

def semantic_similarity(predictions: List[str], references: List[str]) -> float:
"""语义相似度"""
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

pred_embeddings = model.encode(predictions)
ref_embeddings = model.encode(references)

from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pred_embeddings, ref_embeddings)
return similarities.diagonal().mean()

2.4 评测引擎 (src/evaluator.py)

"""评测引擎"""
from typing import List, Dict
from src.models.base import BaseModel
from src.metrics import accuracy_score, bleu_score, semantic_similarity

class Evaluator:
"""评测引擎"""

def __init__(self):
self.metrics = {
"accuracy": accuracy_score,
"bleu": bleu_score,
"semantic_similarity": semantic_similarity
}

def evaluate(self, model: BaseModel, dataset: List[Dict], metrics: List[str] = None) -> Dict:
"""评测模型"""
if metrics is None:
metrics = list(self.metrics.keys())

# 生成预测
prompts = [item["prompt"] for item in dataset]
predictions = model.batch_generate(prompts)
references = [item["reference"] for item in dataset]

# 计算指标
results = {}
for metric_name in metrics:
if metric_name in self.metrics:
score = self.metrics[metric_name](predictions, references)
results[metric_name] = score

return {
"model": model.name,
"dataset_size": len(dataset),
"metrics": results,
"predictions": predictions
}

2.5 API接口 (src/api.py)

"""FastAPI应用"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Dict

from src.models.openai_model import OpenAIModel
from src.evaluator import Evaluator

app = FastAPI(title="大模型评测API")

evaluator = Evaluator()

class EvaluationRequest(BaseModel):
model_name: str
model_type: str # "openai", "anthropic", "local"
api_key: str
dataset: List[Dict]
metrics: List[str] = None

@app.post("/evaluate")
async def evaluate(request: EvaluationRequest):
"""评测模型"""
try:
# 创建模型实例
if request.model_type == "openai":
model = OpenAIModel(
name=request.model_name,
api_key=request.api_key
)
else:
raise ValueError(f"不支持的模型类型: {request.model_type}")

# 执行评测
result = evaluator.evaluate(model, request.dataset, request.metrics)

return result
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))

第三步:数据集准备

3.1 创建测试数据集

"""创建测试数据集"""
import json

dataset = [
{
"prompt": "什么是人工智能?",
"reference": "人工智能是计算机科学的一个分支,旨在创建能够执行通常需要人类智能的任务的系统。"
},
{
"prompt": "解释一下机器学习。",
"reference": "机器学习是人工智能的一个子领域,通过算法让计算机从数据中学习模式。"
},
# ... 更多测试用例
]

with open("data/test_dataset.json", "w") as f:
json.dump(dataset, f, ensure_ascii=False, indent=2)

第四步:运行评测

4.1 启动服务

python -m src.api

4.2 执行评测

import requests

response = requests.post("http://localhost:8000/evaluate", json={
"model_name": "gpt-4",
"model_type": "openai",
"api_key": "your-api-key",
"dataset": dataset,
"metrics": ["accuracy", "semantic_similarity"]
})

print(response.json())

总结

本教程展示了如何构建大模型评测系统:

  1. 模型接口抽象
  2. 评测指标实现
  3. 评测引擎开发
  4. API接口
  5. 数据集准备

通过这个系统,您可以对比不同模型的表现,为模型选择提供数据支持。