跳到主要内容

从零到一:构建多模态AI应用

本教程将带您从零开始,构建一个完整的多模态AI应用。我们将创建一个智能图像问答系统,能够理解图像内容并回答相关问题。

项目概述

功能特性

  • ✅ 图像理解
  • ✅ 文本问答
  • ✅ 图像描述生成
  • ✅ 图像搜索
  • ✅ Web界面
  • ✅ API接口

技术栈

  • 框架:Transformers + CLIP
  • 模型:CLIP / BLIP / GPT-4V
  • 后端:Python 3.11+ + FastAPI
  • 前端:React + TypeScript

第一步:环境准备

1.1 安装依赖

创建 requirements.txt

torch>=2.0.0
transformers>=4.35.0
pillow>=10.0.0
fastapi>=0.104.0
uvicorn>=0.24.0
openai>=1.0.0
python-dotenv>=1.0.0

第二步:实现核心功能

2.1 图像编码器 (src/encoders/image_encoder.py)

"""图像编码器"""
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

class ImageEncoder:
"""图像编码器"""

def __init__(self, model_name="openai/clip-vit-base-patch32"):
self.model = CLIPModel.from_pretrained(model_name)
self.processor = CLIPProcessor.from_pretrained(model_name)
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
self.model.eval()

def encode(self, image: Image.Image) -> torch.Tensor:
"""编码图像"""
inputs = self.processor(images=image, return_tensors="pt").to(self.device)
with torch.no_grad():
image_features = self.model.get_image_features(**inputs)
return image_features

2.2 文本编码器 (src/encoders/text_encoder.py)

"""文本编码器"""
import torch
from transformers import CLIPProcessor, CLIPModel

class TextEncoder:
"""文本编码器"""

def __init__(self, model_name="openai/clip-vit-base-patch32"):
self.model = CLIPModel.from_pretrained(model_name)
self.processor = CLIPProcessor.from_pretrained(model_name)
self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.model.to(self.device)
self.model.eval()

def encode(self, text: str) -> torch.Tensor:
"""编码文本"""
inputs = self.processor(text=text, return_tensors="pt", padding=True).to(self.device)
with torch.no_grad():
text_features = self.model.get_text_features(**inputs)
return text_features

2.3 多模态问答 (src/qa/multimodal_qa.py)

"""多模态问答"""
from openai import OpenAI
from PIL import Image
import base64
import io

class MultimodalQA:
"""多模态问答"""

def __init__(self, api_key: str):
self.client = OpenAI(api_key=api_key)

def answer_question(self, image: Image.Image, question: str) -> str:
"""回答关于图像的问题"""
# 将图像转换为base64
buffered = io.BytesIO()
image.save(buffered, format="PNG")
image_base64 = base64.b64encode(buffered.getvalue()).decode()

# 调用GPT-4V
response = self.client.chat.completions.create(
model="gpt-4-vision-preview",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": question},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_base64}"
}
}
]
}
],
max_tokens=300
)

return response.choices[0].message.content

2.4 API接口 (src/api.py)

"""FastAPI应用"""
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from PIL import Image
import io

from src.qa.multimodal_qa import MultimodalQA
import os

app = FastAPI(title="多模态AI API")

app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)

qa_system = MultimodalQA(api_key=os.getenv("OPENAI_API_KEY"))

class QuestionRequest(BaseModel):
question: str

@app.post("/answer")
async def answer_question(
question: QuestionRequest,
image: UploadFile = File(...)
):
"""回答关于图像的问题"""
try:
# 读取图像
image_data = await image.read()
image_obj = Image.open(io.BytesIO(image_data))

# 回答问题
answer = qa_system.answer_question(image_obj, question.question)

return {"answer": answer}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))

第三步:运行和测试

3.1 启动服务

python -m src.api

3.2 测试API

curl -X POST "http://localhost:8000/answer" \
-F "question=这张图片里有什么?" \
-F "image=@test_image.jpg"

总结

本教程展示了如何构建多模态AI应用:

  1. 图像和文本编码
  2. 多模态理解
  3. 问答系统
  4. API接口

多模态AI能够同时理解图像和文本,为应用提供更强大的能力。