Complete Ollama API Guide: From Model Calls to Building Local AI Workflows


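Why choose Ollama over a hosted API

Cost is the main reason. My VPS runs three models for about €15 per month. The same call volume on OpenAI's GPT-4o would start at $50 or more. For development and testing that runs for long stretches, the gap is significant.

The other reason is control over data. Some of my projects process internal documents that should not be sent to a third-party API. With Ollama running locally, all data stays on my own server.

Ollama has limits, of course: the models it runs trail GPT-4o or Claude in quality and are mediocre at some complex reasoning tasks. For tool calling, code generation, and document summarization, though, Llama 3.1 70B is good enough.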

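Basic REST API calls

Ollama exposes a REST API that listens on port 11434 by default.

The most basic model call:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain what a REST API is in one sentence",
  "stream": false
}'

The response is a JSON object whose response field holds the generated text. "stream": false makes the server return the whole result in a single response, which suits short queries.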
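The same non-streaming call can be sketched in Python with only the standard library. The helper names (build_payload, extract_text, generate) are made up for this example, and the sample body below is trimmed to the fields used here; a real response carries more fields.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(prompt, model="llama3.1:8b"):
    """Return the JSON body for a non-streaming generate call."""
    return {"model": model, "prompt": prompt, "stream": False}


def extract_text(raw_body):
    """Pull the generated text out of a non-streaming response body."""
    return json.loads(raw_body)["response"]


def generate(prompt, model="llama3.1:8b", timeout=60):
    """Send the request; this part needs a running Ollama server."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_text(resp.read())


# Parsing a trimmed sample body, no server required
sample_body = '{"model": "llama3.1:8b", "response": "A REST API is ...", "done": true}'
print(extract_text(sample_body))
```

Splitting payload building and parsing from the network call keeps the two pure functions easy to test without a server.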

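Streaming output:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a quicksort function in Python",
  "stream": true
}'

With "stream": true the server sends newline-delimited JSON: one JSON object per line, each carrying a small piece of the output (roughly a token), with the last object marked "done": true. This is not the Server-Sent Events format, so there is no data: prefix to strip. Handling the stream in Python:

import json

import requests

def generate_stream(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True  # read the body incrementally instead of buffering it
    )
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            data = json.loads(line)  # each non-empty line is one JSON object
            print(data.get("response", ""), end="", flush=True)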
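When the full text is needed after streaming finishes, the chunks can be joined as they arrive. A minimal sketch, assuming each line is one JSON object with a response field and a done flag; collect_stream and the fake_stream sample are made up for illustration.

```python
import json


def collect_stream(lines):
    """Join the text of a streamed generate response.

    `lines` is any iterable of bytes or str lines, e.g. response.iter_lines().
    """
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object closes the stream
    return "".join(parts)


# Hand-written stand-in for a real stream
fake_stream = [
    b'{"response": "def ", "done": false}',
    b"",
    b'{"response": "quicksort", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(fake_stream))  # def quicksort
```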

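Chat endpoint: since version 0.1.15, Ollama recommends the chat endpoint over generate:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a code review assistant"},
    {"role": "user", "content": "What is wrong with this code? def foo(): return 1/0"}
  ]
}'

This endpoint fits multi-turn conversations better and supports the system, user, and assistant roles. Like generate, it streams by default; add "stream": false to get a single JSON response.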
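The chat endpoint is stateless, so the caller has to resend the whole history on every turn. A minimal sketch of that bookkeeping follows. The Conversation class and the send callable are made up for this example; in real use send would post the messages to /api/chat and return the assistant's text.

```python
class Conversation:
    """Keeps the message list for a multi-turn chat."""

    def __init__(self, send, system_prompt=None):
        self.send = send  # callable: list of messages -> assistant reply text
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = self.send(list(self.messages))  # pass a copy of the history so far
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    # Stand-in for a real chat call: reports how much history it received
    return f"seen {len(messages)} messages"


chat = Conversation(fake_send, system_prompt="You are a code review assistant")
print(chat.ask("What is wrong with def foo(): return 1/0?"))  # seen 2 messages
print(chat.ask("How would you fix it?"))  # seen 4 messages
```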

Using the Python SDK

The official Python SDK is more convenient than calling the REST API directly:

pip install ollama

Basic usage:

import ollama

response = ollama.chat(
    model='llama3.1:8b',
    messages=[
        {'role': 'user', 'content': 'What is a context window?'}
    ]
)
print(response['message']['content'])

Streaming output:

import ollama

stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Explain Kubernetes'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

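A streaming pitfall: many tutorials go straight to for chunk in stream. With stream=True the SDK returns a generator, but code that takes the stream flag from a variable may receive a single response instead. hasattr(stream, '__iter__') cannot tell the two apart, because a dict (and, as far as I can tell, the SDK's response object) is iterable too. Checking for an iterator is safer:

from collections.abc import Iterator

import ollama

stream = ollama.chat(model='llama3.1:8b', messages=[...], stream=True)
# A generator is an Iterator; a single response is not
if isinstance(stream, Iterator):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
else:
    print(stream['message']['content'])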

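Model management commands

Ollama stores models under ~/.ollama/models for a per-user install (a Linux service install typically uses /usr/share/ollama/.ollama/models instead). Models are managed from the command line:

# List installed models
ollama list

# Pull a new model
ollama pull qwen2.5:14b

# Delete a model
ollama rm llama3.1:8b

# Copy a model (to create a custom variant)
ollama cp llama3.1:8b my-custom-llama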
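The installed models can also be read over HTTP: GET /api/tags is the REST counterpart of ollama list. A minimal sketch of parsing that response follows. The helper names are made up, and the sizes in the sample body are illustrative values rather than real measurements.

```python
import json
import urllib.request


def summarize_models(tags_body):
    """Return (name, size in GB) pairs from a GET /api/tags response body."""
    data = json.loads(tags_body)
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in data.get("models", [])]


def list_models(base_url="http://localhost:11434", timeout=10):
    """Fetch and summarize installed models; needs a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return summarize_models(resp.read())


# Trimmed sample body with made-up sizes, no server required
sample = (
    '{"models": ['
    '{"name": "llama3.1:8b", "size": 4661224676}, '
    '{"name": "qwen2.5:14b", "size": 8988124069}]}'
)
for name, size_gb in summarize_models(sample):
    print(f"{name}\t{size_gb} GB")
```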

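Custom configuration with a Modelfile: to customize model behavior (a system prompt, for example), create a Modelfile:

FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """
You are a technical writer who specializes in reviewing API documentation.
Explain concepts in clear, concise language and avoid jargon.
"""

Then register it with ollama create:

ollama create my-api-reviewer -f Modelfile

After that, my-api-reviewer can be used as the model name in calls.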
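If several variants are needed (the same base model with different system prompts, say), the Modelfile text can be generated instead of hand-edited. A small sketch covering only the FROM, PARAMETER, and SYSTEM instructions used in this article; render_modelfile is a made-up helper, not part of Ollama.

```python
def render_modelfile(base, system, **params):
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.1:8b",
    "You are a technical writer who specializes in reviewing API documentation.",
    temperature=0.7,
    top_p=0.9,
)
print(text)
```

Writing the result to a file named Modelfile and running ollama create as shown above registers the variant.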

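Building a local AI agent with Ollama

The most useful setup I have built uses Ollama as the agent's reasoning engine, combined with tool calls for automation.

A simplified agent architecture:

import json

import ollama

class SimpleAgent:
    def __init__(self, model="llama3.1:8b"):
        self.model = model
        self.tools = {
            "calculator": self.calc,
            "search": self.search
        }

    def calc(self, expr):
        # WARNING: eval runs arbitrary Python; only acceptable for trusted input
        try:
            return eval(expr)
        except Exception:
            return "calculation error"

    def search(self, query):
        # Simplified; a real implementation would call a search API
        return f"Information about {query}..."

    def run(self, task):
        # Step 1: let the model analyze the task and decide whether a tool is needed
        prompt = (
            f"Task: {task}\n"
            "Available tools: calculator(expr), search(query)\n"
            "If a tool is needed, reply with JSON only: "
            '{"tool": "tool name", "args": {"parameter name": "value"}}. '
            'Otherwise reply "No tool needed: your answer".'
        )
        response = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )

        result = response['message']['content'].strip()

        # Try to parse a tool call
        if result.startswith("{") and "tool" in result:
            try:
                call = json.loads(result)
                tool_name = call.get("tool")
                args = call.get("args", {})
                if tool_name in self.tools:
                    tool_result = self.tools[tool_name](**args)
                    # Step 2: generate the final answer from the tool result
                    final = ollama.chat(
                        model=self.model,
                        messages=[
                            {"role": "user", "content": f"Task: {task}"},
                            {"role": "system", "content": f"Tool result: {tool_result}"}
                        ]
                    )
                    return final['message']['content']
            except (json.JSONDecodeError, TypeError, AttributeError):
                pass  # malformed JSON or wrong argument names: fall back to the raw reply

        return result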

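Limitations of this simplified agent: it does not use native tool calling and relies on prompt engineering alone, so its accuracy is unstable. For more reliable tool calls, use a model with function-calling support (such as Qwen2.5) through the tools parameter, or use the MCP protocol.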
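Two small hardening steps help a prompt-driven agent like this one: evaluating arithmetic without eval, and finding the JSON tool call even when the model wraps it in extra text. The sketch below is one possible approach, not the agent's original code; safe_calc and extract_tool_call are made-up names, and the calculator handles only +, -, *, / and unary minus.

```python
import ast
import json
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_calc(expr):
    """Evaluate basic arithmetic on numbers without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval").body)


def extract_tool_call(text):
    """Return the first JSON object in `text` that has a "tool" key, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "tool" in obj:
            return obj
        start = text.find("{", start + 1)
    return None


reply = 'Sure, here is the call: {"tool": "calculator", "args": {"expr": "2 * (3 + 4)"}} Done.'
call = extract_tool_call(reply)
print(call["tool"], safe_calc(call["args"]["expr"]))  # calculator 14
```

Anything outside the whitelisted node types, such as a function call or attribute access, raises ValueError instead of being executed.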

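MCP integration

MCP (Model Context Protocol) is the tool-integration standard proposed by Anthropic. As far as I can tell, Ollama does not speak MCP itself: what it provides is tool calling through the tools parameter of the chat API, and an MCP client has to sit in between, launching MCP servers and translating their tools into that format.

MCP server configuration example (using the filesystem server), in the mcpServers format read by MCP clients: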

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  }
}

Passing tool definitions when calling the model:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "List all files under /tmp"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "filesystem_list",
                "description": "List the files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"}
                    },
                    "required": ["path"]
                }
            }
        }
    ]
)

# Check whether the model requested a tool call
if response.get('message', {}).get('tool_calls'):
    for tool_call in response['message']['tool_calls']:
        print(f"Tool called: {tool_call['function']['name']}")
        print(f"Arguments: {tool_call['function']['arguments']}")
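Printing the tool call is only half of the loop: the caller still has to run the tool and send the result back as a message with the tool role. A minimal sketch of that dispatch step, written against plain dicts shaped like the message above. REGISTRY and run_tool_calls are made-up names, and filesystem_list here is a local stand-in rather than a real MCP server.

```python
import json
import os
import tempfile


def filesystem_list(path):
    """Local stand-in for the tool: return the sorted file names in `path`."""
    return sorted(os.listdir(path))


REGISTRY = {"filesystem_list": filesystem_list}


def run_tool_calls(message, registry=REGISTRY):
    """Execute each tool call in an assistant message; return tool-role replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # tolerate arguments delivered as a JSON string
            args = json.loads(args)
        handler = registry.get(fn["name"])
        if handler is None:
            content = f"unknown tool: {fn['name']}"
        else:
            try:
                content = json.dumps(handler(**args))
            except Exception as exc:  # report the failure to the model instead of crashing
                content = f"tool error: {exc}"
        replies.append({"role": "tool", "content": content})
    return replies


# Demo against a temporary directory and a hand-written assistant message
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.txt", "a.txt"):
        open(os.path.join(tmp, name), "w").close()
    fake_message = {
        "tool_calls": [
            {"function": {"name": "filesystem_list", "arguments": {"path": tmp}}}
        ]
    }
    demo_replies = run_tool_calls(fake_message)
    print(demo_replies)
```

The replies are appended to the message list after the assistant message, and the chat call is repeated so the model can phrase the final answer.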

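An MCP pitfall: MCP support around Ollama is still immature, and some tool calls fail or return malformed output. For production I suggest Claude CLI with Browserbase Skills, where MCP support is more mature.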

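Context management and memory

Ollama's default context window is 4096 tokens for the 8B model on my version (older releases defaulted to 2048). It can be raised with the num_ctx option: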

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "options": {"num_ctx": 8192},
  "messages": [...]
}'

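Context management strategies for long conversations:

1. Simple truncation: drop the earliest messages once the history exceeds a set length. This is the simplest option but may lose important context.

2. Summary compression: periodically have the model compress the earlier conversation into a summary and keep only the summary from then on. A sketch:

import ollama

MODEL = "llama3.1:8b"

def summarize_if_needed(messages, max_turns=10):
    keep = max_turns * 2  # one turn = one user message plus one assistant reply
    if len(messages) > keep:
        old_messages = messages[:-keep]
        recent = messages[-keep:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary_prompt = f"Summarize the key points of the following conversation, keeping important details:\n{transcript}"
        summary = ollama.chat(model=MODEL, messages=[{"role": "user", "content": summary_prompt}])
        return [{"role": "system", "content": f"Summary of the earlier conversation: {summary['message']['content']}"}] + recent
    return messages

3. Vector database storage: store the key points of each turn in a vector database and retrieve the relevant history at query time. This is the most complex option but gives the best results, and it suits scenarios that need long-term memory.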
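Strategy 1 from the list above is short enough to sketch in full. This version keeps every system message and the most recent messages of the other roles; truncate_history and its max_messages parameter are made up for this example, and counting messages rather than tokens is a deliberate simplification.

```python
def truncate_history(messages, max_messages=20):
    """Keep all system messages plus the most recent `max_messages` others."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a code review assistant"}]
for i in range(3):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

for message in truncate_history(history, max_messages=2):
    print(message["role"], "-", message["content"])
```

Keeping the system prompt outside the window matters because dropping it silently changes the model's behavior midway through a conversation.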

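Performance tuning

GPU offload: if VRAM is tight, limit how much of the model is offloaded to the GPU. As far as I know, Ollama does not read a config.json for this. The relevant setting is the num_gpu option, which is the number of model layers placed on the GPU, not the number of GPUs. It can be set per request (20 is an arbitrary example; lower it until the model fits):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "...",
  "options": {"num_gpu": 20}
}'

Or set it in the Modelfile:

PARAMETER num_gpu 20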

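Concurrent requests: a single Ollama instance has limited concurrency. For high-concurrency workloads:

1. Raise parallelism with the OLLAMA_NUM_PARALLEL environment variable

2. Put a load balancer such as Nginx in front

3. Run several Ollama instances, each on its own port

In my tests, a single VPS (4 cores, 8 GB RAM) starts to stall once more than 3 requests run concurrently. For production-grade concurrency, consider vLLM or TGI instead.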
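On a small machine it can help to cap concurrency on the client side, so that no more than about 3 requests reach the server at once. A minimal sketch with a thread pool follows. run_bounded is a made-up helper, and fake_worker stands in for a real model call while recording the peak number of simultaneous calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(prompts, worker, max_concurrency=3):
    """Run worker(prompt) for every prompt, at most max_concurrency at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(worker, prompts))  # results keep the input order


# Shared counters used only by the demo worker
state = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_worker(prompt):
    with lock:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.05)  # stands in for the time a model call takes
    with lock:
        state["active"] -= 1
    return prompt.upper()


results = run_bounded(["a", "b", "c", "d", "e", "f"], fake_worker)
print(results, "peak concurrency:", state["peak"])
```

The pool size is the only knob: extra prompts wait in the executor's queue instead of piling up on the server.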

Model recommendations:

Scenario	Recommended model	Minimum spec
Code generation	codellama:34b	24 GB RAM
Chinese conversation	qwen2.5:14b	16 GB RAM
Quick tests	llama3.2:3b	4 GB RAM
Balanced choice	llama3.1:8b	8 GB RAM

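Troubleshooting

Problem 1: model downloads are slow

ollama pull downloads models from the Ollama registry (registry.ollama.ai), not from GitHub releases. If the network is unreliable:

# Use a proxy. The download is done by the Ollama server process,
# so the variable must be set in the environment where ollama serve runs
export HTTPS_PROXY=http://127.0.0.1:7890
ollama pull llama3.1:8b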

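Problem 2: out of VRAM (OOM)

Lower num_gpu or switch to a smaller model:

ollama run llama3.2:3b  # much smaller than the 8B model

Problem 3: API timeouts

Ollama itself imposes no timeout by default, but some clients set one automatically. To control it yourself:

import signal

import ollama

MODEL = "llama3.1:8b"

def timeout_handler(signum, frame):
    raise TimeoutError("API call timed out")

# SIGALRM exists only on Unix and must be armed from the main thread
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(30)  # 30-second timeout

try:
    response = ollama.chat(model=MODEL, messages=[...])
finally:
    signal.alarm(0)  # always cancel the pending alarm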
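Because SIGALRM is unavailable on Windows and in worker threads, a thread-based wrapper is a more portable way to bound a call. The sketch below is one possible approach; call_with_timeout is a made-up helper, and slow_call stands in for a model call. The SDK's Client also appears to accept a timeout argument that is passed to its HTTP layer, which is usually the simpler option when it applies.

```python
import concurrent.futures
import time


def call_with_timeout(fn, seconds, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; raise TimeoutError if it is too slow."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {seconds}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on a worker that is still running


def slow_call():
    time.sleep(0.5)  # stands in for a model call that takes too long
    return "done"


try:
    call_with_timeout(slow_call, 0.1)
except TimeoutError as exc:
    print(exc)  # call exceeded 0.1s
```

The timed-out call is abandoned rather than cancelled: the worker thread keeps running until the underlying request finishes, so this bounds the caller's wait, not the server's work.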

Problem 4: garbled Chinese output

Chinese output from some models may come out garbled. Make sure the terminal encoding is UTF-8:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

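Summary

Ollama is one of the most convenient ways to run large models locally. Its REST API and Python SDK are mature enough for:

Quick experiments during development and testing
Scenarios with data-privacy requirements
Cost-sensitive, long-running workloads

The main limits are model quality and concurrency. For complex tool-calling scenarios, I suggest waiting until MCP support around Ollama matures before relying on it in production.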
