Complete Ollama Local LLM API Guide: From Model Calls to Building Local AI Workflows

Tags: Ollama, Local AI, Python SDK, REST API, AI Workflow, MCP Server Protocol

Why Ollama Over Direct API Calls

Cost is the core reason. I run 3 models on my VPS for about €15/month. The same usage with OpenAI's GPT-4o would cost at least $50. For long-running development tests, this difference adds up fast.

Another reason is data control. Some of my projects involve internal document processing that shouldn't go to third-party APIs. Ollama runs locally — all data stays on your own servers.

Ollama has limitations: the open models it runs lag behind GPT-4o or Claude in quality, and complex reasoning tasks can be underwhelming. But for tool calling, code generation, and document summarization, Llama 3.1 70B gets the job done.

REST API Fundamentals

Ollama provides a standard REST API on port 11434 by default.

Basic model call:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain what a REST API is in one sentence",
  "stream": false
}'

Returns a JSON object with a response field. With stream: false, the call blocks until the full answer is ready, which suits short queries.
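The same non-streaming call can be made from Python with only the standard library. This is a minimal sketch rather than official client code: the helper names build_generate_request, extract_text and generate are hypothetical, and it assumes the default address used in the curl example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default port from the curl example

def build_generate_request(prompt, model="llama3.1:8b"):
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_text(body):
    """Pull the generated text out of the JSON body returned by the server."""
    return json.loads(body).get("response", "")

def generate(prompt, model="llama3.1:8b", timeout=60):
    """Send the request and return the text; needs a running Ollama server."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_text(reply.read())
```

Splitting request building from sending keeps the payload logic checkable without a server; only generate() touches the network.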

Streaming output:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a Python quicksort function",
  "stream": true
}'

With stream: true, it returns newline-delimited JSON: one JSON object per line, each carrying a fragment of the output, with the final object marked "done": true. Handling this in Python:

import requests
import json

def generate_stream(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            print(data.get("response", ""), end="", flush=True)
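When the full text is needed rather than live printing, the streamed lines can be collected into one string. The sketch below is an illustration under assumptions: collect_stream is a hypothetical helper, and it assumes that a failure arrives as a JSON object with an "error" field.

```python
import json

def collect_stream(lines):
    """Join the "response" fragments from an iterable of JSON lines.

    `lines` can be response.iter_lines() from requests, or any iterable
    of bytes/str. Stops at the object whose "done" field is true.
    """
    parts = []
    for line in lines:
        if not line:          # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(line)
        if "error" in chunk:  # assumption: errors arrive as {"error": "..."}
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

With the function above, collect_stream(response.iter_lines()) would return the whole answer once the final chunk arrives.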

Chat interface: Starting from version 0.1.15, Ollama recommends the chat endpoint over generate:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a code review assistant"},
    {"role": "user", "content": "What is wrong with this code? def foo(): return 1/0"}
  ]
}'

This interface is better suited for multi-turn conversations, supporting system, user, and assistant roles.
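The chat endpoint is stateless, so a multi-turn conversation means resending the whole message list on every request. A minimal sketch of that bookkeeping follows; the Conversation class and its method names are hypothetical, not part of Ollama.

```python
class Conversation:
    """Keeps the message list that /api/chat expects across turns."""

    def __init__(self, system_prompt=None, model="llama3.1:8b"):
        self.model = model
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        # Store the model's reply so the next request includes it as context
        self.messages.append({"role": "assistant", "content": text})

    def payload(self, stream=False):
        """Request body for POST /api/chat (a copy, so callers cannot mutate history)."""
        return {"model": self.model, "messages": list(self.messages), "stream": stream}
```

After each reply, pass the returned message content to add_assistant() before adding the next user turn.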

Python SDK in Practice

The official Python SDK is more convenient than calling the REST API directly:

pip install ollama

Basic usage:

import ollama

response = ollama.chat(
    model='llama3.1:8b',
    messages=[
        {'role': 'user', 'content': 'What is a context window?'}
    ]
)
print(response['message']['content'])

Streaming output:

import ollama

stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Explain Kubernetes'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

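A streaming gotcha: Many tutorials use for chunk in stream directly. That only works when stream=True was actually passed; without it, ollama.chat returns a single response object instead of a generator. Note that hasattr(stream, '__iter__') cannot tell the two apart, because dict-like response objects are iterable too. If the flag comes from configuration, a safer approach is to check for an iterator:

from collections.abc import Iterator

import ollama

stream = ollama.chat(model='llama3.1:8b', messages=[...], stream=True)
# A generator is an Iterator; a single response object is not
if isinstance(stream, Iterator):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
else:
    print(stream['message']['content'])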

Model Management Commands

Ollama stores models in ~/.ollama/models and manages them via CLI:

# List installed models
ollama list

# Pull a new model
ollama pull qwen2.5:14b

# Remove a model
ollama rm llama3.1:8b

# Copy a model (for creating custom variants)
ollama cp llama3.1:8b my-custom-llama

Custom config with Modelfile: To customize model behavior (e.g., system prompts), create a Modelfile:

FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """
You are a technical writer specialized in API documentation review.
Explain concepts clearly without jargon.
"""

ollama create my-api-reviewer -f Modelfile

You can then call it using my-api-reviewer as the model name.
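If you maintain several variants, the Modelfile text can be generated instead of written by hand. This is a small sketch under assumptions: render_modelfile is a hypothetical helper that emits only the FROM, PARAMETER and SYSTEM instructions shown above, and it does not escape a system prompt that itself contains triple quotes.

```python
def render_modelfile(base, system, **parameters):
    """Return Modelfile text with FROM, PARAMETER and SYSTEM instructions."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Assumes the system prompt contains no triple quotes of its own
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"
```

Write the returned text to a file named Modelfile and pass it to ollama create as shown above.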

Building a Local AI Agent with Ollama

The most useful scenario I built: using Ollama as an Agent's reasoning engine with tool calling for automation.

Simplified Agent architecture:

import ollama
import json

class SimpleAgent:
    def __init__(self, model="llama3.1:8b"):
        self.model = model
        self.tools = {
            "calculator": self.calc,
            "search": self.search
        }

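    def calc(self, expr):
        # eval() on model output is risky; empty builtins narrow it but are not a sandbox
        try:
            return eval(expr, {"__builtins__": {}}, {})
        except Exception:
            return "Calculation error"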

    def search(self, query):
        # Simplified — should call actual search API
        return f"Information about {query}..."

    def run(self, task):
        # Step 1: Let model decide if tools are needed
        response = ollama.chat(
            model=self.model,
            messages=[{
                "role": "user",
                # Name each tool's parameter so the returned args match calc(expr) and search(query)
                "content": f"Task: {task}\nAvailable tools: calculator(expr), search(query)\nIf tools are needed, return only JSON: {{\"tool\": \"toolname\", \"args\": {{\"param\": \"value\"}}}}, otherwise reply \"No tool: your answer\""
            }]
        )

        result = response['message']['content']

        # Try to parse tool call
        if result.startswith("{") and "tool" in result:
            try:
                call = json.loads(result)
                tool_name = call.get("tool")
                args = call.get("args", {})
                if tool_name in self.tools:
                    tool_result = self.tools[tool_name](**args)
                    # Step 2: Generate final answer with tool result
                    final = ollama.chat(
                        model=self.model,
                        messages=[
                            {"role": "user", "content": f"Task: {task}"},
                            {"role": "system", "content": f"Tool result: {tool_result}"}
                        ]
                    )
                    return final['message']['content']
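            except (json.JSONDecodeError, TypeError):
                # Malformed JSON or wrong argument names: fall back to the raw reply
                pass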

        return result

This simplified Agent has limits: it doesn't use native tool calling and relies on prompt engineering, which is unreliable. For stable tool calling, use a model with function calling support (such as Llama 3.1 or Qwen2.5) through the tools parameter of the chat API, or MCP.
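One reason the prompt-based approach is fragile is that models often wrap the JSON in extra prose or a code fence, so a startswith("{") check misses it. The sketch below scans the reply for the first JSON object that has a "tool" key; extract_tool_call is a hypothetical helper, and the {"tool": ..., "args": ...} shape is the one the Agent's prompt asks for.

```python
import json

def extract_tool_call(reply):
    """Return (tool_name, args) if the reply contains a JSON tool call, else None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and isinstance(obj.get("tool"), str):
            args = obj.get("args")
            return obj["tool"], (args if isinstance(args, dict) else {})
        start = reply.find("{", start + 1)
    return None
```

This reduces parse failures but does not make the model's choice of tool or arguments any more reliable.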

MCP Protocol Integration

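MCP (Model Context Protocol) is an open standard from Anthropic for connecting models to external tools and data. Ollama does not speak MCP itself; it exposes tool calling through the tools parameter of the chat API, so an MCP client or bridge has to launch the MCP servers and translate their tools into that format.

MCP Server config example (filesystem), in the mcpServers format read by MCP clients: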

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  }
}

Passing tools in API calls:

import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "List all files in /tmp"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "filesystem_list",
                "description": "List files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"}
                    }
                }
            }
        }
    ]
)

# Check if tool call was triggered
if response.get('message', {}).get('tool_calls'):
    for tool_call in response['message']['tool_calls']:
        print(f"Calling tool: {tool_call['function']['name']}")
        print(f"Args: {tool_call['function']['arguments']}")
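Printing the tool call is only half of the loop: the caller still has to run the tool and send the result back as a message with role "tool" so the model can write its final answer. A minimal dispatch sketch follows; run_tool_calls and the registry mapping are hypothetical, and the message is assumed to be a plain dict shaped like the response above, with arguments already parsed into a dict.

```python
def run_tool_calls(message, registry):
    """Execute each requested tool and return "tool" role messages with the results."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = call["function"].get("arguments") or {}
        func = registry.get(name)
        if func is None:
            content = f"Unknown tool: {name}"
        else:
            try:
                content = str(func(**args))
            except Exception as exc:  # report failures to the model instead of crashing
                content = f"Tool {name} failed: {exc}"
        results.append({"role": "tool", "content": content})
    return results
```

A registry such as {"filesystem_list": os.listdir} would serve the example above; append the assistant message and the returned tool messages to the conversation, then call ollama.chat again.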

MCP gotcha: tool calling through Ollama isn't fully polished yet, and some tool calls fail or return format errors. For production, use Claude CLI with Browserbase Skills, which has more mature MCP support.

Context Management and Memory

Ollama defaults to a 4096-token context window (8B model). Adjust it with num_ctx in the request options:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "options": {"num_ctx": 8192},
  "messages": [...]
}'

Strategies for long conversation context:

1. Simple truncation: Drop early messages when length exceeds a threshold. Simplest approach but risks losing important context.

2. Summary compression: Periodically compress prior conversation into a summary using the model. Keep only the summary going forward. A sketch:

import ollama

MODEL = "llama3.1:8b"

def summarize_if_needed(messages, max_turns=10):
    # Trigger once more than max_turns * 2 messages pile up; keep the last max_turns verbatim
    if len(messages) > max_turns * 2:
        old_messages = messages[:-max_turns]
        recent = messages[-max_turns:]
        # Render the old messages as a plain transcript rather than a Python repr
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary_prompt = f"Summarize this conversation, keeping key info:\n{transcript}"
        summary = ollama.chat(model=MODEL, messages=[{"role": "user", "content": summary_prompt}])
        return [{"role": "system", "content": f"Prior conversation summary: {summary['message']['content']}"}] + recent
    return messages

3. Vector database storage: Store key points from each turn in a vector DB and retrieve the relevant history at query time. Most complex but most effective for long-term memory.
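The first strategy, simple truncation, can be sketched in a few lines. This is an illustration under assumptions: truncate_history is a hypothetical helper, and it uses character counts as a rough stand-in for tokens rather than a real tokenizer.

```python
def truncate_history(messages, max_chars=8000):
    """Keep system messages plus the most recent turns that fit in max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):   # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget and kept:   # always keep at least the latest message
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]       # restore chronological order
```

Keeping the system messages out of the cut avoids the most common failure of naive truncation, where the model silently loses its instructions.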

Performance Optimization in Practice

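GPU offloading config: If VRAM is tight, limit how many model layers are offloaded to the GPU. Ollama does not read a .ollama/config.json file for this; num_gpu (the number of layers to offload) and main_gpu (which GPU to prefer) are model options, passed in the options field of an API request:

{
  "model": "llama3.1:8b",
  "options": {"num_gpu": 1, "main_gpu": 0},
  "messages": [...]
}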

Or set it in Modelfile:

PARAMETER num_gpu 1

Concurrent request handling: Ollama handles limited concurrency per instance. For high concurrency:

1. Raise the OLLAMA_NUM_PARALLEL environment variable

2. Add load balancing in front (for example, a tuned Nginx)

3. Run multiple Ollama instances (different ports)

Testing on a VPS (4 cores, 8GB RAM): Ollama starts lagging with more than 3 concurrent requests. For production-level concurrency, consider vLLM or TGI instead.

Model selection guide:

Use Case	Recommended Model	Min Config
Code generation	codellama:34b	24GB RAM
Chinese dialogue	qwen2.5:14b	16GB RAM
Quick testing	llama3.2:3b	4GB RAM
Balanced choice	llama3.1:8b	8GB RAM

Troubleshooting Common Issues

Issue 1: Slow model downloads

ollama pull downloads models from the Ollama model registry by default. If your network is unstable:

# Use proxy
export HTTPS_PROXY=http://127.0.0.1:7890
ollama pull llama3.1:8b

Issue 2: Out of memory (OOM)

Reduce num_gpu or switch to a smaller model:

ollama run llama3.2:3b  # Much smaller than 8B

Issue 3: API timeout

Ollama has no default timeout, but some clients auto-set one. For manual control:

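import signal

import ollama

MODEL = "llama3.1:8b"

def timeout_handler(signum, frame):
    raise TimeoutError("API call timed out")

# SIGALRM exists only on Unix and only works in the main thread
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(30)  # 30 second timeout

try:
    response = ollama.chat(model=MODEL, messages=[...])
finally:
    signal.alarm(0)  # always cancel the pending alarm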

Issue 4: Chinese encoding problems

Some models output garbled Chinese. Ensure your terminal uses UTF-8:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

Summary

Ollama is one of the most convenient ways to run LLMs locally. The REST API and Python SDK are solid, and it's ideal for:

Quick experiments during development and testing
Use cases with strict data privacy requirements
Cost-sensitive long-running tasks

Main limitations are model quality and concurrent performance. For complex tool-calling scenarios, wait for Ollama's MCP support to mature before using it in production.
