Complete Ollama Local LLM API Guide: From Model Calls to Building Local AI Workflows
Why Ollama Over Direct API Calls
Cost is the core reason. I run 3 models on my VPS for about €15/month. The same usage with OpenAI's GPT-4o would cost at least $50. For long-running development tests, this difference adds up fast.
Another reason is data control. Some of my projects involve internal document processing that shouldn't go to third-party APIs. Ollama runs locally — all data stays on your own servers.
Ollama has limitations: model quality lags behind GPT-4o or Claude, and complex reasoning tasks can be underwhelming. But for tool calling, code generation, and document summarization, Llama 3.1 70B gets the job done.
REST API Fundamentals
Ollama provides a standard REST API on port 11434 by default.
Basic model call:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain what a REST API is in one sentence",
"stream": false
}'
Returns a JSON object with a response field. With stream: false, the call is synchronous and returns the whole completion at once, which suits short queries.
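The same non-streaming call can be made from Python with only the standard library. This is a minimal sketch: the helper names (build_generate_request, extract_text, generate) are made up for illustration, and it assumes Ollama is listening on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default port from above


def build_generate_request(prompt, model="llama3.1:8b"):
    """Build a POST request for a non-streaming /api/generate call."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(raw_body):
    """Pull the generated text out of the JSON body returned by the server."""
    return json.loads(raw_body)["response"]


def generate(prompt, model="llama3.1:8b", timeout=60):
    """Send the request and return the completion (needs a running server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_text(reply.read())
```

With a server running, generate("Explain what a REST API is in one sentence") returns the text of the response field.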
Streaming output:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Write a Python quicksort function",
"stream": true
}'
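With stream: true, it returns newline-delimited JSON (one JSON object per line, not Server-Sent Events). Each chunk carries a piece of the output in its response field, and the last chunk is marked done: true. Handling this in Python:
import json
import requests

def generate_stream(prompt, model="llama3.1:8b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True
    )
    response.raise_for_status()
    # Each non-empty line is one JSON chunk
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            print(data.get("response", ""), end="", flush=True)
            if data.get("done"):
                break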
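If you want the full text back instead of printing it, the line-by-line parsing can be pulled into a small pure function. This is a sketch, not part of any SDK: collect_stream is a hypothetical helper, and it assumes each line is one JSON chunk with a response field and a done flag, as /api/generate sends them.

```python
import json


def collect_stream(lines):
    """Join the text from an iterable of JSON lines (bytes or str).

    Returns (full_text, final_chunk); final_chunk is the chunk marked
    "done": true, or None if the stream ended before that chunk arrived.
    """
    parts = []
    final_chunk = None
    for line in lines:
        if not line:  # skip empty lines between chunks
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final_chunk = chunk
            break
    return "".join(parts), final_chunk
```

Usage with requests would look like text, final = collect_stream(response.iter_lines()). Getting None as the final chunk is a cheap way to detect a stream that was cut off.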
Chat interface: Starting from version 0.1.15, Ollama recommends the chat endpoint over generate:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": "You are a code review assistant"},
{"role": "user", "content": "What is wrong with this code? def foo(): return 1/0"}
]
}'
This interface is better suited for multi-turn conversations, supporting system, user, and assistant roles.
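Multi-turn here means the client resends the whole history on every request; the server does not remember earlier turns. A minimal sketch of that bookkeeping follows. Conversation is a hypothetical helper class, and chat_fn stands for any callable shaped like ollama.chat from the Python SDK covered in the next section.

```python
class Conversation:
    """Keeps the message history that the chat endpoint expects."""

    def __init__(self, chat_fn, model="llama3.1:8b", system=None):
        # chat_fn: any callable accepting model=... and messages=...
        self.chat_fn = chat_fn
        self.model = model
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = self.chat_fn(model=self.model, messages=self.messages)
        content = reply["message"]["content"]
        # Store the assistant turn so the next request carries the full history
        self.messages.append({"role": "assistant", "content": content})
        return content
```

Injecting chat_fn keeps the class easy to test with a fake function, and lets you swap the SDK for a raw HTTP call later.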
Python SDK in Practice
The official Python SDK is more convenient than calling the REST API directly:
pip install ollama
Basic usage:
import ollama

response = ollama.chat(
    model='llama3.1:8b',
    messages=[
        {'role': 'user', 'content': 'What is a context window?'}
    ]
)
print(response['message']['content'])
Streaming output:
import ollama

stream = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Explain Kubernetes'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
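**A streaming gotcha:** Many tutorials use for chunk in stream directly, but that only works when stream=True was actually passed; otherwise ollama.chat returns a single response object rather than an iterator of chunks. Checking hasattr(stream, '__iter__') is not enough, because a dict or response object is iterable too. A safer approach is to test for an iterator:
from collections.abc import Iterator

stream = ollama.chat(model='llama3.1:8b', messages=[...], stream=True)
# A generator is an Iterator; a single response object is not
if isinstance(stream, Iterator):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
else:
    print(stream['message']['content'])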
Model Management Commands
Ollama stores models in ~/.ollama/models and manages them via CLI:
# List installed models
ollama list
# Pull a new model
ollama pull qwen2.5:14b
# Remove a model
ollama rm llama3.1:8b
# Copy a model (for creating custom variants)
ollama cp llama3.1:8b my-custom-llama
Custom config with Modelfile: To customize model behavior (e.g., system prompts), create a Modelfile:
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """
You are a technical writer specialized in API documentation review.
Explain concepts clearly without jargon.
"""
Register it with ollama create:
ollama create my-api-reviewer -f Modelfile
You can then call it using my-api-reviewer as the model name.
Building a Local AI Agent with Ollama
The most useful scenario I built: using Ollama as an Agent's reasoning engine with tool calling for automation.
Simplified Agent architecture:
import json
import ollama

class SimpleAgent:
    def __init__(self, model="llama3.1:8b"):
        self.model = model
        self.tools = {
            "calculator": self.calc,
            "search": self.search,
        }

    def calc(self, expr):
        # WARNING: eval on model output is unsafe. Builtins are stripped here,
        # but use a real expression parser outside of demos.
        try:
            return eval(expr, {"__builtins__": {}}, {})
        except Exception:
            return "Calculation error"

    def search(self, query):
        # Simplified: should call an actual search API
        return f"Information about {query}..."

    def run(self, task):
        # Step 1: Let model decide if tools are needed.
        # The prompt names each tool's argument so **args matches the method signature.
        prompt = (
            f"Task: {task}\n"
            "Available tools: calculator(expr), search(query)\n"
            'If a tool is needed, return only JSON like '
            '{"tool": "calculator", "args": {"expr": "2+2"}}, '
            'otherwise reply "No tool: your answer"'
        )
        response = ollama.chat(
            model=self.model,
            messages=[{"role": "user", "content": prompt}]
        )
        result = response['message']['content'].strip()
        # Try to parse tool call
        if result.startswith("{") and "tool" in result:
            try:
                call = json.loads(result)
                tool_name = call.get("tool")
                args = call.get("args", {})
                if tool_name in self.tools:
                    tool_result = self.tools[tool_name](**args)
                    # Step 2: Generate final answer with tool result
                    final = ollama.chat(
                        model=self.model,
                        messages=[
                            {"role": "user", "content": f"Task: {task}"},
                            {"role": "system", "content": f"Tool result: {tool_result}"}
                        ]
                    )
                    return final['message']['content']
            except (json.JSONDecodeError, TypeError):
                pass  # Malformed call or wrong arguments: fall back to the raw reply
        return result
This simplified Agent has limits: it does not use native tool calling and relies on prompt engineering, which is unreliable. For stable tool calling, use a model with function calling support (such as Llama 3.1 or Qwen2.5) through the tools parameter of the chat API, or an MCP-based setup.
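Two parts of the prompt-engineering approach are easy to harden without changing the design: evaluating arithmetic without eval, and finding the JSON tool call even when the model wraps it in extra text. The sketch below is one way to do that. safe_calc and parse_tool_call are hypothetical helpers, and the supported operator set is deliberately small.

```python
import ast
import json
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_calc(expr):
    """Evaluate +, -, *, / on numeric literals only; reject everything else."""

    def walk(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval").body)


def parse_tool_call(text):
    """Return (tool, args) if text contains a JSON tool call, else None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not isinstance(call.get("tool"), str):
        return None
    args = call.get("args")
    return call["tool"], (args if isinstance(args, dict) else {})
```

Function calls, names, and attribute access all raise ValueError in safe_calc, so a model that emits __import__('os') gets an error instead of code execution.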
MCP Protocol Integration
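MCP (Model Context Protocol) is an open protocol, introduced by Anthropic, for connecting models to external tools and data sources. Ollama does not speak MCP itself; what it provides is function calling through the tools parameter of the chat API, for tool-capable models such as Llama 3.1. An MCP client or bridge sits in between: it launches the MCP servers and translates their tools into that format.
MCP Server config example (filesystem), in the mcpServers format read by MCP clients: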
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
}
}
}
Passing tools in API calls:
import ollama

response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "List all files in /tmp"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "filesystem_list",
                "description": "List files in a directory",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"}
                    }
                }
            }
        }
    ]
)
# Check if tool call was triggered
if response.get('message', {}).get('tool_calls'):
    for tool_call in response['message']['tool_calls']:
        print(f"Calling tool: {tool_call['function']['name']}")
        print(f"Args: {tool_call['function']['arguments']}")
MCP gotcha: Tool calling with local models through Ollama isn't fully polished yet; some tool calls fail or come back with malformed arguments. For production, use Claude CLI with Browserbase Skills, which has more mature MCP support.
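The snippet above only prints the tool call; something still has to execute it and hand the result back to the model. Here is a minimal sketch of that dispatch step. run_tool_calls and the registry mapping are hypothetical names, and it assumes a dict-shaped assistant message with the tool_calls layout shown above. Failures are turned into tool messages rather than exceptions, which helps with the flaky calls just mentioned.

```python
import json


def run_tool_calls(message, registry):
    """Execute each tool call in an assistant message.

    Returns a list of {"role": "tool", ...} messages to append to the
    conversation before asking the model for its final answer.
    """
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = call["function"]["arguments"]
        if isinstance(args, str):  # tolerate arguments sent as a JSON string
            args = json.loads(args)
        handler = registry.get(name)
        if handler is None:
            content = f"Unknown tool: {name}"
        else:
            try:
                content = str(handler(**args))
            except Exception as exc:  # report the failure to the model instead of crashing
                content = f"Tool {name} failed: {exc}"
        results.append({"role": "tool", "content": content})
    return results
```

After this, append the assistant message and the returned tool messages to the history and call the chat endpoint again to get the final answer.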
Context Management and Memory
Ollama defaults to a 4096-token context window in this setup (8B model); the default has changed between releases, so check yours. Adjust it per request with num_ctx:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"options": {"num_ctx": 8192},
"messages": [...]
}'
Strategies for long conversation context:
1. Simple truncation: Drop early messages when length exceeds a threshold. Simplest approach but risks losing important context.
2. Summary compression: Periodically compress prior conversation into a summary using the model. Keep only the summary going forward. A sketch:
import ollama

MODEL = "llama3.1:8b"

def summarize_if_needed(messages, max_turns=10):
    # Compress only once history is well past the window, so this doesn't run every turn
    if len(messages) > max_turns * 2:
        old_messages = messages[:-max_turns]
        recent = messages[-max_turns:]
        # Render old messages as a readable transcript instead of a raw list repr
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary_prompt = f"Summarize this conversation, keeping key info:\n{transcript}"
        summary = ollama.chat(model=MODEL, messages=[{"role": "user", "content": summary_prompt}])
        return [{"role": "system", "content": f"Prior conversation summary: {summary['message']['content']}"}] + recent
    return messages
3. Vector database storage: Store key points from each turn in a vector DB and retrieve the relevant history at query time. Most complex but most effective for long-term memory.
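Strategy 1 is short enough to show in full. The sketch below keeps system messages and then as many of the newest turns as fit a budget. truncate_history is a hypothetical helper, and counting characters instead of tokens is a crude assumption; swap in a real tokenizer if you need accuracy.

```python
def truncate_history(messages, max_chars=8000):
    """Keep system messages plus the most recent turns that fit in max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for message in reversed(others):  # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget and kept:
            break  # out of budget; everything older is dropped
        kept.append(message)  # the newest message is always kept
        budget -= cost
    return system + list(reversed(kept))
```

Keeping the system prompt out of the truncation window avoids the most common failure of naive truncation, where the model forgets its instructions halfway through a long session.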
Performance Optimization in Practice
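GPU offloading config: If VRAM is tight, limit how many layers are offloaded to the GPU. Ollama has no config.json for this; num_gpu (the number of layers sent to the GPU) and main_gpu (the index of the primary GPU) are model options, passed per request in the options field:
# Body for /api/generate or /api/chat; 20 layers is an illustrative value
{
  "model": "llama3.1:8b",
  "options": {"num_gpu": 20, "main_gpu": 0}
}
Or set it in Modelfile:
PARAMETER num_gpu 20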
Concurrent request handling: Ollama handles limited concurrency per instance. For high concurrency:
1. Raise the OLLAMA_NUM_PARALLEL environment variable on the server
2. Add load balancing in front (for example Nginx, tuned for performance)
3. Run multiple Ollama instances (different ports)
Testing on a VPS (4 cores, 8GB RAM): Ollama starts lagging with more than 3 concurrent requests. For production-level concurrency, consider vLLM or TGI instead.
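On a small box it also helps to cap concurrency on the client side, so a batch job never pushes more requests at Ollama than it can handle. A minimal sketch using a thread pool follows. run_bounded is a hypothetical helper, call_model stands for whatever function sends one prompt to Ollama, and the default of 3 simply mirrors the observation above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(call_model, prompts, max_concurrent=3):
    """Run call_model(prompt) for each prompt with at most max_concurrent in flight.

    Results come back in the same order as prompts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(call_model, prompts))
```

Threads are enough here because each call spends its time waiting on HTTP, not computing in Python.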
Model selection guide:
| Use Case | Recommended Model | Min Config |
|---|---|---|
| Code generation | codellama:34b | 24GB RAM |
| Chinese dialogue | qwen2.5:14b | 16GB RAM |
| Quick testing | llama3.2:3b | 4GB RAM |
| Balanced choice | llama3.1:8b | 8GB RAM |
Troubleshooting Common Issues
Issue 1: Slow model downloads
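ollama pull downloads models from the Ollama model registry. If your network is unstable, route it through a proxy. The download is performed by the Ollama server process, so the variable has to be set in the environment where ollama serve starts:
# Use proxy (set it for the server process)
export HTTPS_PROXY=http://127.0.0.1:7890
ollama serve
# Then, in another shell
ollama pull llama3.1:8b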
Issue 2: Out of memory (OOM)
Reduce num_gpu or switch to a smaller model:
ollama run llama3.2:3b # Much smaller than 8B
Issue 3: API timeout
Ollama has no default timeout, but some clients auto-set one. For manual control on Unix (signal.alarm is unavailable on Windows and only works in the main thread):
import signal
import ollama

MODEL = "llama3.1:8b"

def timeout_handler(signum, frame):
    raise TimeoutError("API call timed out")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(30)  # 30 second timeout
try:
    response = ollama.chat(model=MODEL, messages=[...])
finally:
    signal.alarm(0)  # Always cancel the pending alarm
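A portable alternative is to run the call in a worker thread and stop waiting after a deadline. The sketch below uses only the standard library; call_with_timeout is a hypothetical helper. To my knowledge the SDK's ollama.Client also accepts a timeout argument that is passed to its HTTP client, which is the simpler option when it is available to you.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; give up after timeout_s seconds.

    The worker is not killed on timeout. It finishes in the background,
    so pair this with a client-side HTTP timeout where possible.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"API call timed out after {timeout_s}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned worker
```

Unlike the signal version, this works on Windows and from non-main threads, at the cost of leaving the abandoned request running until it completes.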
Issue 4: Chinese encoding problems
Some models output garbled Chinese. Ensure your terminal uses UTF-8:
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
Summary
Ollama is one of the most convenient ways to run LLMs locally. The REST API and Python SDK are solid, and it's ideal for:
- Quick experiments during development and testing
- Use cases with strict data privacy requirements
- Cost-sensitive long-running tasks
Main limitations are model quality and concurrent performance. For complex tool-calling scenarios, wait for tool calling and the MCP integrations around Ollama to mature before using it in production.