Ollama Local LLM Advanced Usage: Modelfile Customization and RAG Vector Search
I've been using the Ollama API for 6 months now. What started as running ollama run to test models has evolved into using Modelfiles to customize model behavior and implementing RAG retrieval with embedding vectors. Along the way, I hit plenty of pitfalls. This article covers two core use cases in depth: **Modelfile Custom Configuration** and **Ollama + RAG Vector Search**.
Why You Need Modelfile
Running ollama run qwen2.5:7b out of the box works fine, but there are three real problems:
- **No persistent persona**: You can't have the model remember "you are a Python code review assistant"
- **Inconsistent parameters**: Every API call requires passing temperature, num_ctx — changing defaults means editing code everywhere
- **No reusable prompt templates**: Copy-pasting SYSTEM text across different models is messy
Modelfile solves all three. A single config file defines: which base model, which parameters, what persona, and how to structure the prompt template.
Part 1: Complete Modelfile Configuration Guide
1.1 Basic Modelfile
Create a "Technical Writer" persona:
# Create Modelfile
cat > Modelfile << 'EOF'
FROM qwen2.5:7b
# Parameter configuration
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 30
PARAMETER top_p 0.9
# System persona
SYSTEM """
You are a senior technical writer who explains complex concepts in clear, concise language.
Rules:
1. Keep paragraphs to 3 sentences maximum
2. Prefer lists over long paragraphs
3. All code examples must be runnable
4. When unsure about technical details, note "check official documentation"
"""
EOF
# Create custom model from Modelfile
ollama create tech-writer -f Modelfile
# Verify creation
ollama list
Key parameters (verified against docs.ollama.com/modelfile.md, 2026-05):
| Parameter | Default | Description | Recommended |
|---|---|---|---|
| num_ctx | 2048 | Context window size | 4096-8192 |
| temperature | 0.8 | Creativity (0=deterministic, 1=creative) | 0.5-0.9 |
| top_k | 40 | Number of candidate tokens considered at each step; lower = more conservative | 20-40 |
| top_p | 0.9 | Nucleus sampling, lower=more focused | 0.8-0.95 |
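When several personas share the same parameters, it can help to generate the Modelfile text from one place. The block below is a minimal sketch, not part of Ollama: `render_modelfile` is a hypothetical helper that only formats text using the FROM, PARAMETER, and SYSTEM instructions described above. Saving the output and running ollama create is still up to you.

```python
def render_modelfile(base_model, system_prompt, parameters):
    """Build Modelfile text from a base model, a persona, and a parameter dict.

    Hypothetical helper for illustration; it only formats text.
    """
    if '"""' in system_prompt:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes let the SYSTEM text span several lines.
    lines.append(f'SYSTEM """\n{system_prompt.strip()}\n"""')
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "qwen2.5:7b",
    "You are a senior technical writer.",
    {"temperature": 0.7, "num_ctx": 8192, "top_k": 30, "top_p": 0.9},
)
print(modelfile_text)
```

Keeping the parameters in one dict means a changed default is edited once rather than in every Modelfile.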
1.2 Advanced: Modelfile with Few-Shot Examples
Want structured output? Use the MESSAGE instruction:
cat > Modelfile-code-review << 'EOF'
FROM qwen2.5:7b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
SYSTEM """
You are a strict Python code review assistant.
"""
# Few-shot examples teach the model your expected output format.
# Each MESSAGE takes a role (user or assistant) followed by the content.
MESSAGE user def foo(x): return x*2
MESSAGE assistant """### Code Review Report
**Issue**: Function name too simple
**Suggestion**: Consider more descriptive name like `calculate_double`
**Severity**: Low
**Runnable**: ✅ Runnable
"""
MESSAGE user import os; print(os.system('ls'))
MESSAGE assistant """### Code Review Report
**Issue**: os.system() poses security risks
**Suggestion**: Use subprocess module instead
**Severity**: High
**Runnable**: ✅ Runnable
"""
# The real user input is supplied at request time; it does not belong in the Modelfile.
EOF
ollama create code-reviewer -f Modelfile-code-review
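The same few-shot idea can also be expressed per request. The block below is a minimal sketch, assuming the chat-style message format (a list of role/content dictionaries); `build_few_shot_messages` is a hypothetical helper, not an Ollama function. It shows that the example turns stay fixed while the real user input is added only when a request is made.

```python
def build_few_shot_messages(system_prompt, examples, user_input):
    """Return chat messages: system persona, example turns, then the real input.

    `examples` is a list of (user_text, assistant_text) pairs.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The actual request always comes last.
    messages.append({"role": "user", "content": user_input})
    return messages


examples = [
    ("def foo(x): return x*2",
     "### Code Review Report\n**Issue**: Function name too simple\n**Severity**: Low"),
]
messages = build_few_shot_messages(
    "You are a strict Python code review assistant.",
    examples,
    "def add(a, b): return a - b",
)
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

A list in this shape can be passed as the messages argument of ollama.chat in the Python client.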
1.3 Pitfall Record: FROM Path Issues
Pitfall 1: Model names must be exact
# Risky: no tag resolves to 'latest', so the size and quantization may differ from what you tested
FROM qwen2.5
# Better: pin the exact version and quantization level
FROM qwen2.5:7b-instruct-q4_K_M
Pitfall 2: num_ctx silently truncates when exceeding model limit
My test data:
- llama3.2:1b max num_ctx = 8192
- qwen2.5:7b max num_ctx = 32768
- gemma3:4b max num_ctx = 8192
Exceeding the limit raises no error; output quality just degrades. Run ollama show <model> first and check the context length it reports.
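Because the truncation is silent, a small guard in application code can make it visible. The block below is a minimal sketch: `safe_num_ctx` is a hypothetical helper, and the lookup table simply copies the limits from my test data above, so treat those numbers as assumptions to re-check with ollama show.

```python
# Limits copied from the test data above; verify them with `ollama show <model>`.
REPORTED_MAX_CTX = {
    "llama3.2:1b": 8192,
    "qwen2.5:7b": 32768,
    "gemma3:4b": 8192,
}


def safe_num_ctx(model, requested, limits=REPORTED_MAX_CTX):
    """Return (num_ctx, was_clamped) so callers can log instead of failing silently."""
    limit = limits.get(model)
    if limit is not None and requested > limit:
        return limit, True
    # Unknown model or value within the limit: pass it through unchanged.
    return requested, False


num_ctx, clamped = safe_num_ctx("llama3.2:1b", 16384)
print(num_ctx, clamped)
```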
Part 2: RAG Vector Search Practical Guide
2.1 Ollama Embedding Model Setup
RAG step 1: convert documents to vectors. I use the official embedding model nomic-embed-text:
# Install embedding model
ollama pull nomic-embed-text
# Verify installation
ollama list
# Should see nomic-embed-text
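The reason for converting text to vectors is that similar passages end up close together, so retrieval becomes a nearest-neighbor lookup. The block below is a minimal sketch of that idea using cosine similarity on toy vectors. The function names are hypothetical, and a vector database may use a different distance metric by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return the indices of the top_k most similar document vectors."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
top_two = rank_by_similarity([1.0, 0.0, 0.0], docs, top_k=2)
print(top_two)
```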
2.2 Complete Python RAG Implementation
Dependencies:
pip install ollama chromadb pypdf
Complete code:
import ollama
import chromadb
from chromadb.config import Settings
from pypdf import PdfReader
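# Initialize vector database (in-memory; see Pitfall 4 for persistence)
client = chromadb.Client(Settings(
    anonymized_telemetry=False,
    allow_reset=True
))
# get_or_create_collection avoids an error if the collection already exists
collection = client.get_or_create_collection("tech-docs")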
# Text chunking function
def chunk_text(text, chunk_size=500, overlap=50):
    """Split long text into overlapping chunks"""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
# Extract text from PDF
def extract_pdf_text(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text
# Index a document
def index_document(pdf_path, doc_id_prefix):
    text = extract_pdf_text(pdf_path)
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        # Generate embedding vector
        response = ollama.embeddings(model='nomic-embed-text', prompt=chunk)
        embedding = response['embedding']
        # Store in vector database
        collection.add(
            ids=[f"{doc_id_prefix}-{i}"],
            embeddings=[embedding],
            documents=[chunk]
        )
    print(f"Indexed: {len(chunks)} text chunks")
# Query function
def search_similar(query, top_k=3):
    # Vectorize the query
    response = ollama.embeddings(model='nomic-embed-text', prompt=query)
    query_embedding = response['embedding']
    # Similarity search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results
# Usage example
if __name__ == "__main__":
    # Index a technical document (assuming PDF exists)
    # index_document("technical-guide.pdf", "guide-001")
    # Query relevant passages
    results = search_similar("What Modelfile PARAMETER options does Ollama support")
    print("Top 3 relevant passages:")
    for i, doc in enumerate(results['documents'][0]):
        print(f"\n--- Result {i+1} ---")
        print(doc[:200] + "..." if len(doc) > 200 else doc)
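To see what the overlapping chunking in chunk_text actually produces, it helps to look at the index ranges alone. The block below is a minimal sketch: `chunk_spans` is a hypothetical helper that uses the same step of chunk_size - overlap, and it adds a guard (my assumption, not in the code above) against an overlap that would make the step zero or negative.

```python
def chunk_spans(text_length, chunk_size=500, overlap=50):
    """Return (start, end) index pairs for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, text_length))
            for start in range(0, text_length, step)]


spans = chunk_spans(1200)
print(spans)  # [(0, 500), (450, 950), (900, 1200)]
```

Each chunk starts 450 characters after the previous one, so neighbors share 50 characters. The last span can be much shorter than chunk_size, which is worth keeping in mind when judging retrieval quality.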
2.3 RAG Pitfall Records
Pitfall 3: Embedding model version mismatch
In Ollama 0.1.42+, the embedding API format changed. New version:
# New API (0.1.42+)
response = ollama.embeddings(model='nomic-embed-text', prompt=chunk)
Old versions had different formats. If you get "embedding type not found", check the installed Ollama version first, then the model metadata:
ollama --version
ollama show nomic-embed-text | head -5
Pitfall 4: ChromaDB persistence issue
Default in-memory mode loses data on restart. For production:
# Persist vector database
client = chromadb.PersistentClient(path="./chroma_db")
# get_or_create_collection reopens the stored collection on later runs instead of raising
collection = client.get_or_create_collection("tech-docs")
Pitfall 5: Poor Chinese tokenization
English-centric embedding models handle Chinese poorly. Solutions:
# Option 1: Use a multilingual embedding model (re-index afterwards; vectors from different models are not comparable)
ollama pull bge-m3 # Multilingual; mxbai-embed-large is trained mainly on English text
# Option 2: Preprocess Chinese text
import re
def preprocess_chinese(text):
    # Collapse runs of whitespace into one space; punctuation is left untouched
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
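Fixed-width character chunks can cut Chinese sentences in half, which hurts retrieval regardless of the embedding model. One possible preprocessing step, not part of the code above, is to split on sentence-ending punctuation and pack whole sentences into chunks. The block below is a minimal sketch; both function names and the 35-character budget in the demo are assumptions for illustration.

```python
import re

# A sentence is a run of non-terminators followed by an optional terminator.
_SENTENCE = re.compile(r'[^。！？!?]+[。！？!?]?')


def split_sentences(text):
    """Split text into sentences, keeping terminal punctuation attached."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def pack_sentences(sentences, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


text = "Ollama 支持本地运行模型。Modelfile 用于定制参数！RAG 需要向量检索吗？需要。"
sentences = split_sentences(text)
chunks = pack_sentences(sentences, max_chars=35)
for chunk in chunks:
    print(len(chunk), chunk)
```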
Part 3: Modelfile Advanced Techniques
3.1 Template Variables: Dynamic Prompt Generation
Modelfile's TEMPLATE instruction uses Go template syntax; variables such as {{ .System }} and {{ .Prompt }} are filled in at request time:
cat > Modelfile-interview << 'EOF'
FROM qwen2.5:7b
PARAMETER temperature 0.8
PARAMETER num_ctx 4096
TEMPLATE """
<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
SYSTEM """
You are a mock interviewer. The user's message states the candidate's position.
Generate 5 technical questions based on that position.
"""
EOF
ollama create interview-assistant -f Modelfile-interview
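To make the substitution concrete, the block below mimics what the final prompt string looks like once the system text and the user prompt are filled in. It is a minimal sketch only: Ollama renders templates with Go's template engine, whereas this stand-in uses Python's str.format, and `render_chatml` is a hypothetical name.

```python
# Same layout as the TEMPLATE above, with Python placeholders instead of Go ones.
CHATML_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def render_chatml(system, prompt):
    """Mimic the substitution of .System and .Prompt into the template."""
    return CHATML_TEMPLATE.format(system=system, prompt=prompt)


rendered = render_chatml(
    "You are a mock interviewer.",
    "Backend engineer (Python)",
)
print(rendered)
```

The string ends right after the assistant marker, which is where the model starts generating.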
3.2 ADAPTER: Fine-Tuned Adapter
If you've done QLoRA fine-tuning, load it with ADAPTER:
cat > Modelfile-finetuned << 'EOF'
FROM qwen2.5:7b
# Load fine-tuned adapter weights
ADAPTER ./qwen-finetuned/adapter
PARAMETER num_ctx 4096
PARAMETER temperature 0.5
EOF
ollama create my-finetuned-model -f Modelfile-finetuned
Note: the adapter must have been trained on the same base model that FROM names (matching architecture), and ADAPTER requires Ollama >= 0.1.40.
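A mismatch between adapter and base model is easier to catch before running ollama create. The block below is a minimal sketch, assuming the adapter was saved with Hugging Face PEFT, which writes an adapter_config.json containing a base_model_name_or_path field. `adapter_base_model` is a hypothetical helper, and the demo uses a throwaway directory in place of ./qwen-finetuned/adapter.

```python
import json
import tempfile
from pathlib import Path


def adapter_base_model(adapter_dir):
    """Return the base model recorded in a PEFT-style adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"no adapter_config.json in {adapter_dir}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("base_model_name_or_path")


# Demo: a temporary directory stands in for the real adapter directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {"base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct"}
    (Path(tmp) / "adapter_config.json").write_text(
        json.dumps(demo_config), encoding="utf-8"
    )
    recorded_base = adapter_base_model(tmp)

print(recorded_base)
```

Comparing the recorded name with the model behind the FROM line is a manual step, since Hugging Face names and Ollama tags differ.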
Part 4: Real-World Application Scenarios
4.1 Build a Local Code Assistant
The assistant's rules go into the SYSTEM block:
cat > Modelfile-codeassist << 'EOF'
FROM qwen2.5:7b
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 2048
SYSTEM """
You are a professional Python coding assistant.
- Include runnable code examples in every answer
- For complex problems, provide multiple approaches with tradeoffs
- Clearly state when you are unsure
"""
EOF
ollama create code-assist -f Modelfile-codeassist
# Usage
ollama run code-assist "Explain what this code does: def f(x): return [i**2 for i in x if i%2]"
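The same model can be called from Python without repeating any parameters, since the defaults live in the Modelfile. The block below is a minimal sketch: `ask` is a hypothetical wrapper around ollama.generate, and the `generate` argument plus `fake_generate` exist only so the example runs without an Ollama server.

```python
def ask(prompt, model="code-assist", generate=None):
    """Send one prompt to a custom model and return the text of the reply.

    `generate` defaults to ollama.generate; it is injectable so the function
    can be exercised without a running Ollama server.
    """
    if generate is None:
        import ollama  # imported lazily; requires `pip install ollama`
        generate = ollama.generate
    # No options are passed: temperature, num_ctx, etc. come from the Modelfile.
    response = generate(model=model, prompt=prompt)
    return response["response"]


def fake_generate(model, prompt):
    """Stand-in for ollama.generate used only in this demo."""
    return {"response": f"[{model}] received {len(prompt)} characters"}


reply = ask("Explain: def f(x): return [i**2 for i in x if i%2]", generate=fake_generate)
print(reply)
```

Dropping the generate argument makes the call go to the local Ollama server instead of the stand-in.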
4.2 Combine RAG with Modelfile
Inject RAG-retrieved context into the prompt sent to the custom model created from the Modelfile:
def rag_augmented_query(user_query, collection):
    """RAG-enhanced query with context injection"""
    # 1. Retrieve relevant documents
    results = search_similar(user_query, top_k=3)
    context = "\n\n".join(results['documents'][0])
    # 2. Build enhanced prompt
    enhanced_prompt = f"""Answer based on the following references. If information is insufficient, say so.
References:
{context}
User question: {user_query}
"""
    # 3. Call Ollama API
    response = ollama.generate(
        model='tech-writer',  # Modelfile model created earlier
        prompt=enhanced_prompt,
        options={
            'temperature': 0.7,
            'num_ctx': 8192
        }
    )
    return response['response']
# Usage
answer = rag_augmented_query("What PARAMETER options does Modelfile support", collection)
print(answer)
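Retrieved chunks plus the question must still fit inside num_ctx, otherwise the context is truncated without warning. The block below is a minimal sketch of a prompt builder with a character budget. `build_rag_prompt` is a hypothetical helper, and the 6000-character default is an arbitrary assumption, since characters only approximate tokens.

```python
def build_rag_prompt(user_query, documents, max_context_chars=6000):
    """Join retrieved chunks into a prompt, stopping once the budget is reached."""
    selected, used = [], 0
    for doc in documents:
        if used + len(doc) > max_context_chars:
            break
        selected.append(doc)
        used += len(doc)
    # Make an empty retrieval explicit so the model can say it lacks information.
    context = "\n\n".join(selected) if selected else "(no references retrieved)"
    return (
        "Answer based on the following references. "
        "If information is insufficient, say so.\n"
        f"References:\n{context}\n"
        f"User question: {user_query}\n"
    )


demo_prompt = build_rag_prompt(
    "What PARAMETER options does Modelfile support",
    ["a" * 10, "b" * 10, "c" * 10],
    max_context_chars=25,
)
print(demo_prompt)
```

Chunks are taken in retrieval order, so the most relevant ones survive when the budget is tight.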
Summary
This article covered two core Ollama capabilities, plus the pitfalls around them:
1. Modelfile: Define model parameters and persona persistently via FROM/PARAMETER/SYSTEM/TEMPLATE/MESSAGE instructions
2. RAG Retrieval: Generate vectors with nomic-embed-text, store and search with ChromaDB, inject context for precise Q&A
3. Pitfalls: num_ctx limits, version compatibility, Chinese tokenization, vector database persistence
Further exploration:
- Pair Ollama with OpenWebUI for a friendlier web interface
- Use Lobe Chat to connect with Ollama for multimodal conversations
- Automate RAG document updates with n8n workflows