Ollama Local LLM Advanced Usage: Modelfile Customization and RAG Vector Search

Tags: Ollama, Modelfile, RAG, Vector Search, Local AI, Embedding

I've been using the Ollama API for 6 months now. What started as running ollama run to test models has evolved into using Modelfiles to customize model behavior and implementing RAG retrieval with embedding vectors. Along the way, I hit plenty of pitfalls. This article covers two core use cases in depth: **Modelfile Custom Configuration** and **Ollama + RAG Vector Search**.

Why You Need Modelfile

Running ollama run qwen2.5:7b out of the box works fine, but there are three real problems:

**No persistent persona**: You can't have the model remember "you are a Python code review assistant"
**Inconsistent parameters**: Every API call requires passing temperature and num_ctx, so changing a default means editing code everywhere
**No reusable prompt templates**: Copy-pasting SYSTEM text across different models is messy

Modelfile solves all three. A single config file defines: which base model, which parameters, what persona, and how to structure the prompt template.
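As a rough illustration of how those pieces fit into one file, the sketch below assembles Modelfile text from plain Python values. `render_modelfile` is a hypothetical helper, not part of Ollama, and it only emits the FROM, PARAMETER, and SYSTEM instructions discussed in this article.

```python
def render_modelfile(base_model, parameters, system_prompt):
    """Build Modelfile text from a base model, a parameter dict, and a persona."""
    lines = [f"FROM {base_model}", ""]
    # One PARAMETER line per setting, in insertion order
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append("")
    # Triple quotes let the SYSTEM text span several lines
    lines.append(f'SYSTEM """\n{system_prompt.strip()}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile(
        "qwen2.5:7b",
        {"temperature": 0.7, "num_ctx": 8192},
        "You are a senior technical writer.",
    ))
```

Generating the file this way keeps the defaults in one place, which is the same benefit the Modelfile itself provides over per-call options.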

Part 1: Complete Modelfile Configuration Guide

1.1 Basic Modelfile

Create a "Technical Writer" persona:

# Create Modelfile
cat > Modelfile << 'EOF'
FROM qwen2.5:7b

# Parameter configuration
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_k 30
PARAMETER top_p 0.9

# System persona
SYSTEM """
You are a senior technical writer who explains complex concepts in clear, concise language.
Rules:
1. Keep paragraphs to 3 sentences maximum
2. Prefer lists over long paragraphs
3. All code examples must be runnable
4. When unsure about technical details, note "check official documentation"
"""
EOF

# Create custom model from Modelfile
ollama create tech-writer -f Modelfile

# Verify creation
ollama list

Key parameters (verified against docs.ollama.com/modelfile.md, 2026-05):

Parameter	Default	Description	Recommended
num_ctx	2048	Context window size	4096-8192
temperature	0.8	Creativity (0=deterministic, 1=creative)	0.5-0.9
top_k	40	Number of candidate tokens considered, lower=more conservative	20-40
top_p	0.9	Nucleus sampling, lower=more focused	0.8-0.95
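One way to use the table is to flag settings that drift outside the recommended ranges before writing them into a Modelfile. This is a minimal sketch: the ranges are copied from the "Recommended" column above, and `out_of_range` is a hypothetical helper name.

```python
# Ranges copied from the "Recommended" column of the table above
RECOMMENDED = {
    "num_ctx": (4096, 8192),
    "temperature": (0.5, 0.9),
    "top_k": (20, 40),
    "top_p": (0.8, 0.95),
}


def out_of_range(params):
    """Return the parameters whose values fall outside the recommended range."""
    flagged = {}
    for name, value in params.items():
        if name not in RECOMMENDED:
            continue  # No recommendation recorded for this parameter
        low, high = RECOMMENDED[name]
        if not low <= value <= high:
            flagged[name] = (value, low, high)
    return flagged


if __name__ == "__main__":
    print(out_of_range({"temperature": 0.3, "top_k": 30, "num_ctx": 16384}))
```

A flagged value is not necessarily wrong. The code-review Modelfile in section 1.2 deliberately uses temperature 0.3 for more deterministic output.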

1.2 Advanced: Modelfile with Few-Shot Examples

Want structured output? Use the MESSAGE instruction:

cat > Modelfile-code-review << 'EOF'
FROM qwen2.5:7b

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1

SYSTEM """
You are a strict Python code review assistant.
"""

# Few-shot examples teach the model your expected output format.
# Each MESSAGE takes a role (user or assistant) followed by the text.
MESSAGE user def foo(x): return x*2
MESSAGE assistant """### Code Review Report
**Issue**: Function name too simple
**Suggestion**: Consider more descriptive name like `calculate_double`
**Severity**: Low
**Runnable**: ✅ Runnable"""
MESSAGE user import os; print(os.system('ls'))
MESSAGE assistant """### Code Review Report
**Issue**: os.system() poses security risks
**Suggestion**: Use subprocess module instead
**Severity**: High
**Runnable**: ✅ Runnable"""
# The real user input is supplied at run time (ollama run or the API), not in the Modelfile.
EOF

ollama create code-reviewer -f Modelfile-code-review
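The same few-shot pattern can also be applied per request, which is where the user's actual input belongs. The sketch below builds the chat history in Python; `build_messages` and `review` are hypothetical helper names, the example answers are shortened, and `review` assumes the `ollama` package and a running server.

```python
FEW_SHOT = [
    ("def foo(x): return x*2",
     "### Code Review Report\n**Issue**: Function name too simple\n**Severity**: Low"),
    ("import os; print(os.system('ls'))",
     "### Code Review Report\n**Issue**: os.system() poses security risks\n**Severity**: High"),
]


def build_messages(user_input, examples=FEW_SHOT):
    """Return a chat history: few-shot pairs first, then the real request."""
    messages = []
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_input})
    return messages


def review(code, model="qwen2.5:7b"):
    """Send the code to the model; needs a running Ollama server."""
    import ollama  # Imported lazily so build_messages works without the package

    response = ollama.chat(model=model, messages=build_messages(code))
    return response["message"]["content"]
```

If the target model already carries few-shot examples in its Modelfile, passing `examples=[]` avoids sending them twice.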

1.3 Pitfall Record: FROM Path Issues

Pitfall 1: Model names must be exact

# ❌ Wrong: no tag defaults to 'latest', so the version may differ between machines
FROM qwen2.5

# ✅ Correct: specify the exact version and quantization level
FROM qwen2.5:7b-instruct-q4_K_M

Pitfall 2: num_ctx silently truncates when exceeding model limit

My test data:

llama3.2:1b max num_ctx = 8192
qwen2.5:7b max num_ctx = 32768
gemma3:4b max num_ctx = 8192

Exceeding the limit raises no error; output quality just degrades. Run ollama show <model> first to check the context length the model actually reports.
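Because nothing fails loudly, it can help to clamp the requested value in code. The sketch below assumes the model metadata is available as a mapping with an `<architecture>.context_length` key, which is how Ollama's show endpoint appears to report it. The helper names and the sample metadata are illustrative.

```python
def context_length_from_model_info(model_info):
    """Find the '<architecture>.context_length' entry in a metadata mapping."""
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value)
    return None  # The model did not report a context length


def safe_num_ctx(requested, model_info):
    """Clamp the requested num_ctx to the reported limit, if one is known."""
    limit = context_length_from_model_info(model_info)
    if limit is None:
        return requested
    return min(requested, limit)


if __name__ == "__main__":
    # Made-up metadata for illustration; read the real values from your own model
    sample = {"general.architecture": "qwen2", "qwen2.context_length": 32768}
    print(safe_num_ctx(65536, sample))
```

The clamped value can then be passed as `num_ctx` in the request options or written into the Modelfile.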

Part 2: RAG Vector Search Practical Guide

2.1 Ollama Embedding Model Setup

RAG step 1: convert documents to vectors. I use the official embedding model nomic-embed-text:

# Install embedding model
ollama pull nomic-embed-text

# Verify installation
ollama list
# Should see nomic-embed-text

2.2 Complete Python RAG Implementation

Dependencies:

pip install ollama chromadb pypdf

Complete code:

import ollama
import chromadb
from chromadb.config import Settings
from pypdf import PdfReader

# Initialize vector database
client = chromadb.Client(Settings(
    anonymized_telemetry=False,
    allow_reset=True
))
collection = client.create_collection("tech-docs")

# Text chunking function
def chunk_text(text, chunk_size=500, overlap=50):
    """Split long text into overlapping chunks"""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Extract text from PDF
def extract_pdf_text(pdf_path):
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

# Index a document
def index_document(pdf_path, doc_id_prefix):
    text = extract_pdf_text(pdf_path)
    chunks = chunk_text(text)

    for i, chunk in enumerate(chunks):
        # Generate embedding vector
        response = ollama.embeddings(model='nomic-embed-text', prompt=chunk)
        embedding = response['embedding']

        # Store in vector database
        collection.add(
            ids=[f"{doc_id_prefix}-{i}"],
            embeddings=[embedding],
            documents=[chunk]
        )

    print(f"Indexed: {len(chunks)} text chunks")

# Query function
def search_similar(query, top_k=3):
    # Vectorize the query
    response = ollama.embeddings(model='nomic-embed-text', prompt=query)
    query_embedding = response['embedding']

    # Similarity search
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    return results

# Usage example
if __name__ == "__main__":
    # Index a technical document (assuming PDF exists)
    # index_document("technical-guide.pdf", "guide-001")

    # Query relevant passages
    results = search_similar("What Modelfile PARAMETER options does Ollama support")
    print("Top 3 relevant passages:")
    for i, doc in enumerate(results['documents'][0]):
        print(f"\n--- Result {i+1} ---")
        print(doc[:200] + "..." if len(doc) > 200 else doc)
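To make the similarity search less of a black box, here is a brute-force version in plain Python. It is only a conceptual sketch: it uses cosine similarity, whereas ChromaDB uses an approximate index and its configured distance metric may differ. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k_similar([1.0, 0.1], docs, k=2))
```

Scanning every vector like this is fine for a few thousand chunks. A vector database earns its keep once the collection grows beyond that.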

2.3 RAG Pitfall Records

Pitfall 3: Embedding model version mismatch

In Ollama 0.1.42+, the embedding API format changed. New version:

# New API (0.1.42+)
response = ollama.embeddings(model='nomic-embed-text', prompt=chunk)

Old versions had different formats. If you get "embedding type not found", check the Ollama version and the installed model first:

ollama --version
ollama show nomic-embed-text | head -5
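If the code has to survive both response shapes, a small normalizer helps. This sketch assumes the two shapes are a single vector under `embedding` and a batch of vectors under `embeddings`. `extract_vectors` is a hypothetical helper and expects a dict-like response.

```python
def extract_vectors(response):
    """Normalize an embedding response to a list of vectors."""
    if "embeddings" in response:
        # Batch shape: {"embeddings": [[...], [...]]}
        return [list(vec) for vec in response["embeddings"]]
    if "embedding" in response:
        # Single shape: {"embedding": [...]}
        return [list(response["embedding"])]
    raise KeyError("response contains neither 'embedding' nor 'embeddings'")


if __name__ == "__main__":
    print(extract_vectors({"embedding": [0.1, 0.2]}))
    print(extract_vectors({"embeddings": [[0.1, 0.2], [0.3, 0.4]]}))
```

Callers can then always iterate over a list of vectors, whichever call produced the response.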

Pitfall 4: ChromaDB persistence issue

Default in-memory mode loses data on restart. For production:

# Persist vector database
client = chromadb.PersistentClient(path="./chroma_db")
# get_or_create_collection avoids an error when the collection already exists on restart
collection = client.get_or_create_collection("tech-docs")
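Once the database persists across runs, re-running the indexing script would try to add the same chunks again. One possible remedy, sketched below, is to derive IDs from the chunk text and write with the collection's `upsert` method. `chunk_id` and `upsert_chunks` are hypothetical helpers, and `embed_fn` is whatever text-to-vector function you supply.

```python
import hashlib


def chunk_id(doc_id_prefix, chunk):
    """Derive a stable ID from the document prefix and the chunk text."""
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
    return f"{doc_id_prefix}-{digest}"


def upsert_chunks(collection, doc_id_prefix, chunks, embed_fn):
    """Write chunks so that re-indexing the same text overwrites, not duplicates.

    embed_fn is a caller-supplied function mapping text to a vector.
    """
    # Drop repeated chunks so the batch never carries the same ID twice
    unique = list(dict.fromkeys(chunks))
    if not unique:
        return 0
    collection.upsert(
        ids=[chunk_id(doc_id_prefix, c) for c in unique],
        embeddings=[embed_fn(c) for c in unique],
        documents=unique,
    )
    return len(unique)
```

The trade-off is that an edited chunk gets a new ID, so stale chunks from an old version of the document still need to be deleted separately.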

Pitfall 5: Poor Chinese tokenization

English embedding models handle Chinese poorly. Solutions:

# Option 1: Use a multilingual embedding model
ollama pull bge-m3  # Multilingual model; mxbai-embed-large is trained mainly on English

# Option 2: Preprocess Chinese text
import re

def preprocess_chinese(text):
    # Remove excess whitespace, preserve basic punctuation
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
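Preprocessing can go one step further for Chinese: fixed-width character chunks often cut sentences in half, so splitting on sentence-ending punctuation may give cleaner chunks. This is a minimal sketch, not a full tokenizer. `chunk_by_sentence` is a hypothetical helper and the punctuation set is an assumption.

```python
import re

# Split after Chinese or Western sentence-ending punctuation
_SENTENCE_END = re.compile(r"(?<=[。！？!?])")


def chunk_by_sentence(text, max_chars=500):
    """Group whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_by_sentence("今天天气很好。我们去公园吧！你觉得呢？", max_chars=14))
```

A single sentence longer than `max_chars` is kept whole as its own oversized chunk rather than being cut.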

Part 3: Modelfile Advanced Techniques

3.1 Template Variables: Dynamic Prompt Generation

Modelfile's TEMPLATE instruction supports variable substitution:

cat > Modelfile-interview << 'EOF'
FROM qwen2.5:7b

PARAMETER temperature 0.8
PARAMETER num_ctx 4096

TEMPLATE """
<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
# SYSTEM text is inserted as-is; template variables are only expanded inside TEMPLATE
SYSTEM """
You are a mock interviewer. The user's message states the candidate's position.
Generate 5 technical questions based on that position.
"""
EOF

ollama create interview-assistant -f Modelfile-interview
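To see what the model receives after substitution, the template can be mimicked in a few lines of Python. This is an illustration only: Python's `str.format` stands in for the Go template engine Ollama uses, and `render_chatml` is a hypothetical helper.

```python
CHATML_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def render_chatml(system, prompt):
    """Fill the two slots the way {{ .System }} and {{ .Prompt }} are filled."""
    return CHATML_TEMPLATE.format(system=system.strip(), prompt=prompt.strip())


if __name__ == "__main__":
    print(render_chatml("You are a mock interviewer.", "Backend engineer"))
```

The rendered text ends with an open assistant turn, which is what prompts the model to start generating its reply.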

3.2 ADAPTER: Fine-Tuned Adapter

If you've done QLoRA fine-tuning, load it with ADAPTER:

cat > Modelfile-finetuned << 'EOF'
FROM qwen2.5:7b

# Load fine-tuned adapter weights
ADAPTER ./qwen-finetuned/adapter

PARAMETER num_ctx 4096
PARAMETER temperature 0.5
EOF

ollama create my-finetuned-model -f Modelfile-finetuned

Note: ADAPTER requires matching model architecture and Ollama >= 0.1.40.

Part 4: Real-World Application Scenarios

4.1 Build a Local Code Assistant

cat > Modelfile-codeassist << 'EOF'
FROM qwen2.5:7b

PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.1
PARAMETER num_predict 2048

SYSTEM """
You are a professional Python coding assistant.
Rules:
1. Include runnable code examples in every answer
2. For complex problems, provide multiple approaches with tradeoffs
3. Clearly state when you are unsure
"""
EOF

ollama create code-assist -f Modelfile-codeassist

# Usage
ollama run code-assist "Explain what this code does: def f(x): return [i**2 for i in x if i%2]"

4.2 Combine RAG with Modelfile

Inject RAG-retrieved context into the prompt sent to the Modelfile-based model:

def rag_augmented_query(user_query, collection):
    """RAG-enhanced query with context injection"""
    # 1. Retrieve relevant documents
    results = search_similar(user_query, top_k=3)
    context = "\n\n".join(results['documents'][0])

    # 2. Build enhanced prompt
    enhanced_prompt = f"""Answer based on the following references. If information is insufficient, say so.

References:
{context}

User question: {user_query}
"""

    # 3. Call Ollama API
    response = ollama.generate(
        model='tech-writer',  # Modelfile model created earlier
        prompt=enhanced_prompt,
        options={
            'temperature': 0.7,
            'num_ctx': 8192
        }
    )

    return response['response']

# Usage
answer = rag_augmented_query("What PARAMETER options does Modelfile support", collection)
print(answer)
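Retrieved passages can overflow num_ctx if they are joined without a limit. A possible guard, sketched below, keeps passages in ranked order until a character budget is spent. `build_rag_prompt` is a hypothetical helper, and the 6000-character default is an arbitrary assumption, since characters are only a rough proxy for tokens.

```python
def build_rag_prompt(user_query, documents, max_context_chars=6000):
    """Join retrieved passages, most relevant first, until the budget is spent."""
    selected, used = [], 0
    for doc in documents:
        if used + len(doc) > max_context_chars:
            break  # Later passages rank lower, so stop at the first overflow
        selected.append(doc)
        used += len(doc)
    context = "\n\n".join(selected)
    return (
        "Answer based on the following references. "
        "If information is insufficient, say so.\n\n"
        f"References:\n{context}\n\n"
        f"User question: {user_query}\n"
    )


if __name__ == "__main__":
    print(build_rag_prompt("What is num_ctx?", ["num_ctx sets the context window."]))
```

The returned string can be passed as the `prompt` argument in place of the inline f-string used above.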

Summary

This article covered two core Ollama capabilities, plus the pitfalls around them:

1. Modelfile: Define model parameters and persona persistently via FROM/PARAMETER/SYSTEM/TEMPLATE/MESSAGE instructions

2. RAG Retrieval: Generate vectors with nomic-embed-text, store and search with ChromaDB, inject context for precise Q&A

3. Pitfalls: num_ctx limits, version compatibility, Chinese tokenization, vector database persistence

Further exploration:

Pair Ollama with OpenWebUI for a friendlier web interface
Use Lobe Chat to connect with Ollama for multimodal conversations
Automate RAG document updates with n8n workflows
