Table of Contents#
- Prerequisites
- Introduction
- Quick Decision Guide
- Fine-Tuning Deep Dive
- RAG Deep Dive
- Prompt Engineering Deep Dive
- Comparison Matrix
- Decision Framework
- Conclusion
Prerequisites#
- Basic understanding of LLMs and their capabilities
- Python 3.8+ and Node.js 18+ installed
- Familiarity with API usage and basic ML concepts
- Access to LLM APIs (OpenAI, Anthropic, or similar)
- Basic knowledge of vector databases (for RAG)
Introduction#
In the evolving landscape of AI development, choosing the right optimization strategy for large language models (LLMs) is crucial. With multiple approaches available, understanding when to use fine-tuning, retrieval-augmented generation (RAG), or prompt engineering can significantly impact your system’s performance, cost, and time to market.
The Three Approaches at a Glance#
- Fine-Tuning: Retraining the model on your specific data
- RAG: Augmenting prompts with retrieved context
- Prompt Engineering: Crafting optimal prompts without model changes
graph TD
    A[Optimization Strategies] -->|Fine-Tuning| B[Specialized Applications]
    A -->|RAG| C[Data-Heavy Tasks]
    A -->|Prompt Engineering| D[General Use Cases]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#9f9
    style D fill:#ff9
Quick Decision Guide#
interface DecisionCriteria {
budget: 'low' | 'medium' | 'high';
dataVolume: 'small' | 'medium' | 'large';
updateFrequency: 'static' | 'occasional' | 'frequent';
latencyRequirement: 'realtime' | 'nearRealtime' | 'batch';
accuracy: 'acceptable' | 'high' | 'critical';
}
function recommendApproach(criteria: DecisionCriteria): string {
// Critical accuracy + stable data = Fine-tuning
if (criteria.accuracy === 'critical' && criteria.updateFrequency === 'static') {
return 'fine-tuning';
}
// Large dynamic data + updates = RAG
if (criteria.dataVolume === 'large' && criteria.updateFrequency === 'frequent') {
return 'rag';
}
// Low budget + flexibility = Prompt Engineering
if (criteria.budget === 'low' && criteria.accuracy === 'acceptable') {
return 'prompt-engineering';
}
// Default to hybrid approach
return 'hybrid-rag-prompt';
}
Quick Decision Matrix#
Scenario | Recommended Approach | Why |
---|---|---|
Customer support chatbot | Prompt Engineering + RAG | Dynamic responses with knowledge base |
Medical diagnosis assistant | Fine-tuning | High accuracy critical, domain-specific |
Code generation tool | Fine-tuning + Prompt Engineering | Specialized syntax with flexible outputs |
Document Q&A system | RAG | Large document corpus, real-time updates |
Creative writing assistant | Prompt Engineering | Flexibility and creativity paramount |
Fine-Tuning Deep Dive#
What is Fine-Tuning?#
Fine-tuning involves updating a pre-trained model’s weights using your specific dataset. This creates a specialized version of the model that excels at your particular use case while retaining general language understanding.
When to Use Fine-Tuning#
✅ Perfect for:
- Domain-specific language (legal, medical, technical)
- Consistent output format requirements
- Proprietary knowledge integration
- Style/tone consistency (brand voice)
- High-accuracy classification tasks
❌ Avoid when:
- Data changes frequently
- Budget is limited
- Need real-time information
- Small dataset (<1000 examples)
- Quick iterations needed
Production Implementation with Modal#
# fine_tune_modal.py
import modal
from modal import Image, Secret, gpu
import json
from typing import Dict, List
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
from datasets import Dataset
import wandb
# Define Modal app
app = modal.App("llm-fine-tuning")
# Custom image with dependencies
training_image = (
Image.debian_slim()
.pip_install(
"transformers==4.36.0",
"torch==2.1.0",
"datasets==2.14.0",
"accelerate==0.24.0",
"bitsandbytes==0.41.0",
"wandb==0.15.0",
"peft==0.6.0" # For LoRA fine-tuning
)
)
@app.function(
image=training_image,
gpu=gpu.A100(memory=40), # Use A100 for faster training
secrets=[Secret.from_name("huggingface"), Secret.from_name("wandb")],
timeout=3600, # 1 hour timeout
)
def fine_tune_model(
model_name: str = "meta-llama/Llama-2-7b-hf",
dataset_path: str = "./training_data.jsonl",
output_dir: str = "./fine_tuned_model",
use_lora: bool = True
):
# Initialize wandb for experiment tracking
wandb.init(project="llm-fine-tuning", name=f"fine_tune_{model_name.split('/')[-1]}")
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
if use_lora:
from peft import LoraConfig, get_peft_model, TaskType
# Load model in 8-bit for memory efficiency
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True,
device_map="auto",
torch_dtype=torch.float16
)
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
else:
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
    # Load and preprocess dataset
    # NOTE: this function runs in a remote Modal container, so dataset_path must
    # point to a file that exists there (e.g., added via a modal Mount or Volume);
    # a file that only exists on your local machine will not be found.
    with open(dataset_path, 'r') as f:
        data = [json.loads(line) for line in f]
def preprocess_function(examples):
# Format: "### Human: {prompt}\n### Assistant: {response}"
texts = []
for prompt, response in zip(examples['prompt'], examples['response']):
text = f"### Human: {prompt}\n### Assistant: {response}"
texts.append(text)
model_inputs = tokenizer(
texts,
max_length=512,
truncation=True,
padding=True
)
model_inputs["labels"] = model_inputs["input_ids"].copy()
return model_inputs
dataset = Dataset.from_dict({
'prompt': [d['prompt'] for d in data],
'response': [d['response'] for d in data]
})
tokenized_dataset = dataset.map(
preprocess_function,
batched=True,
remove_columns=dataset.column_names
)
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=100,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="no",
report_to="wandb",
push_to_hub=True,
hub_model_id=f"{model_name.split('/')[-1]}-fine-tuned"
)
# Create trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False
)
)
# Train
trainer.train()
# Save model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
# Log metrics
wandb.log({
"final_loss": trainer.state.log_history[-1]['loss'],
"total_steps": trainer.state.global_step
})
return {"status": "success", "model_path": output_dir}
# Deploy fine-tuning job
@app.local_entrypoint()
def main():
# Prepare your training data
training_data = [
{"prompt": "What is machine learning?", "response": "Machine learning is..."},
# Add more training examples
]
with open("training_data.jsonl", "w") as f:
for item in training_data:
f.write(json.dumps(item) + "\n")
# Run fine-tuning on Modal
result = fine_tune_model.remote(
model_name="microsoft/phi-2",
dataset_path="training_data.jsonl",
use_lora=True
)
print(f"Fine-tuning completed: {result}")
Fine-Tuning Best Practices#
- Data Quality: Ensure high-quality, diverse training data
- Validation Split: Always hold out a validation set
- Learning Rate: Start around 2e-4 for LoRA (as in the example above) or 2e-5 for full fine-tuning, then adjust based on loss
- Batch Size: Balance GPU memory against training stability
- Early Stopping: Monitor validation loss to prevent overfitting; a sketch combining a validation split with early stopping follows below
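For the validation-split and early-stopping points, the Trainer from the Modal script above can be extended with a held-out split and transformers' EarlyStoppingCallback. A minimal sketch, reusing tokenized_dataset, model, and tokenizer from that script:

```python
from transformers import (
    TrainingArguments, Trainer, EarlyStoppingCallback, DataCollatorForLanguageModeling
)

# Hold out 10% of the tokenized data for validation
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./fine_tuned_model",
    num_train_epochs=5,
    learning_rate=2e-4,            # ~2e-4 for LoRA, ~2e-5 for full fine-tuning
    per_device_train_batch_size=4,
    evaluation_strategy="epoch",   # evaluate on the held-out split every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    # Stop if eval_loss fails to improve for two consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```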
RAG Deep Dive#
What is RAG?#
Retrieval-Augmented Generation (RAG) enhances LLM responses by dynamically retrieving relevant information from a knowledge base. It combines the power of semantic search with generative AI, allowing models to access information beyond their training data.
When to Use RAG#
✅ Perfect for:
- Knowledge bases and documentation
- Real-time data requirements
- Large document collections
- Multi-source information synthesis
- Reducing hallucinations
❌ Avoid when:
- Sub-millisecond latency required
- No external data sources
- Simple, static responses needed
- Offline environments
Production RAG Implementation with Supabase#
// rag-system.ts
import { createClient } from '@supabase/supabase-js';
import { OpenAI } from 'openai';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
interface Document {
id: string;
content: string;
metadata: Record<string, any>;
embedding?: number[];
}
class RAGSystem {
  // protected (rather than private) so subclasses such as HybridRAG below can reuse them
  protected supabase;
  protected openai;
  protected splitter;
constructor() {
this.supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_ANON_KEY!
);
this.openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
this.splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000,
chunkOverlap: 200,
});
}
// Index documents into vector store
async indexDocuments(documents: Document[]): Promise<void> {
const chunks: any[] = [];
for (const doc of documents) {
// Split document into chunks
const textChunks = await this.splitter.splitText(doc.content);
// Generate embeddings for each chunk
      for (const [chunkIndex, chunk] of textChunks.entries()) {
        const embedding = await this.generateEmbedding(chunk);
        chunks.push({
          content: chunk,
          metadata: {
            ...doc.metadata,
            source_id: doc.id,
            chunk_index: chunkIndex // position within this document, not the whole batch
          },
          embedding
        });
      }
}
// Batch insert into Supabase
const { error } = await this.supabase
.from('document_chunks')
.insert(chunks);
if (error) throw error;
}
// Generate embedding for text
private async generateEmbedding(text: string): Promise<number[]> {
const response = await this.openai.embeddings.create({
model: 'text-embedding-3-small',
input: text,
});
return response.data[0].embedding;
}
// Retrieve relevant documents
async retrieve(query: string, topK: number = 5): Promise<any[]> {
// Generate query embedding
const queryEmbedding = await this.generateEmbedding(query);
// Perform similarity search
const { data, error } = await this.supabase.rpc('match_documents', {
query_embedding: queryEmbedding,
match_threshold: 0.7,
match_count: topK
});
if (error) throw error;
return data;
}
// Generate response with context
async generateWithContext(
query: string,
systemPrompt?: string
): Promise<{
response: string;
sources: any[];
usage: any;
}> {
// Retrieve relevant documents
const relevantDocs = await this.retrieve(query);
// Build context from retrieved documents
const context = relevantDocs
.map(doc => `[Source: ${doc.metadata.source_id}]\n${doc.content}`)
.join('\n\n---\n\n');
// Create enhanced prompt
const messages = [
{
role: 'system' as const,
content: systemPrompt || `You are a helpful assistant. Use the following context to answer questions. Always cite your sources.`
},
{
role: 'user' as const,
content: `Context:\n${context}\n\nQuestion: ${query}\n\nProvide a comprehensive answer based on the context above.`
}
];
// Generate response
const completion = await this.openai.chat.completions.create({
model: 'gpt-4-turbo-preview',
messages,
temperature: 0.7,
max_tokens: 1000
});
return {
response: completion.choices[0].message.content || '',
sources: relevantDocs.map(doc => ({
id: doc.metadata.source_id,
snippet: doc.content.substring(0, 200) + '...'
})),
usage: completion.usage
};
}
}
// Usage example
export async function handleRAGQuery(query: string) {
const rag = new RAGSystem();
try {
const result = await rag.generateWithContext(query);
return {
answer: result.response,
sources: result.sources,
tokens: result.usage.total_tokens,
      cost: calculateCost(result.usage) // calculateCost: a helper you supply based on your provider's per-token pricing (not shown here)
};
} catch (error) {
console.error('RAG error:', error);
throw error;
}
}
// Supabase SQL function for similarity search
/*
CREATE OR REPLACE FUNCTION match_documents(
query_embedding vector(1536),
match_threshold float,
match_count int
)
RETURNS TABLE (
id uuid,
content text,
metadata jsonb,
similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
document_chunks.id,
document_chunks.content,
document_chunks.metadata,
1 - (document_chunks.embedding <=> query_embedding) as similarity
FROM document_chunks
WHERE 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
ORDER BY document_chunks.embedding <=> query_embedding
LIMIT match_count;
END;
$$;
*/
Advanced RAG Techniques#
1. Hybrid Search (Keyword + Semantic)#
class HybridRAG extends RAGSystem {
async hybridSearch(
query: string,
topK: number = 5
): Promise<any[]> {
// Semantic search
const semanticResults = await this.retrieve(query, topK * 2);
// Keyword search using Supabase full-text search
const { data: keywordResults } = await this.supabase
.from('document_chunks')
.select('*')
.textSearch('content', query)
.limit(topK * 2);
    // Combine and re-rank results (keywordResults is null when nothing matches)
    return this.rerank([...semanticResults, ...(keywordResults ?? [])], query, topK);
  }

  private async rerank(results: any[], query: string, topK: number): Promise<any[]> {
    // Re-rank with a cross-encoder model, or fall back to a simple score fusion
    // such as reciprocal rank fusion; this placeholder just truncates the list.
    return results.slice(0, topK);
  }
}
2. Multi-Query RAG#
// Assumes a shared RAGSystem instance (`rag`) plus two helpers you provide:
// generateQueryVariations (e.g., ask the LLM for a few rephrasings of the
// question) and deduplicateResults (e.g., keep one chunk per source_id).
async function multiQueryRAG(originalQuery: string): Promise<any[]> {
  // Generate multiple query variations
  const queryVariations = await generateQueryVariations(originalQuery);
  // Retrieve for each variation
  const allResults = await Promise.all(
    queryVariations.map(q => rag.retrieve(q, 3))
  );
  // Deduplicate and combine results
  const uniqueResults = deduplicateResults(allResults.flat());
  return uniqueResults;
}
Prompt Engineering Deep Dive#
What is Prompt Engineering?#
Prompt engineering involves structuring and customizing inputs to elicit specific responses from a pre-trained model without modifying the model’s weights.
When to Use Prompt Engineering#
- Low Cost: Achieve results without incurring the high costs of training.
- Rapid Iterations: Quickly test and deploy changes.
- Flexibility: Adapt to a variety of tasks and contexts on the fly.
Considerations#
- Accuracy: Results may vary without fine-tuning.
- Limitations: Applicable within the constraints of the current model.
- Prompt Sensitivity: Subtle changes can dramatically affect output.
Code Example#
import requests

# Placeholder endpoint: point this at your provider's completion API and adapt
# the request payload and response fields to its schema.
api_url = "https://api.example.com/v1/completions"

def generate_summary(text):
    prompt = f"Summarize the following content: {text}"
    response = requests.post(api_url, json={"prompt": prompt})
    response.raise_for_status()
    return response.json().get("summary")

summary = generate_summary("Artificial Intelligence is transforming industries.")
print(summary)
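This single-prompt call works, but outputs are usually more consistent when you add a system message and a few-shot example that pin down the expected length and tone. A sketch using the OpenAI Python client; the model name and example texts are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_summary(text: str) -> str:
    messages = [
        {"role": "system", "content": "You write one-sentence, neutral summaries."},
        # Few-shot example: shows the model the expected length and style
        {"role": "user", "content": "Summarize: The team shipped the new billing service and fixed two outages."},
        {"role": "assistant", "content": "The team launched the billing service and resolved two outages."},
        {"role": "user", "content": f"Summarize: {text}"},
    ]
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview", messages=messages, temperature=0.3
    )
    return response.choices[0].message.content or ""

print(generate_summary("Artificial Intelligence is transforming industries."))
```

Because small wording changes can shift results noticeably, keep a small set of test inputs and re-run them whenever you edit the prompt.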
Comparison Matrix#
Feature | Fine-Tuning | RAG | Prompt Engineering |
---|---|---|---|
Cost | High | Medium to High | Low |
Customization | High | Medium | Low to Medium |
Real-Time Data | No | Yes | No |
Setup Complexity | High | Medium | Low |
Performance | High | High | Variable |
Decision Framework#
Consider These Factors#
- Budget Constraints: Smaller budgets benefit from prompt engineering.
- Data Availability: RAG excels with rich, diverse data sources.
- Task Complexity: Choose fine-tuning for highly specialized tasks. (A minimal sketch combining these factors follows below.)
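Here is one way to turn that checklist into code; the categories and rules are a minimal, illustrative sketch rather than a prescription:

```python
def recommend(budget: str, data_availability: str, task_complexity: str) -> str:
    """budget: 'low' | 'medium' | 'high'
    data_availability: 'poor' | 'rich'
    task_complexity: 'general' | 'specialized'"""
    if task_complexity == "specialized" and budget == "high":
        return "fine-tuning"          # specialized task and budget to train
    if data_availability == "rich":
        return "rag"                  # lean on existing data sources
    return "prompt-engineering"       # cheapest, fastest starting point

print(recommend("low", "rich", "general"))       # -> 'rag'
print(recommend("high", "poor", "specialized"))  # -> 'fine-tuning'
```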
Conclusion#
Determining the right strategy depends on your specific requirements. For most applications:
- Use Fine-Tuning when full model customization is imperative.
- Implement RAG for dynamic information retrieval needs.
- Apply Prompt Engineering when low cost and speed are priorities.
By carefully considering your use case, data availability, and budget, you can select the most suitable approach for your AI project.