Fine-tuning vs RAG vs Prompt Engineering 2025: Complete Decision Guide

Steven
Software developer focusing on system-level debugging, performance optimization, and technical problem-solving
Building Production AI Systems - This article is part of a series.


Prerequisites

  • Basic understanding of LLMs and their capabilities
  • Python 3.8+ and Node.js 18+ installed
  • Familiarity with API usage and basic ML concepts
  • Access to LLM APIs (OpenAI, Anthropic, or similar)
  • Basic knowledge of vector databases (for RAG)

Introduction

In the evolving landscape of AI development, choosing the right optimization strategy for large language models (LLMs) is crucial. With multiple approaches available, understanding when to use fine-tuning, retrieval-augmented generation (RAG), or prompt engineering can significantly impact your system’s performance, cost, and time to market.

The Three Approaches at a Glance

  1. Fine-Tuning: Retraining the model on your specific data
  2. RAG: Augmenting prompts with retrieved context
  3. Prompt Engineering: Crafting optimal prompts without model changes

graph TD
    A[Optimization Strategies] -->|Fine-Tuning| B[Specialized Applications]
    A -->|RAG| C[Data-Heavy Tasks]
    A -->|Prompt Engineering| D[General Use Cases]
    
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#9f9
    style D fill:#ff9

Quick Decision Guide

interface DecisionCriteria {
  budget: 'low' | 'medium' | 'high';
  dataVolume: 'small' | 'medium' | 'large';
  updateFrequency: 'static' | 'occasional' | 'frequent';
  latencyRequirement: 'realtime' | 'nearRealtime' | 'batch';
  accuracy: 'acceptable' | 'high' | 'critical';
}

function recommendApproach(criteria: DecisionCriteria): string {
  // Critical accuracy + static data = fine-tuning
  if (criteria.accuracy === 'critical' && criteria.updateFrequency === 'static') {
    return 'fine-tuning';
  }
  
  // Large dynamic data + updates = RAG
  if (criteria.dataVolume === 'large' && criteria.updateFrequency === 'frequent') {
    return 'rag';
  }
  
  // Low budget + acceptable accuracy = prompt engineering
  if (criteria.budget === 'low' && criteria.accuracy === 'acceptable') {
    return 'prompt-engineering';
  }
  
  // Default to hybrid approach
  return 'hybrid-rag-prompt';
}

Quick Decision Matrix

| Scenario | Recommended Approach | Why |
|---|---|---|
| Customer support chatbot | Prompt Engineering + RAG | Dynamic responses with knowledge base |
| Medical diagnosis assistant | Fine-tuning | High accuracy critical, domain-specific |
| Code generation tool | Fine-tuning + Prompt Engineering | Specialized syntax with flexible outputs |
| Document Q&A system | RAG | Large document corpus, real-time updates |
| Creative writing assistant | Prompt Engineering | Flexibility and creativity paramount |

Fine-Tuning Deep Dive

What is Fine-Tuning?

Fine-tuning involves updating a pre-trained model’s weights using your specific dataset. This creates a specialized version of the model that excels at your particular use case while retaining general language understanding.

When to Use Fine-Tuning

Perfect for:

  • Domain-specific language (legal, medical, technical)
  • Consistent output format requirements
  • Proprietary knowledge integration
  • Style/tone consistency (brand voice)
  • High-accuracy classification tasks

Avoid when:

  • Data changes frequently
  • Budget is limited
  • Need real-time information
  • Small dataset (<1000 examples)
  • Quick iterations needed

Production Implementation with Modal

# fine_tune_modal.py
import modal
from modal import Image, Secret, gpu
import json
from typing import Dict, List
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
import wandb

# Define Modal app
app = modal.App("llm-fine-tuning")

# Custom image with dependencies
training_image = (
    Image.debian_slim()
    .pip_install(
        "transformers==4.36.0",
        "torch==2.1.0",
        "datasets==2.14.0",
        "accelerate==0.24.0",
        "bitsandbytes==0.41.0",
        "wandb==0.15.0",
        "peft==0.6.0"  # For LoRA fine-tuning
    )
)

@app.function(
    image=training_image,
    gpu=gpu.A100(memory=40),  # Use A100 for faster training
    secrets=[Secret.from_name("huggingface"), Secret.from_name("wandb")],
    timeout=3600,  # 1 hour timeout
)
def fine_tune_model(
    model_name: str = "meta-llama/Llama-2-7b-hf",
    dataset_path: str = "./training_data.jsonl",
    output_dir: str = "./fine_tuned_model",
    use_lora: bool = True
):
    # Initialize wandb for experiment tracking
    wandb.init(project="llm-fine-tuning", name=f"fine_tune_{model_name.split('/')[-1]}")
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    
    if use_lora:
        from peft import LoraConfig, get_peft_model, TaskType
        
        # Load model in 8-bit for memory efficiency
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            load_in_8bit=True,
            device_map="auto",
            torch_dtype=torch.float16
        )
        
        # Configure LoRA
        lora_config = LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            lora_dropout=0.05,
            bias="none",
            task_type=TaskType.CAUSAL_LM
        )
        
        model = get_peft_model(model, lora_config)
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    # Load and preprocess dataset
    with open(dataset_path, 'r') as f:
        data = [json.loads(line) for line in f]
    
    def preprocess_function(examples):
        # Format: "### Human: {prompt}\n### Assistant: {response}"
        texts = []
        for prompt, response in zip(examples['prompt'], examples['response']):
            text = f"### Human: {prompt}\n### Assistant: {response}"
            texts.append(text)
        
        model_inputs = tokenizer(
            texts,
            max_length=512,
            truncation=True,
            padding=True
        )
        model_inputs["labels"] = model_inputs["input_ids"].copy()
        return model_inputs
    
    dataset = Dataset.from_dict({
        'prompt': [d['prompt'] for d in data],
        'response': [d['response'] for d in data]
    })
    
    tokenized_dataset = dataset.map(
        preprocess_function,
        batched=True,
        remove_columns=dataset.column_names
    )
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="no",
        report_to="wandb",
        push_to_hub=True,
        hub_model_id=f"{model_name.split('/')[-1]}-fine-tuned"
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False
        )
    )
    
    # Train
    trainer.train()
    
    # Save model
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    
    # Log metrics
    wandb.log({
        "final_loss": trainer.state.log_history[-1]['loss'],
        "total_steps": trainer.state.global_step
    })
    
    return {"status": "success", "model_path": output_dir}

# Deploy fine-tuning job
@app.local_entrypoint()
def main():
    # Prepare your training data
    training_data = [
        {"prompt": "What is machine learning?", "response": "Machine learning is..."},
        # Add more training examples
    ]
    
    with open("training_data.jsonl", "w") as f:
        for item in training_data:
            f.write(json.dumps(item) + "\n")
    
    # Run fine-tuning on Modal
    result = fine_tune_model.remote(
        model_name="microsoft/phi-2",
        dataset_path="training_data.jsonl",
        use_lora=True
    )
    
    print(f"Fine-tuning completed: {result}")

Fine-Tuning Best Practices

  1. Data Quality: Ensure high-quality, diverse training data
  2. Validation Split: Always hold out a validation set (see the sketch after this list)
  3. Learning Rate: Start around 2e-4 for LoRA or 2e-5 for full fine-tuning, then adjust based on loss
  4. Batch Size: Balance GPU memory against training stability
  5. Early Stopping: Monitor validation loss to prevent overfitting
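
The Modal example above trains on the full dataset with evaluation disabled. Here is a minimal sketch of points 2 and 5, holding out a validation split and stopping on validation loss; it reuses names from the Modal example (model, tokenizer, tokenized_dataset), and the step counts and patience values are illustrative.

# Validation split + early stopping (sketch; reuses objects from fine_tune_model)
from transformers import (
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
)

# 90/10 train/validation split of the tokenized dataset
splits = tokenized_dataset.train_test_split(test_size=0.1, seed=42)

training_args = TrainingArguments(
    output_dir="./fine_tuned_model",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,                 # LoRA-style rate; ~2e-5 is a safer start for full fine-tuning
    fp16=True,
    evaluation_strategy="steps",        # evaluate on the held-out split...
    eval_steps=50,
    save_strategy="steps",              # ...and checkpoint on the same schedule
    save_steps=50,
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)
trainer.train()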

RAG Deep Dive

What is RAG?

Retrieval-Augmented Generation (RAG) enhances LLM responses by dynamically retrieving relevant information from a knowledge base. It combines the power of semantic search with generative AI, allowing models to access information beyond their training data.

When to Use RAG

Perfect for:

  • Knowledge bases and documentation
  • Real-time data requirements
  • Large document collections
  • Multi-source information synthesis
  • Reducing hallucinations

Avoid when:

  • Sub-millisecond latency required
  • No external data sources
  • Simple, static responses needed
  • Offline environments

Production RAG Implementation with Supabase

// rag-system.ts
import { createClient } from '@supabase/supabase-js';
import { OpenAI } from 'openai';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

interface Document {
  id: string;
  content: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

class RAGSystem {
  protected supabase; // protected (not private) so HybridRAG below can reuse the client
  private openai;
  private splitter;

  constructor() {
    this.supabase = createClient(
      process.env.SUPABASE_URL!,
      process.env.SUPABASE_ANON_KEY!
    );
    
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
    
    this.splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200,
    });
  }

  // Index documents into vector store
  async indexDocuments(documents: Document[]): Promise<void> {
    const chunks: any[] = [];
    
    for (const doc of documents) {
      // Split document into chunks
      const textChunks = await this.splitter.splitText(doc.content);
      
      // Generate embeddings for each chunk
      for (const chunk of textChunks) {
        const embedding = await this.generateEmbedding(chunk);
        
        chunks.push({
          content: chunk,
          metadata: {
            ...doc.metadata,
            source_id: doc.id,
            chunk_index: chunks.length
          },
          embedding
        });
      }
    }
    
    // Batch insert into Supabase
    const { error } = await this.supabase
      .from('document_chunks')
      .insert(chunks);
    
    if (error) throw error;
  }

  // Generate embedding for text
  private async generateEmbedding(text: string): Promise<number[]> {
    const response = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    
    return response.data[0].embedding;
  }

  // Retrieve relevant documents
  async retrieve(query: string, topK: number = 5): Promise<any[]> {
    // Generate query embedding
    const queryEmbedding = await this.generateEmbedding(query);
    
    // Perform similarity search
    const { data, error } = await this.supabase.rpc('match_documents', {
      query_embedding: queryEmbedding,
      match_threshold: 0.7,
      match_count: topK
    });
    
    if (error) throw error;
    return data;
  }

  // Generate response with context
  async generateWithContext(
    query: string,
    systemPrompt?: string
  ): Promise<{
    response: string;
    sources: any[];
    usage: any;
  }> {
    // Retrieve relevant documents
    const relevantDocs = await this.retrieve(query);
    
    // Build context from retrieved documents
    const context = relevantDocs
      .map(doc => `[Source: ${doc.metadata.source_id}]\n${doc.content}`)
      .join('\n\n---\n\n');
    
    // Create enhanced prompt
    const messages = [
      {
        role: 'system' as const,
        content: systemPrompt || `You are a helpful assistant. Use the following context to answer questions. Always cite your sources.`
      },
      {
        role: 'user' as const,
        content: `Context:\n${context}\n\nQuestion: ${query}\n\nProvide a comprehensive answer based on the context above.`
      }
    ];
    
    // Generate response
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages,
      temperature: 0.7,
      max_tokens: 1000
    });
    
    return {
      response: completion.choices[0].message.content || '',
      sources: relevantDocs.map(doc => ({
        id: doc.metadata.source_id,
        snippet: doc.content.substring(0, 200) + '...'
      })),
      usage: completion.usage
    };
  }
}

// Usage example
export async function handleRAGQuery(query: string) {
  const rag = new RAGSystem();
  
  try {
    const result = await rag.generateWithContext(query);
    
    return {
      answer: result.response,
      sources: result.sources,
      tokens: result.usage.total_tokens,
      cost: calculateCost(result.usage) // assumed helper that maps token usage to dollars
    };
  } catch (error) {
    console.error('RAG error:', error);
    throw error;
  }
}

// Supabase SQL function for similarity search
/*
CREATE OR REPLACE FUNCTION match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
RETURNS TABLE (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  SELECT
    document_chunks.id,
    document_chunks.content,
    document_chunks.metadata,
    1 - (document_chunks.embedding <=> query_embedding) as similarity
  FROM document_chunks
  WHERE 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
  ORDER BY document_chunks.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;
*/

Advanced RAG Techniques

1. Hybrid Search (Keyword + Semantic)

class HybridRAG extends RAGSystem {
  async hybridSearch(
    query: string,
    topK: number = 5
  ): Promise<any[]> {
    // Semantic search
    const semanticResults = await this.retrieve(query, topK * 2);
    
    // Keyword search using Supabase full-text search
    const { data: keywordResults } = await this.supabase
      .from('document_chunks')
      .select('*')
      .textSearch('content', query)
      .limit(topK * 2);
    
    // Combine and re-rank results
    return this.rerank([...semanticResults, ...(keywordResults ?? [])], query, topK); // guard against a null keyword result set
  }
  
  private async rerank(results: any[], query: string, topK: number): Promise<any[]> {
    // Placeholder: plug in a cross-encoder or re-ranking API here.
    // A Python cross-encoder sketch follows this class as one way to score query/chunk pairs.
    return results.slice(0, topK);
  }
}
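
The rerank placeholder above can be backed by a cross-encoder that scores each query/chunk pair directly. Below is a small Python sketch of that scoring step using sentence-transformers, for example as a separate re-ranking script or microservice rather than part of the TypeScript class; the model name and the shape of the candidate dicts (a content key) are assumptions.

# Cross-encoder re-ranking sketch (Python)
from typing import Dict, List
from sentence_transformers import CrossEncoder

# Small MS MARCO cross-encoder; swap in any re-ranking model you prefer
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    """Score every (query, chunk) pair and keep the top_k highest-scoring chunks."""
    pairs = [(query, candidate["content"]) for candidate in candidates]
    scores = cross_encoder.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]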

2. Multi-Query RAG

async function multiQueryRAG(originalQuery: string): Promise<any[]> {
  // Generate multiple phrasings of the original question.
  // generateQueryVariations and deduplicateResults are assumed helpers (a Python
  // sketch of both follows this block); `rag` is a RAGSystem instance.
  const queryVariations = await generateQueryVariations(originalQuery);
  
  // Retrieve for each variation
  const allResults = await Promise.all(
    queryVariations.map(q => rag.retrieve(q, 3))
  );
  
  // Deduplicate and combine results
  const uniqueResults = deduplicateResults(allResults.flat());
  
  return uniqueResults;
}
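
The two helpers above are left undefined. A rough Python sketch of both follows: one asks the model for rephrasings of the question, the other drops duplicate chunks by source and content. The model name, prompt wording, and the chunk dict shape (metadata.source_id, content) are assumptions.

# Multi-query helpers (Python sketch)
from typing import Dict, List
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_query_variations(query: str, n: int = 3) -> List[str]:
    """Ask the model to rephrase the question from different angles."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model works
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question in different ways. Return one rewrite per line and nothing else."},
            {"role": "user", "content": f"Question: {query}\nGive {n} rewrites."},
        ],
        temperature=0.7,
    )
    rewrites = [line.strip() for line in completion.choices[0].message.content.splitlines() if line.strip()]
    return [query] + rewrites[:n]  # always keep the original query

def deduplicate_results(chunks: List[Dict]) -> List[Dict]:
    """Drop chunks that share the same source document and content."""
    seen, unique = set(), []
    for chunk in chunks:
        key = (chunk["metadata"]["source_id"], chunk["content"])
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique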

Prompt Engineering Deep Dive

What is Prompt Engineering?

Prompt engineering involves structuring and customizing inputs to elicit specific responses from a pre-trained model without modifying the model’s weights.

When to Use Prompt Engineering

  • Low Cost: Achieve results without incurring the high costs of training.
  • Rapid Iterations: Quickly test and deploy changes.
  • Flexibility: Adapt to a variety of tasks and contexts on the fly.

Considerations

  • Accuracy: Results may vary without fine-tuning.
  • Limitations: Applicable within the constraints of the current model.
  • Prompt Sensitivity: Subtle changes can dramatically affect output.

Code Example

import requests

# Placeholder endpoint: replace with your LLM provider's summarization/completion URL
API_URL = "https://api.example.com/v1/summarize"

def generate_summary(text):
    prompt = f"Summarize the following content: {text}"
    response = requests.post(API_URL, json={"prompt": prompt}, timeout=30)
    response.raise_for_status()
    return response.json().get("summary")

summary = generate_summary("Artificial Intelligence is transforming industries.")
print(summary)
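
The snippet above keeps the endpoint abstract. Here is a slightly richer sketch of the same task that applies a system prompt, one few-shot example, and explicit output constraints through the OpenAI chat API; the model choice and example wording are illustrative.

# Few-shot prompt engineering sketch (OpenAI Python SDK)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(text: str) -> str:
    messages = [
        # System prompt pins down role, length, and tone
        {"role": "system",
         "content": "You summarize text in exactly one sentence, in plain language, with no preamble."},
        # One few-shot example to anchor the expected format
        {"role": "user",
         "content": "Summarize: Cloud computing lets teams rent servers on demand instead of buying hardware."},
        {"role": "assistant",
         "content": "Cloud computing replaces owned hardware with servers rented on demand."},
        {"role": "user", "content": f"Summarize: {text}"},
    ]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=messages,
        temperature=0.2,       # low temperature for consistent summaries
        max_tokens=80,
    )
    return completion.choices[0].message.content.strip()

print(summarize("Artificial Intelligence is transforming industries."))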

Comparison Matrix

| Feature | Fine-Tuning | RAG | Prompt Engineering |
|---|---|---|---|
| Cost | High | Medium to High | Low |
| Customization | High | Medium | Low to Medium |
| Real-Time Data | No | Yes | No |
| Setup Complexity | High | Medium | Low |
| Performance | High | High | Variable |

Decision Framework

Consider These Factors

  • Budget Constraints: Smaller budgets benefit from prompt engineering.
  • Data Availability: RAG excels with rich, diverse data sources.
  • Task Complexity: Choose fine-tuning for highly specialized tasks (an illustrative scoring sketch follows this list).
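
To make these trade-offs concrete, here is an illustrative sketch that maps the three factors above to a recommendation. The thresholds and the ordering of the checks are assumptions, not a rule; adjust them to your own constraints.

# Decision framework sketch (illustrative thresholds and ordering)
def choose_approach(budget: str, data_availability: str, task_complexity: str) -> str:
    """budget: 'low'|'medium'|'high'; data_availability: 'sparse'|'rich';
    task_complexity: 'general'|'specialized'."""
    if task_complexity == "specialized" and budget == "high":
        return "fine-tuning"           # specialized task and the budget to retrain
    if data_availability == "rich":
        return "rag"                   # rich, changing data favors retrieval
    if budget == "low":
        return "prompt-engineering"    # cheapest path to a working baseline
    return "hybrid-rag-prompt"         # default: RAG plus careful prompting

print(choose_approach(budget="low", data_availability="sparse", task_complexity="general"))
# -> prompt-engineering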

Conclusion

Determining the right strategy depends on your specific requirements. For most applications:

  • Use Fine-Tuning when full model customization is imperative.
  • Implement RAG for dynamic information retrieval needs.
  • Apply Prompt Engineering when low cost and speed are priorities.

By carefully considering your use case, data availability, and budget, you can select the most suitable approach for your AI project.
