LLM Monitoring Guide 2025: Complete Tutorial for Production Observability

Author
Steven
Software developer focusing on system-level debugging, performance optimization, and technical problem-solving
Building Production AI Systems - This article is part of a series.

Master the art of monitoring Large Language Model applications in production. This comprehensive tutorial covers everything from basic metrics to advanced observability strategies, helping you build reliable, cost-effective, and high-quality AI systems.


Why LLM Observability Matters
#

Large Language Models (LLMs) introduce unique challenges for monitoring and debugging. Unlike traditional applications where behavior is deterministic, LLMs are probabilistic systems that can produce different outputs for the same input. This makes observability not just helpful, but essential for production deployments.

graph TB
    A[User Input] --> B[LLM Application]
    B --> C[Token Usage]
    B --> D[Latency]
    B --> E[Cost]
    B --> F[Quality]
    B --> G[Errors]
    
    C --> H[Observability Platform]
    D --> H
    E --> H
    F --> H
    G --> H
    
    H --> I[Insights & Alerts]
    
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style H fill:#bbf,stroke:#333,stroke-width:2px

Key Challenges
#

  1. Non-deterministic outputs: Same prompt, different responses
  2. Cost management: Token usage can spiral out of control
  3. Quality assurance: How do you measure “good” responses?
  4. Performance variability: Latency can vary significantly
  5. Chain complexity: Multi-step LLM pipelines are hard to debug

Prerequisites
#

Before implementing LLM observability, ensure you have:

  • Basic understanding of LLM applications and APIs
  • Production LLM deployment or development environment
  • Access to monitoring tools (we’ll cover free options)
  • Understanding of metrics and logging concepts

Required Tools
#

# Check Node.js version
node --version  # Should be 18.x or higher

# Install dependencies
npm install openai langfuse @opentelemetry/api @sentry/node

Quick Start: Basic LLM Monitoring
#

Let’s start with a simple monitoring setup that tracks the essential metrics:

// basic-monitoring.ts
import { OpenAI } from 'openai';
import { randomUUID } from 'node:crypto';

// Initialize the OpenAI client once and reuse it across requests
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface LLMMetrics {
  requestId: string;
  timestamp: Date;
  model: string;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  latency: number;
  cost: number;
  error?: string;
}

class BasicLLMMonitor {
  private metrics: LLMMetrics[] = [];
  
  async monitoredCompletion(prompt: string, model = 'gpt-3.5-turbo') {
    const startTime = Date.now();
    const requestId = randomUUID();
    
    try {
      const response = await openai.chat.completions.create({
        model,
        messages: [{ role: 'user', content: prompt }],
      });
      
      const metrics: LLMMetrics = {
        requestId,
        timestamp: new Date(),
        model,
        promptTokens: response.usage?.prompt_tokens || 0,
        completionTokens: response.usage?.completion_tokens || 0,
        totalTokens: response.usage?.total_tokens || 0,
        latency: Date.now() - startTime,
        cost: this.calculateCost(response.usage, model),
      };
      
      this.metrics.push(metrics);
      console.log('LLM Metrics:', metrics);
      
      return response.choices[0].message.content;
    } catch (error) {
      this.metrics.push({
        requestId,
        timestamp: new Date(),
        model,
        promptTokens: 0,
        completionTokens: 0,
        totalTokens: 0,
        latency: Date.now() - startTime,
        cost: 0,
        error: error.message,
      });
      throw error;
    }
  }
  
  private calculateCost(usage: any, model: string): number {
    if (!usage) return 0;
    // USD per 1K tokens (check current provider pricing; these values change)
    const pricing: Record<string, { prompt: number; completion: number }> = {
      'gpt-3.5-turbo': { prompt: 0.0005, completion: 0.0015 },
      'gpt-4': { prompt: 0.03, completion: 0.06 },
      'gpt-4-turbo': { prompt: 0.01, completion: 0.03 },
    };
    
    const rates = pricing[model] || pricing['gpt-3.5-turbo'];
    return (usage.prompt_tokens * rates.prompt + 
            usage.completion_tokens * rates.completion) / 1000;
  }
  
  getMetricsSummary() {
    return {
      totalRequests: this.metrics.length,
      totalCost: this.metrics.reduce((sum, m) => sum + m.cost, 0),
      averageLatency: this.metrics.reduce((sum, m) => sum + m.latency, 0) / this.metrics.length,
      errorRate: this.metrics.filter(m => m.error).length / this.metrics.length,
    };
  }
}

// Usage
const monitor = new BasicLLMMonitor();
const response = await monitor.monitoredCompletion("Explain LLM monitoring");
console.log(monitor.getMetricsSummary());

Core Metrics to Monitor
#

1. Performance Metrics
#

interface LLMPerformanceMetrics {
  latency: {
    p50: number;  // Median latency
    p95: number;  // 95th percentile
    p99: number;  // 99th percentile
  };
  throughput: number;     // Requests per second
  concurrency: number;    // Parallel requests
  queueDepth: number;     // Waiting requests
}
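
Percentile latencies are derived from the raw per-request samples you already collect. A minimal sketch of that calculation (the helper names and sample values are illustrative, not part of any library):

// Compute a percentile from recorded latency samples (milliseconds), nearest-rank method
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

function summarizeLatency(samples: number[]) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}

// Example: latencies collected by BasicLLMMonitor or your own pipeline
console.log(summarizeLatency([320, 410, 980, 1500, 620, 4700, 390]));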

2. Cost Metrics
#

interface LLMCostMetrics {
  tokens: {
    prompt: number;      // Input tokens
    completion: number;  // Output tokens
    total: number;       // Total tokens
  };
  cost: {
    perRequest: number;  // Cost per request
    perUser: number;     // Cost per user
    total: number;       // Total cost
  };
  model: string;         // Model used (gpt-4, claude-3, etc.)
}
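
Per-request and per-user cost roll up naturally from individual request records. A small sketch, assuming records shaped roughly like the interface above (the RequestRecord type and field names are illustrative):

interface RequestRecord {
  userId: string;
  model: string;
  cost: number; // USD for this single request
}

// Aggregate total, per-request average, and per-user cost from raw records
function aggregateCosts(records: RequestRecord[]) {
  const perUser = new Map<string, number>();
  let total = 0;

  for (const r of records) {
    total += r.cost;
    perUser.set(r.userId, (perUser.get(r.userId) || 0) + r.cost);
  }

  return {
    total,
    perRequest: records.length ? total / records.length : 0,
    perUser: Object.fromEntries(perUser),
  };
}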

3. Quality Metrics
#

interface LLMQualityMetrics {
  sentiment: number;       // User satisfaction
  accuracy: number;        // Correctness of responses
  relevance: number;       // How relevant the response is
  safety: number;          // Content safety score
  feedback: {
    positive: number;
    negative: number;
    neutral: number;
  };
}
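
The feedback counts translate into a simple satisfaction score. A sketch, assuming thumbs-up/down style feedback events (the names are illustrative):

type Feedback = 'positive' | 'negative' | 'neutral';

// Roll raw feedback events up into counts and a 0-1 sentiment score
function summarizeFeedback(events: Feedback[]) {
  const counts = { positive: 0, negative: 0, neutral: 0 };
  for (const f of events) counts[f]++;

  const rated = counts.positive + counts.negative;
  return {
    feedback: counts,
    // Share of positive votes among rated responses (neutral excluded)
    sentiment: rated ? counts.positive / rated : 0,
  };
}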

Observability Tools Comparison
#

1. Langfuse - Open Source LLM Observability
#

Langfuse is an open-source platform specifically designed for LLM observability.

Implementation
#

import { Langfuse } from 'langfuse';

// Initialize Langfuse
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: 'https://cloud.langfuse.com', // or self-hosted URL
});

// Trace LLM calls
async function tracedLLMCall(prompt: string, userId: string) {
  // Create a trace
  const trace = langfuse.trace({
    name: 'chat-completion',
    userId: userId,
    metadata: {
      environment: 'production',
      version: '1.0.0',
    },
  });

  // Track the generation
  const generation = trace.generation({
    name: 'gpt-4-generation',
    model: 'gpt-4',
    modelParameters: {
      temperature: 0.7,
      maxTokens: 500,
    },
    input: prompt,
  });

  try {
    // Make the actual LLM call
    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
      temperature: 0.7,
      max_tokens: 500,
    });

    // End generation with output
    generation.end({
      output: response.choices[0].message.content,
      usage: {
        promptTokens: response.usage?.prompt_tokens,
        completionTokens: response.usage?.completion_tokens,
      },
    });

    return response.choices[0].message.content;
  } catch (error) {
    generation.end({
      statusMessage: error.message,
      level: 'ERROR',
    });
    throw error;
  } finally {
    // Ensure trace is flushed
    await langfuse.flush();
  }
}

// Score responses for quality tracking
async function scoreResponse(traceId: string, score: number, comment?: string) {
  langfuse.score({
    traceId: traceId,
    name: 'user-feedback',
    value: score,
    comment: comment,
  });
}

Self-Hosting Langfuse with Docker
#

# docker-compose.yml
version: '3.8'
services:
  langfuse:
    image: langfuse/langfuse:latest
    environment:
      DATABASE_URL: postgresql://user:password@postgres:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000
      NEXTAUTH_SECRET: your-secret-key
      SALT: your-salt-key
    ports:
      - "3000:3000"
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_data:/var/lib/postgresql/data

volumes:
  langfuse_data:
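
Once the stack is running, point the SDK from the previous example at your own instance by changing baseUrl; the keys come from the project you create in the self-hosted UI:

import { Langfuse } from 'langfuse';

// Same client as before, but targeting the self-hosted deployment
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: 'http://localhost:3000', // the port mapped in docker-compose.yml
});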

2. Helicone - LLM Analytics Platform
#

Helicone provides analytics and caching for LLM applications.

Integration via Proxy
#

Helicone acts as a transparent proxy between your application and OpenAI, automatically capturing metrics without code changes. This TypeScript example shows how to configure OpenAI to route through Helicone’s proxy endpoint. The configuration includes:

  • Base Path: Points to Helicone’s proxy instead of OpenAI directly
  • Authentication: Uses your Helicone API key for access
  • Properties: Custom metadata for segmenting analytics by app and environment
  • User Tracking: Associates requests with specific users for usage analysis
  • Caching: Enables response caching to reduce costs and latency

// Use Helicone as a proxy for OpenAI (shown here with the legacy openai v3 SDK)
import { Configuration, OpenAIApi } from 'openai';

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
  basePath: 'https://oai.hconeai.com/v1',
  baseOptions: {
    // Default headers sent with every request through the proxy
    headers: {
      'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
      'Helicone-Property-App': 'my-app',
      'Helicone-Property-Environment': 'production',
    },
  },
});

const openai = new OpenAIApi(configuration);

// Use OpenAI as normal - Helicone automatically logs
async function generateText(prompt: string, userId: string) {
  const response = await openai.createChatCompletion(
    {
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
      user: userId, // Helicone tracks per-user usage
    },
    {
      // Per-request Helicone headers go in the Axios request config (second argument)
      headers: {
        'Helicone-Property-UserId': userId,
        'Helicone-Cache-Enabled': 'true', // Enable caching
      },
    }
  );

  return response.data.choices[0].message?.content;
}

// Custom properties for better analytics
async function taggedGeneration(prompt: string, tags: Record<string, string>) {
  const headers = Object.entries(tags).reduce((acc, [key, value]) => {
    acc[`Helicone-Property-${key}`] = value;
    return acc;
  }, {} as Record<string, string>);

  const response = await openai.createChatCompletion(
    {
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }],
    },
    { headers } // custom Helicone-Property-* headers as Axios request config
  );

  return response.data;
}

3. OpenLLMetry - OpenTelemetry for LLMs
#

OpenLLMetry extends OpenTelemetry for LLM-specific observability.

Implementation with OpenTelemetry
#

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';
import * as Sentry from '@sentry/node';

// Tracer used by the custom LLM instrumentation below
const tracer = trace.getTracer('llm-service');

// Initialize OpenTelemetry with Sentry
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'llm-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations(),
    // Custom LLM wrapper defined below (simplified; not a full OTel Instrumentation class)
    new LLMInstrumentation({
      capturePrompt: true,
      captureCompletion: true,
      captureTokenUsage: true,
    }),
  ],
});

// Initialize Sentry with OpenTelemetry integration
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  integrations: [
    new Sentry.Integrations.OpenTelemetry({
      startIncomingSpanMiddleware: true,
    }),
  ],
  tracesSampleRate: 1.0,
});

sdk.start();

// Custom LLM instrumentation
class LLMInstrumentation {
  constructor(private config: any) {}

  instrument(llmClient: any) {
    const original = llmClient.complete;
    
    llmClient.complete = async function(...args: any[]) {
      const span = tracer.startSpan('llm.completion', {
        attributes: {
          'llm.model': args[0].model,
          'llm.temperature': args[0].temperature,
          'llm.max_tokens': args[0].max_tokens,
          'llm.prompt_length': args[0].prompt.length,
        },
      });

      try {
        const result = await original.apply(this, args);
        
        span.setAttributes({
          'llm.completion_length': result.text.length,
          'llm.prompt_tokens': result.usage.prompt_tokens,
          'llm.completion_tokens': result.usage.completion_tokens,
          'llm.total_tokens': result.usage.total_tokens,
        });

        return result;
      } catch (error) {
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    };
  }
}

4. Sentry for LLM Error Tracking
#

Sentry provides excellent error tracking and performance monitoring for LLM applications.

Advanced Sentry Integration
#

import * as Sentry from '@sentry/node';
import { ProfilingIntegration } from '@sentry/profiling-node';

// Initialize Sentry with LLM-specific configuration
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  integrations: [
    new ProfilingIntegration(),
    new Sentry.Integrations.Http({ tracing: true }),
  ],
  tracesSampleRate: 1.0,
  profilesSampleRate: 1.0,
  beforeSend(event, hint) {
    // Redact sensitive information from LLM prompts
    if (event.extra?.prompt) {
      event.extra.prompt = redactSensitiveInfo(event.extra.prompt);
    }
    return event;
  },
});

// LLM-specific error boundary
class LLMErrorBoundary {
  static async wrap<T>(
    operation: () => Promise<T>,
    context: {
      model: string;
      prompt: string;
      userId: string;
      [key: string]: any;
    }
  ): Promise<T> {
    const transaction = Sentry.startTransaction({
      op: 'llm.request',
      name: `LLM ${context.model} Request`,
    });

    Sentry.getCurrentHub().configureScope((scope) => {
      scope.setContext('llm', {
        model: context.model,
        promptLength: context.prompt.length,
        userId: context.userId,
      });
    });

    try {
      const result = await operation();
      transaction.setStatus('ok');
      return result;
    } catch (error) {
      // Capture specific LLM errors with context
      if (error.code === 'rate_limit_exceeded') {
        Sentry.captureException(error, {
          level: 'warning',
          tags: {
            llm_error_type: 'rate_limit',
            model: context.model,
          },
          extra: {
            ...context,
            prompt: context.prompt.substring(0, 200), // Truncate for privacy
          },
        });
      } else {
        Sentry.captureException(error);
      }
      
      transaction.setStatus('internal_error');
      throw error;
    } finally {
      transaction.finish();
    }
  }
}

// Usage
const response = await LLMErrorBoundary.wrap(
  () => generateCompletion(prompt),
  {
    model: 'gpt-4',
    prompt: userPrompt,
    userId: user.id,
    requestId: generateRequestId(),
  }
);

Building Custom Observability
#

1. Metrics Collection Pipeline
#

import { EventEmitter } from 'events';
import { InfluxDB, Point } from '@influxdata/influxdb-client';

class LLMMetricsCollector extends EventEmitter {
  private influx: InfluxDB;
  private writeApi: any;

  constructor(config: {
    url: string;
    token: string;
    org: string;
    bucket: string;
  }) {
    super();
    this.influx = new InfluxDB({
      url: config.url,
      token: config.token,
    });
    this.writeApi = this.influx.getWriteApi(config.org, config.bucket);
  }

  trackRequest(metrics: {
    model: string;
    latency: number;
    tokens: { prompt: number; completion: number };
    cost: number;
    userId: string;
    success: boolean;
    error?: string;
  }) {
    const point = new Point('llm_request')
      .tag('model', metrics.model)
      .tag('user_id', metrics.userId)
      .tag('success', String(metrics.success))
      .floatField('latency', metrics.latency)
      .intField('prompt_tokens', metrics.tokens.prompt)
      .intField('completion_tokens', metrics.tokens.completion)
      .floatField('cost', metrics.cost)
      .timestamp(new Date());

    if (metrics.error) {
      point.tag('error_type', metrics.error);
    }

    this.writeApi.writePoint(point);
    this.emit('metrics', metrics);
  }

  async flush() {
    await this.writeApi.flush();
  }
}

// Middleware for automatic tracking
function createLLMMiddleware(collector: LLMMetricsCollector) {
  return async (req: any, res: any, next: any) => {
    const startTime = Date.now();
    
    // Intercept LLM calls
    const originalJson = res.json;
    res.json = function(data: any) {
      const latency = Date.now() - startTime;
      
      if (data.usage) {
        collector.trackRequest({
          model: req.body.model || 'unknown',
          latency,
          tokens: {
            prompt: data.usage.prompt_tokens,
            completion: data.usage.completion_tokens,
          },
          cost: calculateCost(data.usage, req.body.model),
          userId: req.user?.id || 'anonymous',
          success: true,
        });
      }
      
      return originalJson.call(this, data);
    };
    
    next();
  };
}

2. Real-time Monitoring Dashboard
#

import { WebSocket } from 'ws';
import { Gauge, Counter, Histogram } from 'prom-client';

class RealtimeMonitor {
  private wsClients = new Set<WebSocket>();
  
  // Prometheus metrics
  private requestCounter = new Counter({
    name: 'llm_requests_total',
    help: 'Total number of LLM requests',
    labelNames: ['model', 'status'],
  });

  private tokenGauge = new Gauge({
    name: 'llm_tokens_used',
    help: 'Number of tokens used',
    labelNames: ['model', 'type'],
  });

  private latencyHistogram = new Histogram({
    name: 'llm_request_duration_seconds',
    help: 'LLM request latency',
    labelNames: ['model'],
    buckets: [0.1, 0.5, 1, 2, 5, 10],
  });

  private costGauge = new Gauge({
    name: 'llm_cost_dollars',
    help: 'Cost in dollars',
    labelNames: ['model', 'user_id'],
  });

  broadcast(data: any) {
    const message = JSON.stringify({
      timestamp: new Date().toISOString(),
      ...data,
    });

    this.wsClients.forEach((client) => {
      if (client.readyState === WebSocket.OPEN) {
        client.send(message);
      }
    });
  }

  updateMetrics(event: any) {
    // Update Prometheus metrics
    this.requestCounter.inc({
      model: event.model,
      status: event.success ? 'success' : 'error',
    });

    this.tokenGauge.set(
      { model: event.model, type: 'prompt' },
      event.tokens.prompt
    );
    this.tokenGauge.set(
      { model: event.model, type: 'completion' },
      event.tokens.completion
    );

    this.latencyHistogram.observe(
      { model: event.model },
      event.latency / 1000
    );

    this.costGauge.set(
      { model: event.model, user_id: event.userId },
      event.cost
    );

    // Broadcast to WebSocket clients
    this.broadcast({
      type: 'metrics_update',
      data: event,
    });
  }
}
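
To make these Prometheus metrics scrapeable, expose the default registry over HTTP. A minimal sketch using Express (the port and path are arbitrary choices, not requirements):

import express from 'express';
import { register } from 'prom-client';

const app = express();

// Prometheus scrapes this endpoint; the counters, gauges, and histograms above
// register themselves in the default registry automatically.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

app.listen(9464, () => console.log('Metrics exposed on :9464/metrics'));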

Advanced Monitoring Patterns
#

1. Conversation Flow Tracking
#

import { EventEmitter } from 'node:events';

interface ConversationTrace {
  conversationId: string;
  turns: Array<{
    turnId: string;
    timestamp: Date;
    input: string;
    output: string;
    metrics: {
      latency: number;
      tokens: number;
      cost: number;
    };
    feedback?: {
      score: number;
      comment?: string;
    };
  }>;
}

class ConversationTracker extends EventEmitter {
  private conversations = new Map<string, ConversationTrace>();

  startConversation(userId: string): string {
    const conversationId = generateId();
    this.conversations.set(conversationId, {
      conversationId,
      turns: [],
    });
    return conversationId;
  }

  addTurn(
    conversationId: string,
    turn: ConversationTrace['turns'][0]
  ) {
    const conversation = this.conversations.get(conversationId);
    if (conversation) {
      conversation.turns.push(turn);
      
      // Analyze conversation patterns
      this.analyzeConversation(conversation);
    }
  }

  private analyzeConversation(conversation: ConversationTrace) {
    const metrics = {
      totalTurns: conversation.turns.length,
      totalTokens: conversation.turns.reduce(
        (sum, turn) => sum + turn.metrics.tokens,
        0
      ),
      totalCost: conversation.turns.reduce(
        (sum, turn) => sum + turn.metrics.cost,
        0
      ),
      averageLatency:
        conversation.turns.reduce(
          (sum, turn) => sum + turn.metrics.latency,
          0
        ) / conversation.turns.length,
      satisfactionScore:
        conversation.turns
          .filter((turn) => turn.feedback)
          .reduce((sum, turn) => sum + (turn.feedback?.score || 0), 0) /
        conversation.turns.filter((turn) => turn.feedback).length,
    };

    // Emit analytics event
    this.emit('conversation_metrics', {
      conversationId: conversation.conversationId,
      metrics,
    });
  }
}

2. Anomaly Detection
#

class LLMAnomalyDetector {
  // Baselines are computed from recent healthy traffic and injected at construction
  constructor(
    private baselineMetrics: {
      latency: { mean: number; stdDev: number };
      tokenUsage: { mean: number; stdDev: number };
      cost: { mean: number; stdDev: number };
    }
  ) {}

  detectAnomalies(metrics: any): Array<{
    type: string;
    severity: 'low' | 'medium' | 'high';
    value: number;
    threshold: number;
    message: string;
  }> {
    const anomalies = [];

    // Latency anomaly detection
    if (metrics.latency > this.baselineMetrics.latency.mean + 
        3 * this.baselineMetrics.latency.stdDev) {
      anomalies.push({
        type: 'latency',
        severity: 'high',
        value: metrics.latency,
        threshold: this.baselineMetrics.latency.mean + 
                  3 * this.baselineMetrics.latency.stdDev,
        message: `Latency ${metrics.latency}ms exceeds threshold`,
      });
    }

    // Cost anomaly detection
    if (metrics.cost > this.baselineMetrics.cost.mean * 2) {
      anomalies.push({
        type: 'cost',
        severity: 'medium',
        value: metrics.cost,
        threshold: this.baselineMetrics.cost.mean * 2,
        message: `Cost $${metrics.cost} is unusually high`,
      });
    }

    // Token usage anomaly
    const tokenUsage = metrics.tokens.prompt + metrics.tokens.completion;
    if (tokenUsage > this.baselineMetrics.tokenUsage.mean + 
        2 * this.baselineMetrics.tokenUsage.stdDev) {
      anomalies.push({
        type: 'token_usage',
        severity: 'medium',
        value: tokenUsage,
        threshold: this.baselineMetrics.tokenUsage.mean + 
                   2 * this.baselineMetrics.tokenUsage.stdDev,
        message: `Token usage ${tokenUsage} exceeds normal range`,
      });
    }

    return anomalies;
  }
}
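
The baseline itself can be computed from a rolling window of recent, healthy requests. A sketch of how the mean and standard deviation passed to the detector might be derived (the helper names and record shape are illustrative):

// Mean and standard deviation for a series of samples
function baselineOf(samples: number[]) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Build the baseline the detector expects from recent request metrics
function buildBaseline(recent: Array<{ latency: number; tokens: number; cost: number }>) {
  return {
    latency: baselineOf(recent.map(r => r.latency)),
    tokenUsage: baselineOf(recent.map(r => r.tokens)),
    cost: baselineOf(recent.map(r => r.cost)),
  };
}

// const detector = new LLMAnomalyDetector(buildBaseline(recentHealthyRequests));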

3. Quality Monitoring
#

interface QualityMetrics {
  responseRelevance: number;  // 0-1 score
  factualAccuracy: number;    // 0-1 score
  coherence: number;          // 0-1 score
  safety: number;             // 0-1 score
  userSatisfaction: number;   // 0-1 score
}

class QualityMonitor {
  private evaluators: Map<string, (input: string, output: string) => Promise<number>>;

  constructor() {
    this.evaluators = new Map([
      ['relevance', this.evaluateRelevance.bind(this)],
      ['accuracy', this.evaluateAccuracy.bind(this)],
      ['coherence', this.evaluateCoherence.bind(this)],
      ['safety', this.evaluateSafety.bind(this)],
    ]);
  }

  async evaluateResponse(
    input: string,
    output: string,
    expectedOutput?: string
  ): Promise<QualityMetrics> {
    const evaluations = await Promise.all([
      this.evaluateRelevance(input, output),
      this.evaluateAccuracy(input, output, expectedOutput),
      this.evaluateCoherence(input, output),
      this.evaluateSafety(input, output),
    ]);

    return {
      responseRelevance: evaluations[0],
      factualAccuracy: evaluations[1],
      coherence: evaluations[2],
      safety: evaluations[3],
      userSatisfaction: 0, // Set by user feedback
    };
  }

  private async evaluateRelevance(
    input: string, 
    output: string
  ): Promise<number> {
    // Use an LLM to evaluate relevance
    const prompt = `
      Rate the relevance of this response to the question on a scale of 0-1:
      Question: ${input}
      Response: ${output}
      Score (0-1):
    `;

    const score = await this.callEvaluationLLM(prompt);
    return parseFloat(score);
  }

  private async evaluateSafety(
    input: string, 
    output: string
  ): Promise<number> {
    // Check for harmful content
    const harmfulPatterns = [
      /\b(violence|harm|illegal)\b/i,
      /\b(personal|private) information\b/i,
    ];

    const hasHarmfulContent = harmfulPatterns.some(
      pattern => pattern.test(output)
    );

    return hasHarmfulContent ? 0 : 1;
  }
}

Deployment Strategies
#

1. Cloudflare Workers for Edge Monitoring
#

// Cloudflare Worker for LLM monitoring
export interface Env {
  METRICS_DB: KVNamespace;
  ANALYTICS: AnalyticsEngineDataset;
}

export default {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === '/track') {
      const metrics = await request.json();
      
      // Store in Analytics Engine
      env.ANALYTICS.writeDataPoint({
        blobs: [metrics.model, metrics.userId],
        doubles: [
          metrics.latency,
          metrics.tokens.prompt,
          metrics.tokens.completion,
          metrics.cost,
        ],
        indexes: [metrics.success ? 1 : 0],
      });

      // Update aggregated metrics in KV
      const key = `metrics:${new Date().toISOString().split('T')[0]}`;
      const existing = await env.METRICS_DB.get(key, 'json') || {};
      
      const updated = {
        requests: (existing.requests || 0) + 1,
        totalTokens: (existing.totalTokens || 0) + 
                     metrics.tokens.prompt + metrics.tokens.completion,
        totalCost: (existing.totalCost || 0) + metrics.cost,
        errors: (existing.errors || 0) + (metrics.success ? 0 : 1),
      };

      await env.METRICS_DB.put(key, JSON.stringify(updated));

      return new Response('OK');
    }

    if (url.pathname === '/dashboard') {
      // Serve monitoring dashboard
      const html = await generateDashboardHTML(env);
      return new Response(html, {
        headers: { 'Content-Type': 'text/html' },
      });
    }

    return new Response('Not found', { status: 404 });
  },
};

2. Trigger.dev for Scheduled Analysis
#

import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";

const client = new TriggerClient({
  id: "llm-monitoring",
  apiKey: process.env.TRIGGER_API_KEY!,
});

// Daily cost analysis job
client.defineJob({
  id: "daily-llm-cost-analysis",
  name: "Daily LLM Cost Analysis",
  version: "1.0.0",
  trigger: eventTrigger({
    name: "daily.cost.analysis",
    schema: z.object({
      date: z.string(),
    }),
  }),
  run: async (payload, io, ctx) => {
    // Fetch metrics from the last 24 hours
    const metrics = await io.runTask("fetch-metrics", async () => {
      return await fetchDailyMetrics(payload.date);
    });

    // Analyze cost trends
    const analysis = await io.runTask("analyze-costs", async () => {
      return {
        totalCost: metrics.reduce((sum, m) => sum + m.cost, 0),
        costByModel: groupBy(metrics, 'model'),
        costByUser: groupBy(metrics, 'userId'),
        anomalies: detectCostAnomalies(metrics),
      };
    });

    // Send alerts if needed
    if (analysis.totalCost > DAILY_COST_THRESHOLD) {
      await io.sendEvent("cost-alert", {
        type: "daily_limit_exceeded",
        cost: analysis.totalCost,
        threshold: DAILY_COST_THRESHOLD,
      });
    }

    // Generate and send report
    const report = await io.runTask("generate-report", async () => {
      return generateCostReport(analysis);
    });

    await io.sendEmail("cost-report", {
      to: process.env.ADMIN_EMAIL!,
      subject: `LLM Cost Report - ${payload.date}`,
      html: report,
    });

    return analysis;
  },
});

// Schedule the job to run daily
client.defineSchedule({
  id: "daily-cost-schedule",
  cron: "0 9 * * *", // 9 AM daily
  job: "daily-llm-cost-analysis",
  payload: {
    date: new Date().toISOString().split('T')[0],
  },
});

Implementation Patterns
#

1. Middleware Pattern for Express/Fastify
#

import { Request, Response, NextFunction } from 'express';

function createLLMMonitoringMiddleware(config: {
  collector: MetricsCollector;
  costController: CostController;
  logger: Logger;
}) {
  return async (
    req: Request & { llmMetrics?: any },
    res: Response,
    next: NextFunction
  ) => {
    const startTime = Date.now();
    const requestId = req.headers['x-request-id'] || crypto.randomUUID();

    // Attach metrics collection to request
    req.llmMetrics = {
      requestId,
      startTime,
      userId: req.user?.id || 'anonymous',
    };

    // Override response methods to capture metrics
    const originalSend = res.send;
    res.send = function(data: any) {
      const endTime = Date.now();
      const latency = endTime - startTime;

      // Extract LLM metrics from response
      if (data && typeof data === 'object' && data.usage) {
        config.collector.trackRequest({
          requestId,
          model: data.model || req.body?.model,
          latency,
          tokens: {
            prompt: data.usage.prompt_tokens,
            completion: data.usage.completion_tokens,
          },
          cost: calculateCost(data.usage, data.model),
          userId: req.llmMetrics.userId,
          endpoint: req.path,
          success: res.statusCode < 400,
        });
      }

      return originalSend.call(this, data);
    };

    next();
  };
}

2. Decorator Pattern for Class Methods
#

function MonitorLLM(options?: {
  trackCost?: boolean;
  trackQuality?: boolean;
  sampleRate?: number;
}) {
  return function (
    target: any,
    propertyKey: string,
    descriptor: PropertyDescriptor
  ) {
    const originalMethod = descriptor.value;

    descriptor.value = async function (...args: any[]) {
      const shouldTrack = !options?.sampleRate || 
                         Math.random() < options.sampleRate;
      
      if (!shouldTrack) {
        return originalMethod.apply(this, args);
      }

      const startTime = Date.now();
      const context = {
        method: propertyKey,
        class: target.constructor.name,
        timestamp: new Date(),
      };

      try {
        const result = await originalMethod.apply(this, args);
        
        // Track success metrics
        metricsCollector.track({
          ...context,
          latency: Date.now() - startTime,
          success: true,
          result: options?.trackQuality ? result : undefined,
        });

        return result;
      } catch (error) {
        // Track error metrics
        metricsCollector.track({
          ...context,
          latency: Date.now() - startTime,
          success: false,
          error: error.message,
        });
        
        throw error;
      }
    };

    return descriptor;
  };
}

// Usage
class LLMService {
  @MonitorLLM({ trackCost: true, sampleRate: 1.0 })
  async generateResponse(prompt: string): Promise<string> {
    return await this.llm.complete(prompt);
  }
}

Cost Monitoring and Optimization
#

Dynamic Model Selection Based on Cost
#

class CostOptimizedRouter {
  private models = [
    { name: 'gpt-3.5-turbo', costPer1k: 0.002, quality: 0.7 },
    { name: 'gpt-4', costPer1k: 0.06, quality: 0.95 },
    { name: 'claude-2', costPer1k: 0.008, quality: 0.85 },
    { name: 'llama-2-70b', costPer1k: 0.001, quality: 0.65 },
  ];

  selectModel(requirements: {
    minQuality: number;
    maxCostPer1k: number;
    estimatedTokens: number;
  }) {
    const eligibleModels = this.models
      .filter(m => 
        m.quality >= requirements.minQuality &&
        m.costPer1k <= requirements.maxCostPer1k
      )
      .sort((a, b) => a.costPer1k - b.costPer1k);

    if (eligibleModels.length === 0) {
      throw new Error('No models meet requirements');
    }

    const selected = eligibleModels[0];
    const estimatedCost = (requirements.estimatedTokens / 1000) * selected.costPer1k;

    return {
      model: selected.name,
      estimatedCost,
      qualityScore: selected.quality,
    };
  }

  async routeWithFallback(prompt: string, options: any) {
    const models = ['gpt-3.5-turbo', 'claude-2', 'llama-2-70b'];
    
    for (const model of models) {
      try {
        const result = await this.callModel(model, prompt, options);
        
        // Track successful routing
        this.metrics.track({
          event: 'model_routing',
          model,
          success: true,
          attempt: models.indexOf(model) + 1,
        });
        
        return result;
      } catch (error) {
        if (models.indexOf(model) === models.length - 1) {
          throw error; // Last model failed
        }
        // Try next model
      }
    }
  }
}
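
For example, routing a task that needs reasonably high quality at a modest budget might look like this (the thresholds are illustrative):

const router = new CostOptimizedRouter();

// Pick the cheapest model that still meets the quality bar
const choice = router.selectModel({
  minQuality: 0.8,
  maxCostPer1k: 0.02,
  estimatedTokens: 1500,
});

console.log(choice);
// e.g. { model: 'claude-2', estimatedCost: 0.012, qualityScore: 0.85 }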

Quality Monitoring
#

Automated Quality Evaluation
#

class QualityEvaluator {
  private evaluationPrompts = {
    relevance: `Rate the relevance of this response (0-10):
Question: {question}
Answer: {answer}
Score:`,
    
    accuracy: `Evaluate the factual accuracy (0-10):
Question: {question}
Answer: {answer}
Ground Truth: {groundTruth}
Score:`,
    
    coherence: `Rate the coherence and clarity (0-10):
Text: {text}
Score:`,
  };

  async evaluateResponse(
    question: string,
    answer: string,
    groundTruth?: string
  ): Promise<QualityScores> {
    const evaluations = await Promise.all([
      this.evaluateMetric('relevance', { question, answer }),
      groundTruth ? 
        this.evaluateMetric('accuracy', { question, answer, groundTruth }) : 
        Promise.resolve(null),
      this.evaluateMetric('coherence', { text: answer }),
    ]);

    return {
      relevance: evaluations[0],
      accuracy: evaluations[1],
      coherence: evaluations[2],
      overall: this.calculateOverallScore(evaluations),
    };
  }

  private async evaluateMetric(
    metric: string,
    params: Record<string, string>
  ): Promise<number> {
    const prompt = this.fillTemplate(
      this.evaluationPrompts[metric],
      params
    );

    const response = await this.llm.complete(prompt);
    return this.parseScore(response);
  }

  private calculateOverallScore(scores: (number | null)[]): number {
    const validScores = scores.filter(s => s !== null) as number[];
    return validScores.reduce((a, b) => a + b, 0) / validScores.length;
  }
}

Performance Monitoring
#

Latency Optimization Strategies
#

import { LRUCache } from 'lru-cache';

class PerformanceOptimizer {
  private cache = new LRUCache<string, CachedResponse>({
    max: 1000,
    ttl: 1000 * 60 * 60, // 1 hour
  });

  async optimizedComplete(prompt: string, options: any) {
    // 1. Check cache first
    const cacheKey = this.generateCacheKey(prompt, options);
    const cached = this.cache.get(cacheKey);
    
    if (cached && !this.shouldInvalidate(cached)) {
      this.metrics.increment('cache_hits');
      return cached.response;
    }

    // 2. Use streaming for long responses (streamingComplete is an async generator; iterate its chunks)
    if (options.maxTokens > 500) {
      return this.streamingComplete(prompt, options);
    }

    // 3. Parallel processing for multiple prompts
    if (Array.isArray(prompt)) {
      return this.batchComplete(prompt, options);
    }

    // 4. Standard completion with timeout
    const timeout = options.timeout || 30000;
    const result = await Promise.race([
      this.llm.complete(prompt, options),
      this.timeout(timeout),
    ]);

    // Cache successful responses
    this.cache.set(cacheKey, {
      response: result,
      timestamp: Date.now(),
      tokens: result.usage?.total_tokens || 0,
    });

    return result;
  }

  private async *streamingComplete(prompt: string, options: any) {
    const stream = await this.llm.createStream(prompt, options);
    const chunks: string[] = [];
    let firstTokenTime: number | null = null;

    for await (const chunk of stream) {
      if (!firstTokenTime) {
        firstTokenTime = Date.now();
        this.metrics.record('time_to_first_token', firstTokenTime);
      }
      chunks.push(chunk);
      
      // Yield chunks for real-time processing
      yield chunk;
    }

    return chunks.join('');
  }
}

Alert Strategies
#

Multi-Level Alert System
#

class AlertManager {
  private alertChannels: Map<AlertSeverity, AlertChannel[]> = new Map([
    ['critical', [new PagerDutyChannel(), new SlackChannel('#critical-alerts')]],
    ['warning', [new SlackChannel('#llm-alerts'), new EmailChannel()]],
    ['info', [new LogChannel()]],
  ]);

  private alertRules: AlertRule[] = [
    {
      name: 'high_error_rate',
      condition: (metrics) => metrics.errorRate > 0.05,
      severity: 'critical',
      message: 'LLM error rate exceeds 5%',
    },
    {
      name: 'high_latency',
      condition: (metrics) => metrics.p95Latency > 5000,
      severity: 'warning',
      message: 'P95 latency exceeds 5 seconds',
    },
    {
      name: 'cost_spike',
      condition: (metrics) => metrics.hourlyCost > metrics.avgHourlyCost * 2,
      severity: 'warning',
      message: 'Hourly cost doubled compared to average',
    },
    {
      name: 'quality_degradation',
      condition: (metrics) => metrics.qualityScore < 0.7,
      severity: 'critical',
      message: 'Quality score below threshold',
    },
  ];

  async checkAlerts(currentMetrics: Metrics) {
    for (const rule of this.alertRules) {
      if (rule.condition(currentMetrics)) {
        await this.sendAlert({
          rule: rule.name,
          severity: rule.severity,
          message: rule.message,
          metrics: currentMetrics,
          timestamp: new Date(),
        });
      }
    }
  }

  private async sendAlert(alert: Alert) {
    const channels = this.alertChannels.get(alert.severity) || [];
    
    await Promise.all(
      channels.map(channel => 
        channel.send(alert).catch(err => 
          console.error(`Failed to send alert via ${channel.name}:`, err)
        )
      )
    );

    // Store alert history
    await this.storeAlertHistory(alert);
  }
}

Debugging Production Issues
#

Request Tracing and Replay
#

class LLMDebugger {
  private traceStore: TraceStore;

  async captureTrace(request: LLMRequest): Promise<string> {
    const traceId = generateTraceId();
    
    const trace: LLMTrace = {
      traceId,
      timestamp: new Date(),
      request: {
        prompt: request.prompt,
        model: request.model,
        parameters: request.parameters,
        headers: this.sanitizeHeaders(request.headers),
      },
      response: null,
      error: null,
      metadata: {
        userId: request.userId,
        sessionId: request.sessionId,
        environment: process.env.NODE_ENV,
        version: process.env.APP_VERSION,
      },
      timeline: [],
    };

    // Store initial trace
    await this.traceStore.save(trace);
    
    return traceId;
  }

  async replayRequest(traceId: string, options?: {
    modifyRequest?: (req: any) => any;
    compareResponses?: boolean;
  }): Promise<ReplayResult> {
    const trace = await this.traceStore.get(traceId);
    if (!trace) {
      throw new Error(`Trace ${traceId} not found`);
    }

    // Modify request if needed
    const request = options?.modifyRequest 
      ? options.modifyRequest(trace.request)
      : trace.request;

    // Replay the request
    const startTime = Date.now();
    try {
      const response = await this.llm.complete(request);
      
      const result: ReplayResult = {
        success: true,
        response,
        latency: Date.now() - startTime,
        originalTrace: trace,
      };

      // Compare with original if requested
      if (options?.compareResponses && trace.response) {
        result.comparison = this.compareResponses(
          trace.response,
          response
        );
      }

      return result;
    } catch (error) {
      return {
        success: false,
        error: error.message,
        latency: Date.now() - startTime,
        originalTrace: trace,
      };
    }
  }

  private compareResponses(original: any, replay: any) {
    return {
      textSimilarity: this.calculateSimilarity(
        original.content,
        replay.content
      ),
      tokenDifference: {
        prompt: replay.usage.prompt_tokens - original.usage.prompt_tokens,
        completion: replay.usage.completion_tokens - original.usage.completion_tokens,
      },
      costDifference: this.calculateCostDifference(original, replay),
    };
  }
}

Common Issues and Solutions
#

Issue 1: Token Limit Exceeded
#

Symptoms: 400 errors indicating the model’s maximum context length (or max_tokens) was exceeded

Solutions:

class TokenManager {
  async handleTokenLimit(prompt: string, maxTokens: number) {
    const estimatedTokens = this.estimateTokens(prompt);
    
    if (estimatedTokens > maxTokens) {
      // Strategy 1: Truncate the prompt (keep ~80% of the window, leaving room
      // for the completion) and return early if lossy truncation is acceptable:
      //   return this.llm.complete(this.truncateToTokenLimit(prompt, maxTokens * 0.8));

      // Strategy 2: Split into chunks, process each chunk, then merge the results
      const chunks = this.splitIntoChunks(prompt, maxTokens * 0.5);
      const results = await Promise.all(
        chunks.map(chunk => this.processChunk(chunk))
      );

      return this.mergeResults(results);
    }
    
    return this.llm.complete(prompt);
  }
}
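
The estimateTokens call above needs either a real tokenizer or a heuristic. A rough sketch using the common approximation of about four characters per token for English text (for exact counts you would use a tokenizer such as tiktoken):

// Rough token estimate: ~4 characters per token for English-like text.
// Good enough for guarding against context-length errors; not exact.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Leave headroom for the completion when checking against the model's window
function fitsInContext(prompt: string, contextWindow: number, completionBudget = 512): boolean {
  return estimateTokens(prompt) + completionBudget <= contextWindow;
}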

Issue 2: Rate Limiting
#

Symptoms: 429 errors, “Rate limit exceeded”

Solutions:

import PQueue from 'p-queue';

class RateLimitHandler {
  private queues = new Map<string, PQueue>();
  
  getQueue(model: string): PQueue {
    if (!this.queues.has(model)) {
      this.queues.set(model, new PQueue({
        concurrency: this.getConcurrencyLimit(model),
        interval: 60000, // 1 minute
        intervalCap: this.getIntervalCap(model),
      }));
    }
    return this.queues.get(model)!;
  }

  async executeWithRetry(
    fn: () => Promise<any>,
    options: {
      maxRetries: number;
      backoffMultiplier: number;
    }
  ) {
    let lastError;
    
    for (let i = 0; i <= options.maxRetries; i++) {
      try {
        return await fn();
      } catch (error) {
        if (error.status === 429) {
          const delay = this.calculateBackoff(i, options.backoffMultiplier);
          await this.sleep(delay);
          lastError = error;
        } else {
          throw error; // Non-retryable error
        }
      }
    }
    
    throw lastError;
  }
}
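
The calculateBackoff and sleep helpers referenced above can be as simple as exponential backoff with jitter. A minimal sketch (the base and maximum delays are assumptions, not provider requirements):

// Exponential backoff with full jitter: base * multiplier^attempt, randomized and capped
function calculateBackoff(attempt: number, multiplier: number, baseMs = 1000, maxMs = 60_000): number {
  const exponential = Math.min(maxMs, baseMs * Math.pow(multiplier, attempt));
  return Math.random() * exponential;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Attempt 0 waits up to ~1s; attempt 3 with multiplier 2 waits up to ~8s, and so on.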

Issue 3: Inconsistent Quality
#

Symptoms: Varying response quality, user complaints

Solutions:

class QualityAssurance {
  async ensureQuality(
    prompt: string,
    options: QualityOptions
  ): Promise<string> {
    // Strategy 1: Multiple generations + selection
    if (options.useConsensus) {
      const responses = await Promise.all(
        Array(3).fill(null).map(() => 
          this.llm.complete(prompt, { temperature: 0.7 })
        )
      );
      
      return this.selectBestResponse(responses);
    }
    
    // Strategy 2: Self-critique and revision
    if (options.useSelfCritique) {
      const initial = await this.llm.complete(prompt);
      const critique = await this.llm.complete(
        `Critique this response to "${prompt}":\n${initial}`
      );
      const revised = await this.llm.complete(
        `Revise the response below based on the critique.\nResponse: ${initial}\nCritique: ${critique}`
      );

      return revised;
    }
    
    // Strategy 3: Validation loop
    let response;
    let attempts = 0;
    
    do {
      response = await this.llm.complete(prompt);
      const isValid = await this.validateResponse(response, options);
      
      if (isValid) break;
      
      attempts++;
      prompt = this.refinePrompt(prompt, response);
    } while (attempts < options.maxAttempts);
    
    return response;
  }
}

FAQ
#

What metrics are most important for LLM monitoring?
#

The critical metrics to track are:

  1. Latency: Response time (p50, p95, p99)
  2. Cost: Token usage and $ spent per request/user
  3. Error Rate: Failed requests and error types
  4. Quality: User satisfaction and response accuracy
  5. Token Usage: Input/output token distribution

How do I monitor LLMs without storing sensitive data?
#

// Best practices for privacy-preserving monitoring
const privacyConfig = {
  // Hash prompts instead of storing raw text
  storePromptHash: true,
  storePromptText: false,
  
  // Aggregate metrics without user attribution
  anonymizeUserIds: true,
  
  // Redact PII before logging
  enablePIIRedaction: true,
  
  // Store only metadata
  metadataOnly: true,
};

What’s the difference between Langfuse and Helicone?
#

  • Langfuse: Open-source, self-hostable, focuses on traces and debugging
  • Helicone: Proxy-based, easier setup, better for analytics and caching

Choose Langfuse for full control and debugging, Helicone for quick setup and analytics.

How can I reduce monitoring overhead?
#

  1. Sampling: Monitor only a percentage of requests
  2. Async logging: Don’t block LLM calls for monitoring
  3. Batch metrics: Send metrics in batches, not per-request
  4. Edge monitoring: Use Cloudflare Workers for low-latency tracking
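
A minimal sketch combining the first three ideas above, sampling plus non-blocking, batched delivery (the sample rate, flush interval, and endpoint are assumptions):

class SampledBatchingReporter {
  private buffer: object[] = [];

  constructor(
    private endpoint: string,
    private sampleRate = 0.1,  // monitor ~10% of requests
    private flushEvery = 5000  // flush every 5 seconds
  ) {
    setInterval(() => void this.flush(), this.flushEvery).unref();
  }

  // Non-blocking: drop unsampled events, buffer the rest
  record(event: object) {
    if (Math.random() > this.sampleRate) return;
    this.buffer.push(event);
  }

  private async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0, this.buffer.length);
    try {
      await fetch(this.endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(batch),
      });
    } catch {
      // Monitoring must never break the request path; drop the batch on failure
    }
  }
}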

Should I use OpenTelemetry for LLM monitoring?
#

Yes, if you:

  • Already use OpenTelemetry in your stack
  • Need standardized observability
  • Want vendor-neutral instrumentation
  • Require distributed tracing

How do I set up alerts for LLM issues?
#

Implement multi-level alerts:

const alertConfig = {
  critical: {
    errorRate: 0.05,      // >5% errors
    downtime: 60,         // >1 minute down
    costSpike: 3.0,       // >3x normal cost
  },
  warning: {
    latencyP95: 5000,     // >5s P95 latency
    qualityScore: 0.7,    // <70% quality
    tokenUsage: 0.8,      // >80% of limit
  },
};

Can I monitor local LLMs the same way?
#

Yes, the patterns work for any LLM:

// Local LLM monitoring
const localLLMMonitor = new LLMMonitor({
  endpoint: 'http://localhost:11434', // Ollama
  metricsEndpoint: 'http://localhost:9090', // Prometheus
  customMetrics: {
    gpuUsage: true,
    memoryUsage: true,
    modelLoadTime: true,
  },
});

How do I debug slow LLM responses?
#

  1. Check token count: More tokens = slower response
  2. Monitor model load: Some models have cold starts
  3. Network latency: Check connection to LLM provider
  4. Rate limiting: You might be throttled
  5. Model selection: Larger models are slower
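
Separating time-to-first-token from total generation time quickly narrows these causes down: a slow first token points at queueing, rate limiting, or network issues, while a fast first token with a slow total points at long completions. A sketch using a streaming call, assuming the v4 openai client from the Quick Start example:

// Measure time-to-first-token (TTFT) vs. total time with a streaming call
async function profileCompletion(prompt: string) {
  const start = Date.now();
  let firstTokenMs: number | null = null;

  const stream = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenMs === null && chunk.choices[0]?.delta?.content) {
      firstTokenMs = Date.now() - start; // provider + queueing latency
    }
  }

  return {
    timeToFirstTokenMs: firstTokenMs,
    totalMs: Date.now() - start, // includes generation of all tokens
  };
}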

Best Practices
#

1. Privacy and Security
#

import crypto from 'node:crypto';

class PrivacyPreservingLogger {
  private sensitivePatterns = [
    /\b\d{3}-\d{2}-\d{4}\b/g,  // SSN
    /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/g, // Email
    /\b(?:\d{4}[\s-]?){3}\d{4}\b/g, // Credit card
  ];

  sanitize(text: string): string {
    let sanitized = text;
    
    for (const pattern of this.sensitivePatterns) {
      sanitized = sanitized.replace(pattern, '[REDACTED]');
    }
    
    return sanitized;
  }

  logLLMInteraction(interaction: {
    prompt: string;
    response: string;
    metadata: any;
  }) {
    const sanitized = {
      prompt: this.sanitize(interaction.prompt),
      response: this.sanitize(interaction.response),
      metadata: {
        ...interaction.metadata,
        promptHash: this.hash(interaction.prompt),
        responseHash: this.hash(interaction.response),
      },
    };

    // Log sanitized version
    this.logger.info('LLM Interaction', sanitized);
  }

  private hash(text: string): string {
    return crypto
      .createHash('sha256')
      .update(text)
      .digest('hex');
  }
}

2. Cost Control
#

class CostController {
  private limits: Map<string, { daily: number; monthly: number }>;
  private usage: Map<string, { daily: number; monthly: number }>;

  async checkLimit(userId: string, estimatedCost: number): Promise<{
    allowed: boolean;
    reason?: string;
    remainingBudget?: number;
  }> {
    const userLimits = this.limits.get(userId);
    const userUsage = this.usage.get(userId) || { daily: 0, monthly: 0 };

    if (!userLimits) {
      return { allowed: true }; // No limits set
    }

    // Check daily limit
    if (userUsage.daily + estimatedCost > userLimits.daily) {
      return {
        allowed: false,
        reason: 'Daily limit exceeded',
        remainingBudget: Math.max(0, userLimits.daily - userUsage.daily),
      };
    }

    // Check monthly limit
    if (userUsage.monthly + estimatedCost > userLimits.monthly) {
      return {
        allowed: false,
        reason: 'Monthly limit exceeded',
        remainingBudget: Math.max(0, userLimits.monthly - userUsage.monthly),
      };
    }

    return {
      allowed: true,
      remainingBudget: Math.min(
        userLimits.daily - userUsage.daily,
        userLimits.monthly - userUsage.monthly
      ),
    };
  }
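
  // The usage map above also has to be updated after each completed request and
  // reset on day/month boundaries. A simplified sketch of that bookkeeping
  // (assumes a scheduled job, e.g. the Trigger.dev cron above, calls the reset):
  recordUsage(userId: string, actualCost: number) {
    const current = this.usage.get(userId) || { daily: 0, monthly: 0 };
    this.usage.set(userId, {
      daily: current.daily + actualCost,
      monthly: current.monthly + actualCost,
    });
  }

  resetDailyUsage() {
    for (const [userId, u] of this.usage) {
      this.usage.set(userId, { ...u, daily: 0 });
    }
  }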
}

Conclusion
#

Effective LLM observability is crucial for running AI applications in production. Key takeaways:

  1. Monitor comprehensively: Track performance, cost, quality, and errors
  2. Use the right tools: Langfuse for traces, Helicone for analytics, Sentry for errors
  3. Build custom monitoring: Tailor metrics to your specific use case
  4. Automate analysis: Use scheduled jobs to detect trends and anomalies
  5. Protect privacy: Always sanitize sensitive data before logging
  6. Control costs: Implement budget limits and alerts

By implementing robust observability, you can ensure your LLM applications are reliable, cost-effective, and deliver value to users.
