Master the art of monitoring Large Language Model applications in production. This comprehensive tutorial covers everything from basic metrics to advanced observability strategies, helping you build reliable, cost-effective, and high-quality AI systems.
Table of Contents#
- Why LLM Observability Matters
- Prerequisites
- Quick Start: Basic LLM Monitoring
- Core Metrics to Monitor
- Observability Tools Comparison
- Building Custom Observability
- Advanced Monitoring Patterns
- Deployment Strategies
- Implementation Patterns
- Cost Monitoring and Optimization
- Quality Monitoring
- Performance Monitoring
- Alert Strategies
- Debugging Production Issues
- Common Issues and Solutions
- FAQ
- Best Practices
- Conclusion
Why LLM Observability Matters#
Large Language Models (LLMs) introduce unique challenges for monitoring and debugging. Unlike traditional applications where behavior is deterministic, LLMs are probabilistic systems that can produce different outputs for the same input. This makes observability not just helpful, but essential for production deployments.
graph TB
A[User Input] --> B[LLM Application]
B --> C[Token Usage]
B --> D[Latency]
B --> E[Cost]
B --> F[Quality]
B --> G[Errors]
C --> H[Observability Platform]
D --> H
E --> H
F --> H
G --> H
H --> I[Insights & Alerts]
style B fill:#f9f,stroke:#333,stroke-width:2px
style H fill:#bbf,stroke:#333,stroke-width:2px
Key Challenges#
- Non-deterministic outputs: Same prompt, different responses
- Cost management: Token usage can spiral out of control
- Quality assurance: How do you measure “good” responses?
- Performance variability: Latency can vary significantly
- Chain complexity: Multi-step LLM pipelines are hard to debug
Prerequisites#
Before implementing LLM observability, ensure you have:
- Basic understanding of LLM applications and APIs
- Production LLM deployment or development environment
- Access to monitoring tools (we’ll cover free options)
- Understanding of metrics and logging concepts
Required Tools#
# Check Node.js version
node --version # Should be 18.x or higher
# Install dependencies
npm install openai langfuse @opentelemetry/api @sentry/node
Quick Start: Basic LLM Monitoring#
Let’s start with a simple monitoring setup that tracks the essential metrics:
// basic-monitoring.ts
import { OpenAI } from 'openai';
import crypto from 'node:crypto'; // for crypto.randomUUID()
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
interface LLMMetrics {
requestId: string;
timestamp: Date;
model: string;
promptTokens: number;
completionTokens: number;
totalTokens: number;
latency: number;
cost: number;
error?: string;
}
class BasicLLMMonitor {
private metrics: LLMMetrics[] = [];
async monitoredCompletion(prompt: string, model = 'gpt-3.5-turbo') {
const startTime = Date.now();
const requestId = crypto.randomUUID();
try {
const response = await openai.chat.completions.create({
model,
messages: [{ role: 'user', content: prompt }],
});
const metrics: LLMMetrics = {
requestId,
timestamp: new Date(),
model,
promptTokens: response.usage?.prompt_tokens || 0,
completionTokens: response.usage?.completion_tokens || 0,
totalTokens: response.usage?.total_tokens || 0,
latency: Date.now() - startTime,
cost: this.calculateCost(response.usage, model),
};
this.metrics.push(metrics);
console.log('LLM Metrics:', metrics);
return response.choices[0].message.content;
} catch (error) {
this.metrics.push({
requestId,
timestamp: new Date(),
model,
promptTokens: 0,
completionTokens: 0,
totalTokens: 0,
latency: Date.now() - startTime,
cost: 0,
error: error.message,
});
throw error;
}
}
private calculateCost(usage: any, model: string): number {
const pricing = {
'gpt-3.5-turbo': { prompt: 0.0005, completion: 0.0015 },
'gpt-4': { prompt: 0.03, completion: 0.06 },
'gpt-4-turbo': { prompt: 0.01, completion: 0.03 },
};
const rates = pricing[model] || pricing['gpt-3.5-turbo'];
return (usage.prompt_tokens * rates.prompt +
usage.completion_tokens * rates.completion) / 1000;
}
getMetricsSummary() {
return {
totalRequests: this.metrics.length,
totalCost: this.metrics.reduce((sum, m) => sum + m.cost, 0),
averageLatency: this.metrics.reduce((sum, m) => sum + m.latency, 0) / this.metrics.length,
errorRate: this.metrics.filter(m => m.error).length / this.metrics.length,
};
}
}
// Usage
const monitor = new BasicLLMMonitor();
const response = await monitor.monitoredCompletion("Explain LLM monitoring");
console.log(monitor.getMetricsSummary());
Core Metrics to Monitor#
1. Performance Metrics#
interface LLMPerformanceMetrics {
latency: {
p50: number; // Median latency
p95: number; // 95th percentile
p99: number; // 99th percentile
};
throughput: number; // Requests per second
concurrency: number; // Parallel requests
queueDepth: number; // Waiting requests
}
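A quick sketch of how these percentiles can be computed from raw latency samples (the computeLatencyPercentiles helper below is illustrative, not part of any library):
// Derive p50/p95/p99 from raw latency samples (nearest-rank method)
function computeLatencyPercentiles(samples: number[]): LLMPerformanceMetrics['latency'] {
  const sorted = [...samples].sort((a, b) => a - b);
  const pick = (p: number) =>
    sorted[Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)] ?? 0;
  return { p50: pick(50), p95: pick(95), p99: pick(99) };
}
// computeLatencyPercentiles([820, 940, 1100, 1200, 3400])
// => { p50: 1100, p95: 3400, p99: 3400 }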
2. Cost Metrics#
interface LLMCostMetrics {
tokens: {
prompt: number; // Input tokens
completion: number; // Output tokens
total: number; // Total tokens
};
cost: {
perRequest: number; // Cost per request
perUser: number; // Cost per user
total: number; // Total cost
};
model: string; // Model used (gpt-4, claude-3, etc.)
}
3. Quality Metrics#
interface LLMQualityMetrics {
sentiment: number; // User satisfaction
accuracy: number; // Correctness of responses
relevance: number; // How relevant the response is
safety: number; // Content safety score
feedback: {
positive: number;
negative: number;
neutral: number;
};
}
Observability Tools Comparison#
1. Langfuse - Open Source LLM Observability#
Langfuse is an open-source platform specifically designed for LLM observability.
Implementation#
import { Langfuse } from 'langfuse';
// Initialize Langfuse
const langfuse = new Langfuse({
publicKey: process.env.LANGFUSE_PUBLIC_KEY,
secretKey: process.env.LANGFUSE_SECRET_KEY,
baseUrl: 'https://cloud.langfuse.com', // or self-hosted URL
});
// Trace LLM calls
async function tracedLLMCall(prompt: string, userId: string) {
// Create a trace
const trace = langfuse.trace({
name: 'chat-completion',
userId: userId,
metadata: {
environment: 'production',
version: '1.0.0',
},
});
// Track the generation
const generation = trace.generation({
name: 'gpt-4-generation',
model: 'gpt-4',
modelParameters: {
temperature: 0.7,
maxTokens: 500,
},
input: prompt,
});
try {
// Make the actual LLM call
const response = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
temperature: 0.7,
max_tokens: 500,
});
// End generation with output
generation.end({
output: response.choices[0].message.content,
usage: {
promptTokens: response.usage?.prompt_tokens,
completionTokens: response.usage?.completion_tokens,
},
});
return response.choices[0].message.content;
} catch (error) {
generation.end({
statusMessage: error.message,
level: 'ERROR',
});
throw error;
} finally {
// Ensure trace is flushed
await langfuse.flush();
}
}
// Score responses for quality tracking
async function scoreResponse(traceId: string, score: number, comment?: string) {
langfuse.score({
traceId: traceId,
name: 'user-feedback',
value: score,
comment: comment,
});
}
Self-Hosting Langfuse with Docker#
# docker-compose.yml
version: '3.8'
services:
  langfuse:
    image: langfuse/langfuse:latest
    environment:
      DATABASE_URL: postgresql://user:password@postgres:5432/langfuse
      NEXTAUTH_URL: http://localhost:3000
      NEXTAUTH_SECRET: your-secret-key
      SALT: your-salt-key
    ports:
      - "3000:3000"
    depends_on:
      - postgres
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: langfuse
    volumes:
      - langfuse_data:/var/lib/postgresql/data
volumes:
  langfuse_data:
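Once the stack is up, point the Langfuse client at your own instance instead of the cloud endpoint. A minimal sketch, assuming you have generated API keys in the self-hosted UI:
import { Langfuse } from 'langfuse';
// Same client as before, but targeting the self-hosted deployment
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: 'http://localhost:3000', // matches the docker-compose port mapping above
});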
2. Helicone - LLM Analytics Platform#
Helicone provides analytics and caching for LLM applications.
Integration via Proxy#
Helicone acts as a transparent proxy between your application and OpenAI, automatically capturing metrics without code changes. This TypeScript example shows how to configure OpenAI to route through Helicone’s proxy endpoint. The configuration includes:
- Base Path: Points to Helicone’s proxy instead of OpenAI directly
- Authentication: Uses your Helicone API key for access
- Properties: Custom metadata for segmenting analytics by app and environment
- User Tracking: Associates requests with specific users for usage analysis
- Caching: Enables response caching to reduce costs and latency
// Use Helicone as a proxy for OpenAI
// Helicone proxy setup with the openai v3 SDK (the v4 SDK takes `baseURL` and
// `defaultHeaders` options in `new OpenAI({...})` instead)
import { Configuration, OpenAIApi } from 'openai';
const configuration = new Configuration({
apiKey: process.env.OPENAI_API_KEY,
basePath: 'https://oai.hconeai.com/v1',
defaultHeaders: {
'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
'Helicone-Property-App': 'my-app',
'Helicone-Property-Environment': 'production',
},
});
const openai = new OpenAIApi(configuration);
// Use OpenAI as normal - Helicone automatically logs
async function generateText(prompt: string, userId: string) {
const response = await openai.createChatCompletion(
{
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
user: userId, // Helicone tracks per-user usage
},
{
// Per-request headers go in the second (request options) argument
headers: {
'Helicone-Property-UserId': userId,
'Helicone-Cache-Enabled': 'true', // Enable caching
},
}
);
return response.data.choices[0].message?.content;
}
// Custom properties for better analytics
async function taggedGeneration(prompt: string, tags: Record<string, string>) {
const headers = Object.entries(tags).reduce((acc, [key, value]) => {
acc[`Helicone-Property-${key}`] = value;
return acc;
}, {} as Record<string, string>);
const response = await openai.createChatCompletion(
{
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }],
},
{ headers } // custom properties are sent as request headers
);
return response.data;
}
3. OpenLLMetry - OpenTelemetry for LLMs#
OpenLLMetry extends OpenTelemetry for LLM-specific observability.
Implementation with OpenTelemetry#
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';
import * as Sentry from '@sentry/node';
const tracer = trace.getTracer('llm-service');
// Initialize OpenTelemetry with Sentry
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'llm-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations(),
// Add LLM-specific instrumentation (custom class shown below; in a real
// module, declare it before this point so the reference is initialized)
new LLMInstrumentation({
capturePrompt: true,
captureCompletion: true,
captureTokenUsage: true,
}),
],
});
// Initialize Sentry with OpenTelemetry integration
Sentry.init({
dsn: process.env.SENTRY_DSN,
integrations: [
new Sentry.Integrations.OpenTelemetry({
startIncomingSpanMiddleware: true,
}),
],
tracesSampleRate: 1.0,
});
sdk.start();
// Custom LLM instrumentation
class LLMInstrumentation {
constructor(private config: any) {}
instrument(llmClient: any) {
const original = llmClient.complete;
llmClient.complete = async function(...args: any[]) {
const span = tracer.startSpan('llm.completion', {
attributes: {
'llm.model': args[0].model,
'llm.temperature': args[0].temperature,
'llm.max_tokens': args[0].max_tokens,
'llm.prompt_length': args[0].prompt.length,
},
});
try {
const result = await original.apply(this, args);
span.setAttributes({
'llm.completion_length': result.text.length,
'llm.prompt_tokens': result.usage.prompt_tokens,
'llm.completion_tokens': result.usage.completion_tokens,
'llm.total_tokens': result.usage.total_tokens,
});
return result;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw error;
} finally {
span.end();
}
};
}
}
4. Sentry for LLM Error Tracking#
Sentry provides excellent error tracking and performance monitoring for LLM applications.
Advanced Sentry Integration#
import * as Sentry from '@sentry/node';
import { ProfilingIntegration } from '@sentry/profiling-node';
// Initialize Sentry with LLM-specific configuration
Sentry.init({
dsn: process.env.SENTRY_DSN,
integrations: [
new ProfilingIntegration(),
new Sentry.Integrations.Http({ tracing: true }),
],
tracesSampleRate: 1.0,
profilesSampleRate: 1.0,
beforeSend(event, hint) {
// Redact sensitive information from LLM prompts
if (event.extra?.prompt) {
event.extra.prompt = redactSensitiveInfo(event.extra.prompt);
}
return event;
},
});
// LLM-specific error boundary
class LLMErrorBoundary {
static async wrap<T>(
operation: () => Promise<T>,
context: {
model: string;
prompt: string;
userId: string;
[key: string]: any;
}
): Promise<T> {
const transaction = Sentry.startTransaction({
op: 'llm.request',
name: `LLM ${context.model} Request`,
});
Sentry.getCurrentHub().configureScope((scope) => {
scope.setContext('llm', {
model: context.model,
promptLength: context.prompt.length,
userId: context.userId,
});
});
try {
const result = await operation();
transaction.setStatus('ok');
return result;
} catch (error) {
// Capture specific LLM errors with context
if (error.code === 'rate_limit_exceeded') {
Sentry.captureException(error, {
level: 'warning',
tags: {
llm_error_type: 'rate_limit',
model: context.model,
},
extra: {
...context,
prompt: context.prompt.substring(0, 200), // Truncate for privacy
},
});
} else {
Sentry.captureException(error);
}
transaction.setStatus('internal_error');
throw error;
} finally {
transaction.finish();
}
}
}
// Usage
const response = await LLMErrorBoundary.wrap(
() => generateCompletion(prompt),
{
model: 'gpt-4',
prompt: userPrompt,
userId: user.id,
requestId: generateRequestId(),
}
);
Building Custom Observability#
1. Metrics Collection Pipeline#
import { EventEmitter } from 'events';
import { InfluxDB, Point } from '@influxdata/influxdb-client';
class LLMMetricsCollector extends EventEmitter {
private influx: InfluxDB;
private writeApi: any;
constructor(config: {
url: string;
token: string;
org: string;
bucket: string;
}) {
super();
this.influx = new InfluxDB({
url: config.url,
token: config.token,
});
this.writeApi = this.influx.getWriteApi(config.org, config.bucket);
}
trackRequest(metrics: {
model: string;
latency: number;
tokens: { prompt: number; completion: number };
cost: number;
userId: string;
success: boolean;
error?: string;
}) {
const point = new Point('llm_request')
.tag('model', metrics.model)
.tag('user_id', metrics.userId)
.tag('success', String(metrics.success))
.floatField('latency', metrics.latency)
.intField('prompt_tokens', metrics.tokens.prompt)
.intField('completion_tokens', metrics.tokens.completion)
.floatField('cost', metrics.cost)
.timestamp(new Date());
if (metrics.error) {
point.tag('error_type', metrics.error);
}
this.writeApi.writePoint(point);
this.emit('metrics', metrics);
}
async flush() {
await this.writeApi.flush();
}
}
// Middleware for automatic tracking
function createLLMMiddleware(collector: LLMMetricsCollector) {
return async (req: any, res: any, next: any) => {
const startTime = Date.now();
// Intercept LLM calls
const originalJson = res.json;
res.json = function(data: any) {
const latency = Date.now() - startTime;
if (data.usage) {
collector.trackRequest({
model: req.body.model || 'unknown',
latency,
tokens: {
prompt: data.usage.prompt_tokens,
completion: data.usage.completion_tokens,
},
cost: calculateCost(data.usage, req.body.model),
userId: req.user?.id || 'anonymous',
success: true,
});
}
return originalJson.call(this, data);
};
next();
};
}
2. Real-time Monitoring Dashboard#
import { WebSocket } from 'ws';
import { Gauge, Counter, Histogram } from 'prom-client';
class RealtimeMonitor {
private wsClients = new Set<WebSocket>();
// Prometheus metrics
private requestCounter = new Counter({
name: 'llm_requests_total',
help: 'Total number of LLM requests',
labelNames: ['model', 'status'],
});
private tokenGauge = new Gauge({
name: 'llm_tokens_used',
help: 'Number of tokens used',
labelNames: ['model', 'type'],
});
private latencyHistogram = new Histogram({
name: 'llm_request_duration_seconds',
help: 'LLM request latency',
labelNames: ['model'],
buckets: [0.1, 0.5, 1, 2, 5, 10],
});
private costGauge = new Gauge({
name: 'llm_cost_dollars',
help: 'Cost in dollars',
labelNames: ['model', 'user_id'],
});
broadcast(data: any) {
const message = JSON.stringify({
timestamp: new Date().toISOString(),
...data,
});
this.wsClients.forEach((client) => {
if (client.readyState === WebSocket.OPEN) {
client.send(message);
}
});
}
updateMetrics(event: any) {
// Update Prometheus metrics
this.requestCounter.inc({
model: event.model,
status: event.success ? 'success' : 'error',
});
this.tokenGauge.set(
{ model: event.model, type: 'prompt' },
event.tokens.prompt
);
this.tokenGauge.set(
{ model: event.model, type: 'completion' },
event.tokens.completion
);
this.latencyHistogram.observe(
{ model: event.model },
event.latency / 1000
);
this.costGauge.set(
{ model: event.model, user_id: event.userId },
event.cost
);
// Broadcast to WebSocket clients
this.broadcast({
type: 'metrics_update',
data: event,
});
}
}
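To let Prometheus actually scrape these metrics, expose the default prom-client registry over HTTP. A minimal sketch using Express (the path and port are assumptions, adjust to your setup):
import express from 'express';
import { register } from 'prom-client';

const app = express();

// Prometheus scrapes this endpoint on its own schedule
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9464);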
Advanced Monitoring Patterns#
1. Conversation Flow Tracking#
interface ConversationTrace {
conversationId: string;
turns: Array<{
turnId: string;
timestamp: Date;
input: string;
output: string;
metrics: {
latency: number;
tokens: number;
cost: number;
};
feedback?: {
score: number;
comment?: string;
};
}>;
}
class ConversationTracker extends EventEmitter { // requires: import { EventEmitter } from 'events'
private conversations = new Map<string, ConversationTrace>();
startConversation(userId: string): string {
const conversationId = generateId();
this.conversations.set(conversationId, {
conversationId,
turns: [],
});
return conversationId;
}
addTurn(
conversationId: string,
turn: ConversationTrace['turns'][0]
) {
const conversation = this.conversations.get(conversationId);
if (conversation) {
conversation.turns.push(turn);
// Analyze conversation patterns
this.analyzeConversation(conversation);
}
}
private analyzeConversation(conversation: ConversationTrace) {
const metrics = {
totalTurns: conversation.turns.length,
totalTokens: conversation.turns.reduce(
(sum, turn) => sum + turn.metrics.tokens,
0
),
totalCost: conversation.turns.reduce(
(sum, turn) => sum + turn.metrics.cost,
0
),
averageLatency:
conversation.turns.reduce(
(sum, turn) => sum + turn.metrics.latency,
0
) / conversation.turns.length,
satisfactionScore:
conversation.turns
.filter((turn) => turn.feedback)
.reduce((sum, turn) => sum + (turn.feedback?.score || 0), 0) /
conversation.turns.filter((turn) => turn.feedback).length,
};
// Emit analytics event
this.emit('conversation_metrics', {
conversationId: conversation.conversationId,
metrics,
});
}
}
2. Anomaly Detection#
class LLMAnomalyDetector {
private baselineMetrics: {
latency: { mean: number; stdDev: number };
tokenUsage: { mean: number; stdDev: number };
cost: { mean: number; stdDev: number };
};
detectAnomalies(metrics: any): Array<{
type: string;
severity: 'low' | 'medium' | 'high';
value: number;
threshold: number;
message: string;
}> {
const anomalies = [];
// Latency anomaly detection
if (metrics.latency > this.baselineMetrics.latency.mean +
3 * this.baselineMetrics.latency.stdDev) {
anomalies.push({
type: 'latency',
severity: 'high',
value: metrics.latency,
threshold: this.baselineMetrics.latency.mean +
3 * this.baselineMetrics.latency.stdDev,
message: `Latency ${metrics.latency}ms exceeds threshold`,
});
}
// Cost anomaly detection
if (metrics.cost > this.baselineMetrics.cost.mean * 2) {
anomalies.push({
type: 'cost',
severity: 'medium',
value: metrics.cost,
threshold: this.baselineMetrics.cost.mean * 2,
message: `Cost $${metrics.cost} is unusually high`,
});
}
// Token usage anomaly
const tokenUsage = metrics.tokens.prompt + metrics.tokens.completion;
if (tokenUsage > this.baselineMetrics.tokenUsage.mean +
2 * this.baselineMetrics.tokenUsage.stdDev) {
anomalies.push({
type: 'token_usage',
severity: 'medium',
value: tokenUsage,
threshold: this.baselineMetrics.tokenUsage.mean +
2 * this.baselineMetrics.tokenUsage.stdDev,
message: `Token usage ${tokenUsage} exceeds normal range`,
});
}
return anomalies;
}
}
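The detector compares incoming values against baselineMetrics, which the snippet leaves unpopulated. One simple approach is to recompute the mean and standard deviation over a window of recent requests; the computeBaseline helper below is an illustrative sketch:
// Mean and standard deviation over a window of recent samples
function computeBaseline(values: number[]): { mean: number; stdDev: number } {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// e.g. rebuild the baseline every few minutes from recent metrics:
// baselineMetrics = {
//   latency: computeBaseline(recentLatencies),
//   tokenUsage: computeBaseline(recentTokenTotals),
//   cost: computeBaseline(recentCosts),
// };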
3. Quality Monitoring#
interface QualityMetrics {
responseRelevance: number; // 0-1 score
factualAccuracy: number; // 0-1 score
coherence: number; // 0-1 score
safety: number; // 0-1 score
userSatisfaction: number; // 0-1 score
}
class QualityMonitor {
private evaluators: Map<string, (input: string, output: string) => Promise<number>>;
constructor() {
this.evaluators = new Map([
['relevance', this.evaluateRelevance.bind(this)],
['accuracy', this.evaluateAccuracy.bind(this)],
['coherence', this.evaluateCoherence.bind(this)],
['safety', this.evaluateSafety.bind(this)],
]);
}
async evaluateResponse(
input: string,
output: string,
expectedOutput?: string
): Promise<QualityMetrics> {
const evaluations = await Promise.all([
this.evaluateRelevance(input, output),
this.evaluateAccuracy(input, output, expectedOutput),
this.evaluateCoherence(input, output),
this.evaluateSafety(input, output),
]);
return {
responseRelevance: evaluations[0],
factualAccuracy: evaluations[1],
coherence: evaluations[2],
safety: evaluations[3],
userSatisfaction: 0, // Set by user feedback
};
}
private async evaluateRelevance(
input: string,
output: string
): Promise<number> {
// Use an LLM to evaluate relevance
const prompt = `
Rate the relevance of this response to the question on a scale of 0-1:
Question: ${input}
Response: ${output}
Score (0-1):
`;
const score = await this.callEvaluationLLM(prompt);
return parseFloat(score);
}
private async evaluateSafety(
input: string,
output: string
): Promise<number> {
// Check for harmful content
const harmfulPatterns = [
/\b(violence|harm|illegal)\b/i,
/\b(personal|private) information\b/i,
];
const hasHarmfulContent = harmfulPatterns.some(
pattern => pattern.test(output)
);
return hasHarmfulContent ? 0 : 1;
}
}
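The relevance evaluator delegates to callEvaluationLLM, which isn't shown above (the accuracy and coherence evaluators the class references would follow the same pattern). A minimal sketch, assuming the OpenAI client from the Quick Start and a model that replies with just a number:
import { OpenAI } from 'openai';

const evaluationClient = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Ask a small, inexpensive model to return a single numeric score
async function callEvaluationLLM(prompt: string): Promise<string> {
  const response = await evaluationClient.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0,
  });
  return response.choices[0].message.content ?? '0';
}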
Deployment Strategies#
1. Cloudflare Workers for Edge Monitoring#
// Cloudflare Worker for LLM monitoring
export interface Env {
METRICS_DB: KVNamespace;
ANALYTICS: AnalyticsEngineDataset;
}
export default {
async fetch(
request: Request,
env: Env,
ctx: ExecutionContext
): Promise<Response> {
const url = new URL(request.url);
if (url.pathname === '/track') {
const metrics = await request.json();
// Store in Analytics Engine
env.ANALYTICS.writeDataPoint({
blobs: [metrics.model, metrics.userId],
doubles: [
metrics.latency,
metrics.tokens.prompt,
metrics.tokens.completion,
metrics.cost,
],
indexes: [metrics.success ? 1 : 0],
});
// Update aggregated metrics in KV
const key = `metrics:${new Date().toISOString().split('T')[0]}`;
const existing = await env.METRICS_DB.get(key, 'json') || {};
const updated = {
requests: (existing.requests || 0) + 1,
totalTokens: (existing.totalTokens || 0) +
metrics.tokens.prompt + metrics.tokens.completion,
totalCost: (existing.totalCost || 0) + metrics.cost,
errors: (existing.errors || 0) + (metrics.success ? 0 : 1),
};
await env.METRICS_DB.put(key, JSON.stringify(updated));
return new Response('OK');
}
if (url.pathname === '/dashboard') {
// Serve monitoring dashboard
const html = await generateDashboardHTML(env);
return new Response(html, {
headers: { 'Content-Type': 'text/html' },
});
}
return new Response('Not found', { status: 404 });
},
};
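On the application side, metrics can be shipped to this Worker with a plain fetch call. A minimal sketch (the Worker URL is a placeholder for your own deployment):
// Fire-and-forget reporting to the edge Worker's /track endpoint
async function reportToEdge(metrics: {
  model: string;
  userId: string;
  latency: number;
  tokens: { prompt: number; completion: number };
  cost: number;
  success: boolean;
}) {
  try {
    await fetch('https://llm-monitoring.example.workers.dev/track', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(metrics),
    });
  } catch {
    // Never let monitoring failures break the request path
  }
}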
2. Trigger.dev for Scheduled Analysis#
import { TriggerClient, eventTrigger } from "@trigger.dev/sdk";
import { z } from "zod";
const client = new TriggerClient({
id: "llm-monitoring",
apiKey: process.env.TRIGGER_API_KEY!,
});
// Daily cost analysis job
client.defineJob({
id: "daily-llm-cost-analysis",
name: "Daily LLM Cost Analysis",
version: "1.0.0",
trigger: eventTrigger({
name: "daily.cost.analysis",
schema: z.object({
date: z.string(),
}),
}),
run: async (payload, io, ctx) => {
// Fetch metrics from the last 24 hours
const metrics = await io.runTask("fetch-metrics", async () => {
return await fetchDailyMetrics(payload.date);
});
// Analyze cost trends
const analysis = await io.runTask("analyze-costs", async () => {
return {
totalCost: metrics.reduce((sum, m) => sum + m.cost, 0),
costByModel: groupBy(metrics, 'model'),
costByUser: groupBy(metrics, 'userId'),
anomalies: detectCostAnomalies(metrics),
};
});
// Send alerts if needed
if (analysis.totalCost > DAILY_COST_THRESHOLD) {
await io.sendEvent("cost-alert", {
type: "daily_limit_exceeded",
cost: analysis.totalCost,
threshold: DAILY_COST_THRESHOLD,
});
}
// Generate and send report
const report = await io.runTask("generate-report", async () => {
return generateCostReport(analysis);
});
await io.sendEmail("cost-report", {
to: process.env.ADMIN_EMAIL!,
subject: `LLM Cost Report - ${payload.date}`,
html: report,
});
return analysis;
},
});
// Schedule the job to run daily
client.defineSchedule({
id: "daily-cost-schedule",
cron: "0 9 * * *", // 9 AM daily
job: "daily-llm-cost-analysis",
payload: {
date: new Date().toISOString().split('T')[0],
},
});
Implementation Patterns#
1. Middleware Pattern for Express/Fastify#
import { Request, Response, NextFunction } from 'express';
function createLLMMonitoringMiddleware(config: {
collector: MetricsCollector;
costController: CostController;
logger: Logger;
}) {
return async (
req: Request & { llmMetrics?: any },
res: Response,
next: NextFunction
) => {
const startTime = Date.now();
const requestId = req.headers['x-request-id'] || crypto.randomUUID();
// Attach metrics collection to request
req.llmMetrics = {
requestId,
startTime,
userId: req.user?.id || 'anonymous',
};
// Override response methods to capture metrics
const originalSend = res.send;
res.send = function(data: any) {
const endTime = Date.now();
const latency = endTime - startTime;
// Extract LLM metrics from response
if (data && typeof data === 'object' && data.usage) {
config.collector.trackRequest({
requestId,
model: data.model || req.body?.model,
latency,
tokens: {
prompt: data.usage.prompt_tokens,
completion: data.usage.completion_tokens,
},
cost: calculateCost(data.usage, data.model),
userId: req.llmMetrics.userId,
endpoint: req.path,
success: res.statusCode < 400,
});
}
return originalSend.call(this, data);
};
next();
};
}
2. Decorator Pattern for Class Methods#
function MonitorLLM(options?: {
trackCost?: boolean;
trackQuality?: boolean;
sampleRate?: number;
}) {
return function (
target: any,
propertyKey: string,
descriptor: PropertyDescriptor
) {
const originalMethod = descriptor.value;
descriptor.value = async function (...args: any[]) {
const shouldTrack = !options?.sampleRate ||
Math.random() < options.sampleRate;
if (!shouldTrack) {
return originalMethod.apply(this, args);
}
const startTime = Date.now();
const context = {
method: propertyKey,
class: target.constructor.name,
timestamp: new Date(),
};
try {
const result = await originalMethod.apply(this, args);
// Track success metrics
metricsCollector.track({
...context,
latency: Date.now() - startTime,
success: true,
result: options?.trackQuality ? result : undefined,
});
return result;
} catch (error) {
// Track error metrics
metricsCollector.track({
...context,
latency: Date.now() - startTime,
success: false,
error: error.message,
});
throw error;
}
};
return descriptor;
};
}
// Usage
class LLMService {
@MonitorLLM({ trackCost: true, sampleRate: 1.0 })
async generateResponse(prompt: string): Promise<string> {
return await this.llm.complete(prompt);
}
}
Cost Monitoring and Optimization#
Dynamic Model Selection Based on Cost#
class CostOptimizedRouter {
private models = [
{ name: 'gpt-3.5-turbo', costPer1k: 0.002, quality: 0.7 },
{ name: 'gpt-4', costPer1k: 0.06, quality: 0.95 },
{ name: 'claude-2', costPer1k: 0.008, quality: 0.85 },
{ name: 'llama-2-70b', costPer1k: 0.001, quality: 0.65 },
];
selectModel(requirements: {
minQuality: number;
maxCostPer1k: number;
estimatedTokens: number;
}) {
const eligibleModels = this.models
.filter(m =>
m.quality >= requirements.minQuality &&
m.costPer1k <= requirements.maxCostPer1k
)
.sort((a, b) => a.costPer1k - b.costPer1k);
if (eligibleModels.length === 0) {
throw new Error('No models meet requirements');
}
const selected = eligibleModels[0];
const estimatedCost = (requirements.estimatedTokens / 1000) * selected.costPer1k;
return {
model: selected.name,
estimatedCost,
qualityScore: selected.quality,
};
}
async routeWithFallback(prompt: string, options: any) {
const models = ['gpt-3.5-turbo', 'claude-2', 'llama-2-70b'];
for (const model of models) {
try {
const result = await this.callModel(model, prompt, options);
// Track successful routing
this.metrics.track({
event: 'model_routing',
model,
success: true,
attempt: models.indexOf(model) + 1,
});
return result;
} catch (error) {
if (models.indexOf(model) === models.length - 1) {
throw error; // Last model failed
}
// Try next model
}
}
}
}
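For example, a request that needs reasonably high quality on a tight per-token budget would be routed like this:
const router = new CostOptimizedRouter();

// Cheapest model with quality >= 0.8 and cost <= $0.01 per 1k tokens
const choice = router.selectModel({
  minQuality: 0.8,
  maxCostPer1k: 0.01,
  estimatedTokens: 1500,
});

console.log(choice);
// => { model: 'claude-2', estimatedCost: 0.012, qualityScore: 0.85 }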
Quality Monitoring#
Automated Quality Evaluation#
class QualityEvaluator {
private evaluationPrompts = {
relevance: `Rate the relevance of this response (0-10):
Question: {question}
Answer: {answer}
Score:`,
accuracy: `Evaluate the factual accuracy (0-10):
Question: {question}
Answer: {answer}
Ground Truth: {groundTruth}
Score:`,
coherence: `Rate the coherence and clarity (0-10):
Text: {text}
Score:`,
};
async evaluateResponse(
question: string,
answer: string,
groundTruth?: string
): Promise<QualityScores> {
const evaluations = await Promise.all([
this.evaluateMetric('relevance', { question, answer }),
groundTruth ?
this.evaluateMetric('accuracy', { question, answer, groundTruth }) :
Promise.resolve(null),
this.evaluateMetric('coherence', { text: answer }),
]);
return {
relevance: evaluations[0],
accuracy: evaluations[1],
coherence: evaluations[2],
overall: this.calculateOverallScore(evaluations),
};
}
private async evaluateMetric(
metric: string,
params: Record<string, string>
): Promise<number> {
const prompt = this.fillTemplate(
this.evaluationPrompts[metric],
params
);
const response = await this.llm.complete(prompt);
return this.parseScore(response);
}
private calculateOverallScore(scores: (number | null)[]): number {
const validScores = scores.filter(s => s !== null) as number[];
return validScores.reduce((a, b) => a + b, 0) / validScores.length;
}
}
Performance Monitoring#
Latency Optimization Strategies#
import { LRUCache } from 'lru-cache';
class PerformanceOptimizer {
private cache = new LRUCache<string, CachedResponse>({
max: 1000,
ttl: 1000 * 60 * 60, // 1 hour
});
async optimizedComplete(prompt: string, options: any) {
// 1. Check cache first
const cacheKey = this.generateCacheKey(prompt, options);
const cached = this.cache.get(cacheKey);
if (cached && !this.shouldInvalidate(cached)) {
this.metrics.increment('cache_hits');
return cached.response;
}
// 2. Use streaming for long responses
if (options.maxTokens > 500) {
return this.streamingComplete(prompt, options);
}
// 3. Parallel processing for multiple prompts
if (Array.isArray(prompt)) {
return this.batchComplete(prompt, options);
}
// 4. Standard completion with timeout
const timeout = options.timeout || 30000;
const result = await Promise.race([
this.llm.complete(prompt, options),
this.timeout(timeout),
]);
// Cache successful responses
this.cache.set(cacheKey, {
response: result,
timestamp: Date.now(),
tokens: result.usage?.total_tokens || 0,
});
return result;
}
// Async generator: yields chunks as they arrive, returns the full text at the end
private async *streamingComplete(prompt: string, options: any) {
const stream = await this.llm.createStream(prompt, options);
const chunks: string[] = [];
let firstTokenTime: number | null = null;
for await (const chunk of stream) {
if (!firstTokenTime) {
firstTokenTime = Date.now();
this.metrics.record('time_to_first_token', firstTokenTime);
}
chunks.push(chunk);
// Yield chunks for real-time processing
yield chunk;
}
return chunks.join('');
}
}
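The optimizer relies on generateCacheKey, which isn't shown. One plausible implementation hashes the prompt together with the options that influence the output, so identical requests hit the same cache entry:
import { createHash } from 'node:crypto';

// Identical prompt + generation options => identical cache key
function generateCacheKey(
  prompt: string,
  options: { model?: string; temperature?: number; maxTokens?: number }
): string {
  const payload = JSON.stringify({
    prompt,
    model: options.model,
    temperature: options.temperature,
    maxTokens: options.maxTokens,
  });
  return createHash('sha256').update(payload).digest('hex');
}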
Alert Strategies#
Multi-Level Alert System#
class AlertManager {
private alertChannels: Map<AlertSeverity, AlertChannel[]> = new Map([
['critical', [new PagerDutyChannel(), new SlackChannel('#critical-alerts')]],
['warning', [new SlackChannel('#llm-alerts'), new EmailChannel()]],
['info', [new LogChannel()]],
]);
private alertRules: AlertRule[] = [
{
name: 'high_error_rate',
condition: (metrics) => metrics.errorRate > 0.05,
severity: 'critical',
message: 'LLM error rate exceeds 5%',
},
{
name: 'high_latency',
condition: (metrics) => metrics.p95Latency > 5000,
severity: 'warning',
message: 'P95 latency exceeds 5 seconds',
},
{
name: 'cost_spike',
condition: (metrics) => metrics.hourlyCost > metrics.avgHourlyCost * 2,
severity: 'warning',
message: 'Hourly cost doubled compared to average',
},
{
name: 'quality_degradation',
condition: (metrics) => metrics.qualityScore < 0.7,
severity: 'critical',
message: 'Quality score below threshold',
},
];
async checkAlerts(currentMetrics: Metrics) {
for (const rule of this.alertRules) {
if (rule.condition(currentMetrics)) {
await this.sendAlert({
rule: rule.name,
severity: rule.severity,
message: rule.message,
metrics: currentMetrics,
timestamp: new Date(),
});
}
}
}
private async sendAlert(alert: Alert) {
const channels = this.alertChannels.get(alert.severity) || [];
await Promise.all(
channels.map(channel =>
channel.send(alert).catch(err =>
console.error(`Failed to send alert via ${channel.name}:`, err)
)
)
);
// Store alert history
await this.storeAlertHistory(alert);
}
}
Debugging Production Issues#
Request Tracing and Replay#
class LLMDebugger {
private traceStore: TraceStore;
async captureTrace(request: LLMRequest): Promise<string> {
const traceId = generateTraceId();
const trace: LLMTrace = {
traceId,
timestamp: new Date(),
request: {
prompt: request.prompt,
model: request.model,
parameters: request.parameters,
headers: this.sanitizeHeaders(request.headers),
},
response: null,
error: null,
metadata: {
userId: request.userId,
sessionId: request.sessionId,
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
},
timeline: [],
};
// Store initial trace
await this.traceStore.save(trace);
return traceId;
}
async replayRequest(traceId: string, options?: {
modifyRequest?: (req: any) => any;
compareResponses?: boolean;
}): Promise<ReplayResult> {
const trace = await this.traceStore.get(traceId);
if (!trace) {
throw new Error(`Trace ${traceId} not found`);
}
// Modify request if needed
const request = options?.modifyRequest
? options.modifyRequest(trace.request)
: trace.request;
// Replay the request
const startTime = Date.now();
try {
const response = await this.llm.complete(request);
const result: ReplayResult = {
success: true,
response,
latency: Date.now() - startTime,
originalTrace: trace,
};
// Compare with original if requested
if (options?.compareResponses && trace.response) {
result.comparison = this.compareResponses(
trace.response,
response
);
}
return result;
} catch (error) {
return {
success: false,
error: error.message,
latency: Date.now() - startTime,
originalTrace: trace,
};
}
}
private compareResponses(original: any, replay: any) {
return {
textSimilarity: this.calculateSimilarity(
original.content,
replay.content
),
tokenDifference: {
prompt: replay.usage.prompt_tokens - original.usage.prompt_tokens,
completion: replay.usage.completion_tokens - original.usage.completion_tokens,
},
costDifference: this.calculateCostDifference(original, replay),
};
}
}
Common Issues and Solutions#
Issue 1: Token Limit Exceeded#
Symptoms: 400 errors with “max_tokens exceeded” message
Solutions:
class TokenManager {
async handleTokenLimit(prompt: string, maxTokens: number) {
const estimatedTokens = this.estimateTokens(prompt);
if (estimatedTokens > maxTokens) {
// Strategy 1: Truncate the prompt (simplest, but drops trailing context):
// return this.llm.complete(
//   this.truncateToTokenLimit(prompt, maxTokens * 0.8) // Leave room for completion
// );
// Strategy 2 (used below): Split into chunks and merge the results
const chunks = this.splitIntoChunks(prompt, maxTokens * 0.5);
const results = await Promise.all(
chunks.map(chunk => this.processChunk(chunk))
);
return this.mergeResults(results);
}
return this.llm.complete(prompt);
}
}
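estimateTokens is left undefined above. A rough, dependency-free approximation (about four characters per token for English-like text) is usually good enough for guard rails; a real tokenizer such as tiktoken is more accurate but heavier:
// Rough estimate: ~4 characters per token for English-like text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}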
Issue 2: Rate Limiting#
Symptoms: 429 errors, “Rate limit exceeded”
Solutions:
import PQueue from 'p-queue';
class RateLimitHandler {
private queues = new Map<string, PQueue>();
getQueue(model: string): PQueue {
if (!this.queues.has(model)) {
this.queues.set(model, new PQueue({
concurrency: this.getConcurrencyLimit(model),
interval: 60000, // 1 minute
intervalCap: this.getIntervalCap(model),
}));
}
return this.queues.get(model)!;
}
async executeWithRetry(
fn: () => Promise<any>,
options: {
maxRetries: number;
backoffMultiplier: number;
}
) {
let lastError;
for (let i = 0; i <= options.maxRetries; i++) {
try {
return await fn();
} catch (error) {
if (error.status === 429) {
const delay = this.calculateBackoff(i, options.backoffMultiplier);
await this.sleep(delay);
lastError = error;
} else {
throw error; // Non-retryable error
}
}
}
throw lastError;
}
}
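Putting the two pieces together, each call is funneled through the per-model queue and retried with backoff on 429s (sketch, reusing the OpenAI client from the Quick Start):
const handler = new RateLimitHandler();

async function rateLimitedCompletion(prompt: string) {
  return handler.getQueue('gpt-4').add(() =>
    handler.executeWithRetry(
      () =>
        openai.chat.completions.create({
          model: 'gpt-4',
          messages: [{ role: 'user', content: prompt }],
        }),
      { maxRetries: 3, backoffMultiplier: 2 }
    )
  );
}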
Issue 3: Inconsistent Quality#
Symptoms: Varying response quality, user complaints
Solutions:
class QualityAssurance {
async ensureQuality(
prompt: string,
options: QualityOptions
): Promise<string> {
// Strategy 1: Multiple generations + selection
if (options.useConsensus) {
const responses = await Promise.all(
Array(3).fill(null).map(() =>
this.llm.complete(prompt, { temperature: 0.7 })
)
);
return this.selectBestResponse(responses);
}
// Strategy 2: Self-critique and revision
if (options.useSelfCritique) {
const initial = await this.llm.complete(prompt);
const critique = await this.llm.complete(
`Critique this response: ${initial}`
);
const revised = await this.llm.complete(
`Revise this response based on the critique.\nResponse: ${initial}\nCritique: ${critique}`
);
return revised;
}
// Strategy 3: Validation loop
let response;
let attempts = 0;
do {
response = await this.llm.complete(prompt);
const isValid = await this.validateResponse(response, options);
if (isValid) break;
attempts++;
prompt = this.refinePrompt(prompt, response);
} while (attempts < options.maxAttempts);
return response;
}
}
FAQ#
What metrics are most important for LLM monitoring?#
The critical metrics to track are:
- Latency: Response time (p50, p95, p99)
- Cost: Token usage and $ spent per request/user
- Error Rate: Failed requests and error types
- Quality: User satisfaction and response accuracy
- Token Usage: Input/output token distribution
How do I monitor LLMs without storing sensitive data?#
// Best practices for privacy-preserving monitoring
const privacyConfig = {
// Hash prompts instead of storing raw text
storePromptHash: true,
storePromptText: false,
// Aggregate metrics without user attribution
anonymizeUserIds: true,
// Redact PII before logging
enablePIIRedaction: true,
// Store only metadata
metadataOnly: true,
};
What’s the difference between Langfuse and Helicone?#
- Langfuse: Open-source, self-hostable, focuses on traces and debugging
- Helicone: Proxy-based, easier setup, better for analytics and caching
Choose Langfuse for full control and debugging, Helicone for quick setup and analytics.
How can I reduce monitoring overhead?#
- Sampling: Monitor only a percentage of requests
- Async logging: Don’t block LLM calls for monitoring
- Batch metrics: Send metrics in batches, not per-request (see the sketch after this list)
- Edge monitoring: Use Cloudflare Workers for low-latency tracking
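A minimal sketch combining sampling, non-blocking logging, and batching (the ingest URL is a placeholder):
class LowOverheadReporter {
  private buffer: any[] = [];

  constructor(private sampleRate = 0.1, private flushSize = 50) {}

  record(metric: any) {
    // Sampling: keep only a fraction of requests
    if (Math.random() > this.sampleRate) return;
    this.buffer.push(metric);
    // Batching: flush asynchronously once enough metrics accumulate
    if (this.buffer.length >= this.flushSize) {
      const batch = this.buffer.splice(0, this.buffer.length);
      void this.sendBatch(batch); // async: never block the LLM call
    }
  }

  private async sendBatch(batch: any[]) {
    await fetch('https://metrics.example.com/ingest', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    }).catch(() => {
      // Drop on failure rather than impact request latency
    });
  }
}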
Should I use OpenTelemetry for LLM monitoring?#
Yes, if you:
- Already use OpenTelemetry in your stack
- Need standardized observability
- Want vendor-neutral instrumentation
- Require distributed tracing
How do I set up alerts for LLM issues?#
Implement multi-level alerts:
const alertConfig = {
critical: {
errorRate: 0.05, // >5% errors
downtime: 60, // >1 minute down
costSpike: 3.0, // >3x normal cost
},
warning: {
latencyP95: 5000, // >5s P95 latency
qualityScore: 0.7, // <70% quality
tokenUsage: 0.8, // >80% of limit
},
};
Can I monitor local LLMs the same way?#
Yes, the patterns work for any LLM:
// Local LLM monitoring
const localLLMMonitor = new LLMMonitor({
endpoint: 'http://localhost:11434', // Ollama
metricsEndpoint: 'http://localhost:9090', // Prometheus
customMetrics: {
gpuUsage: true,
memoryUsage: true,
modelLoadTime: true,
},
});
How do I debug slow LLM responses?#
- Check token count: More tokens = slower response
- Monitor model load: Some models have cold starts
- Network latency: Check connection to LLM provider
- Rate limiting: You might be throttled
- Model selection: Larger models are slower (the streaming sketch below helps tell these causes apart)
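One practical way to narrow this down is to measure time-to-first-token separately from total time, using OpenAI streaming (same client as the Quick Start). A large gap before the first token points at queueing or cold starts, while a long tail afterwards points at output length:
async function profileCompletion(prompt: string, model = 'gpt-3.5-turbo') {
  const start = Date.now();
  let firstTokenAt: number | null = null;

  const stream = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = Date.now();
    }
  }

  console.log({
    timeToFirstTokenMs: (firstTokenAt ?? Date.now()) - start,
    totalMs: Date.now() - start,
  });
}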
Best Practices#
1. Privacy and Security#
import crypto from 'node:crypto';
class PrivacyPreservingLogger {
private sensitivePatterns = [
/\b\d{3}-\d{2}-\d{4}\b/g, // SSN
/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, // Email
/\b(?:\d{4}[\s-]?){3}\d{4}\b/g, // Credit card
];
sanitize(text: string): string {
let sanitized = text;
for (const pattern of this.sensitivePatterns) {
sanitized = sanitized.replace(pattern, '[REDACTED]');
}
return sanitized;
}
logLLMInteraction(interaction: {
prompt: string;
response: string;
metadata: any;
}) {
const sanitized = {
prompt: this.sanitize(interaction.prompt),
response: this.sanitize(interaction.response),
metadata: {
...interaction.metadata,
promptHash: this.hash(interaction.prompt),
responseHash: this.hash(interaction.response),
},
};
// Log sanitized version
this.logger.info('LLM Interaction', sanitized);
}
private hash(text: string): string {
return crypto
.createHash('sha256')
.update(text)
.digest('hex');
}
}
2. Cost Control#
class CostController {
private limits: Map<string, { daily: number; monthly: number }> = new Map();
private usage: Map<string, { daily: number; monthly: number }> = new Map();
async checkLimit(userId: string, estimatedCost: number): Promise<{
allowed: boolean;
reason?: string;
remainingBudget?: number;
}> {
const userLimits = this.limits.get(userId);
const userUsage = this.usage.get(userId) || { daily: 0, monthly: 0 };
if (!userLimits) {
return { allowed: true }; // No limits set
}
// Check daily limit
if (userUsage.daily + estimatedCost > userLimits.daily) {
return {
allowed: false,
reason: 'Daily limit exceeded',
remainingBudget: Math.max(0, userLimits.daily - userUsage.daily),
};
}
// Check monthly limit
if (userUsage.monthly + estimatedCost > userLimits.monthly) {
return {
allowed: false,
reason: 'Monthly limit exceeded',
remainingBudget: Math.max(0, userLimits.monthly - userUsage.monthly),
};
}
return {
allowed: true,
remainingBudget: Math.min(
userLimits.daily - userUsage.daily,
userLimits.monthly - userUsage.monthly
),
};
}
}
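In the request path, consult the controller before making the call; a minimal usage sketch, reusing the BasicLLMMonitor from the Quick Start and a rough pre-call cost estimate:
const costController = new CostController();

async function budgetedCompletion(userId: string, prompt: string) {
  // Rough pre-call estimate (~4 chars/token at gpt-3.5-turbo rates);
  // record the actual cost from the usage field after the response
  const estimatedCost = (prompt.length / 4 / 1000) * 0.002;

  const check = await costController.checkLimit(userId, estimatedCost);
  if (!check.allowed) {
    throw new Error(`LLM budget exceeded: ${check.reason}`);
  }

  return monitor.monitoredCompletion(prompt);
}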
Conclusion#
Effective LLM observability is crucial for running AI applications in production. Key takeaways:
- Monitor comprehensively: Track performance, cost, quality, and errors
- Use the right tools: Langfuse for traces, Helicone for analytics, Sentry for errors
- Build custom monitoring: Tailor metrics to your specific use case
- Automate analysis: Use scheduled jobs to detect trends and anomalies
- Protect privacy: Always sanitize sensitive data before logging
- Control costs: Implement budget limits and alerts
By implementing robust observability, you can ensure your LLM applications are reliable, cost-effective, and deliver value to users.