
Observability

Observability is the ability to understand the internal state of a system by examining its outputs. This guide covers the three pillars, instrumentation strategies, and debugging production systems.

Three Pillars of Observability

```
┌─────────────────────────────────────────────────────────────┐
│                   Observability Pillars                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │     Logs     │   │   Metrics    │   │    Traces    │     │
│  │              │   │              │   │              │     │
│  │ What happened│   │  How much/   │   │   Where in   │     │
│  │   (events)   │   │   how fast   │   │  the system  │     │
│  │              │   │  (numbers)   │   │  (journey)   │     │
│  └──────────────┘   └──────────────┘   └──────────────┘     │
│         │                  │                  │             │
│         ▼                  ▼                  ▼             │
│   Debug errors      Alert on issues    Find bottlenecks     │
│   Audit trail       Trend analysis     Cross-service        │
│   Forensics         Dashboards         debugging            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Logs

Discrete events with context.

Structured Logging

```
// ❌ Unstructured (hard to parse)
"2024-01-15 10:30:45 ERROR User 123 failed to checkout: payment declined"

// ✅ Structured (queryable)
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "checkout",
  "event": "checkout_failed",
  "user_id": "123",
  "order_id": "ord_456",
  "error": "payment_declined",
  "payment_provider": "stripe",
  "trace_id": "abc123"
}
```
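The structured format above can be produced with the standard library alone; a minimal sketch (the field names mirror the example and are otherwise arbitrary):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via `extra=`
        for key in ("user_id", "order_id", "error", "trace_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("checkout_failed", extra={"user_id": "123", "error": "payment_declined"})
```

In practice a library such as python-json-logger does the same job; the point is that every field becomes queryable instead of being buried in a format string.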

Log Levels

| Level | Usage | Example |
|-------|-------|---------|
| DEBUG | Development details | Variable values, flow |
| INFO | Normal operations | Request received, job started |
| WARN | Unexpected but handled | Retry succeeded, deprecated API |
| ERROR | Failure requiring attention | Payment failed, DB connection lost |
| FATAL | System cannot continue | Out of memory, config missing |

Logging Best Practices

```python
# ✅ Good logging
logger.info("Order created", extra={
    "order_id": order.id,
    "user_id": user.id,
    "total": order.total,
    "items_count": len(order.items)
})

# ❌ Bad logging
logger.info(f"Order {order.id} created for user {user.id}")  # Not structured
logger.info("Order created")                                 # No context
logger.debug(f"Processing order: {order.__dict__}")          # Too verbose/sensitive
```

Log Aggregation Architecture

```
┌─────────┐   ┌─────────┐   ┌─────────┐
│Service A│   │Service B│   │Service C│
└────┬────┘   └────┬────┘   └────┬────┘
     │             │             │
     └─────────────┼─────────────┘
                   │
            ┌──────▼──────┐
            │    Log      │  (Fluentd, Logstash, Vector)
            │  Shipper    │
            └──────┬──────┘
            ┌──────▼──────┐
            │  Message    │  (Kafka, Kinesis)
            │   Queue     │
            └──────┬──────┘
            ┌──────▼──────┐
            │   Search    │  (Elasticsearch, Loki)
            │   Store     │
            └──────┬──────┘
            ┌──────▼──────┐
            │  Dashboard  │  (Kibana, Grafana)
            │  / Query    │
            └─────────────┘
```

Metrics

Numeric measurements over time.

Metric Types

| Type | Description | Example |
|------|-------------|---------|
| Counter | Cumulative, only increases | Total requests, errors |
| Gauge | Current value, can go up/down | Active connections, queue size |
| Histogram | Distribution of values | Request latency buckets |
| Summary | Calculated percentiles | P50, P90, P99 latency |
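To make the histogram row concrete, here is a toy histogram with cumulative buckets in the style Prometheus exposes them (the bucket boundaries are illustrative):

```python
class ToyHistogram:
    """Cumulative buckets, as exposed by Prometheus histograms."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)               # upper bounds, in seconds
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf bucket
        self.total = 0.0
        self.observations = 0

    def observe(self, value):
        self.observations += 1
        self.total += value
        # Every bucket whose bound >= value is incremented (cumulative)
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[i] += 1
        self.counts[-1] += 1  # +Inf bucket counts everything

hist = ToyHistogram([0.05, 0.1, 0.5])
for latency in [0.02, 0.07, 0.3, 1.2]:
    hist.observe(latency)
# counts are cumulative: le=0.05 → 1, le=0.1 → 2, le=0.5 → 3, +Inf → 4
```

Cumulative buckets are what make server-side quantile estimation (`histogram_quantile`) and cheap aggregation across instances possible.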

Prometheus Metrics

```python
from flask import Flask  # assumed web framework, implied by the route decorator
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage
@app.route('/api/users')
def get_users():
    with request_latency.labels(method='GET', endpoint='/api/users').time():
        result = fetch_users()
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    return result
```

RED Method (Request-focused)

| Metric | Description |
|--------|-------------|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Latency distribution |
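Assuming the `http_requests_total` counter and `http_request_duration_seconds` histogram from the Prometheus example above, the three RED signals can be sketched as PromQL queries:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: failed requests per second
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Duration: P99 latency estimated from histogram buckets
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```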

USE Method (Resource-focused)

| Metric | Description |
|--------|-------------|
| Utilization | % time resource is busy |
| Saturation | Queue depth, waiting work |
| Errors | Error events |

Key Metrics to Track

| Category | Metrics |
|----------|---------|
| HTTP | Request rate, error rate, latency (P50, P90, P99) |
| Database | Query time, connection pool, slow queries |
| Cache | Hit rate, evictions, memory usage |
| Queue | Depth, processing time, dead letters |
| System | CPU, memory, disk, network |

Distributed Tracing

Follow requests across service boundaries.

Trace Structure

```
Trace (entire request journey):
├── Span: API Gateway (15ms)
│   ├── Span: Auth Service (5ms)
│   └── Span: Order Service (10ms)
│       ├── Span: Inventory Check (3ms)
│       ├── Span: Payment Service (6ms)
│       │   └── Span: Stripe API (5ms)
│       └── Span: Database Write (1ms)
```

Trace Context Propagation

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Service A  │────▶│  Service B  │────▶│  Service C  │
│             │     │             │     │             │
│ trace_id:123│     │ trace_id:123│     │ trace_id:123│
│ span_id: a  │     │ span_id: b  │     │ span_id: c  │
│ parent: -   │     │ parent: a   │     │ parent: b   │
└─────────────┘     └─────────────┘     └─────────────┘

HTTP Headers:
  traceparent: 00-trace_id-span_id-01
  tracestate: vendor=value
```
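A minimal sketch of parsing the W3C `traceparent` header shown above (validation is deliberately simplified relative to the spec):

```python
def parse_traceparent(header):
    """Split a W3C traceparent header into its four fields.

    Format: version-trace_id-parent_span_id-flags
    (2 + 32 + 16 + 2 hex chars, dash-separated).
    """
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        # Simplified: the spec defines flags as a bit field; this only
        # recognizes the common "sampled" value.
        "sampled": flags == "01",
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Each downstream service keeps the `trace_id`, generates a fresh `span_id`, and records the caller's span as `parent`.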

OpenTelemetry

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Usage
@app.route('/api/orders')
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user_id", user.id)

        with tracer.start_as_current_span("validate_inventory"):
            inventory = check_inventory(items)

        with tracer.start_as_current_span("process_payment"):
            payment = charge_customer(total)

    return order
```

Tracing Architecture

```
┌─────────┐   ┌─────────┐   ┌─────────┐
│Service A│   │Service B│   │Service C│
│  +SDK   │   │  +SDK   │   │  +SDK   │
└────┬────┘   └────┬────┘   └────┬────┘
     │             │             │
     └─────────────┼─────────────┘
                   │
            ┌──────▼──────┐
            │  Collector  │  (OpenTelemetry Collector)
            └──────┬──────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
 ┌─────────┐  ┌─────────┐  ┌─────────┐
 │ Jaeger  │  │ Zipkin  │  │  Tempo  │
 └─────────┘  └─────────┘  └─────────┘
```

Correlating Signals

Connect logs, metrics, and traces.

```
┌─────────────────────────────────────────────────────────────┐
│                    Correlation Strategy                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Request comes in with trace_id: abc123                     │
│                                                             │
│  Logs:                                                      │
│    {"trace_id": "abc123", "event": "order_created", ...}    │
│                                                             │
│  Metrics:                                                   │
│    http_requests_total{trace_id="abc123", status="200"}     │
│                                                             │
│  Traces:                                                    │
│    Span[trace_id=abc123, name="create_order"]               │
│                                                             │
│  Exemplars (link metrics to traces):                        │
│    histogram_bucket{le="0.1"} 45 # {trace_id="abc123"}      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
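One common way to get the `trace_id` into every log line is a logging filter that reads it from request-scoped context. A stdlib-only sketch (in practice the id would come from your tracing SDK rather than being set by hand):

```python
import contextvars
import logging

# Request-scoped storage; survives across async tasks within a request
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Attach the active trace_id to every record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123")  # set once at the start of each request
logger.info("order_created")    # log line now carries trace_id=abc123
```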

Debugging Flow

1. Alert fires: "P99 latency > 500ms"
2. Check metrics dashboard
   • Which endpoint? /api/orders
   • When did it start? 10:30 AM
   • Which service? Order Service
3. Find exemplar trace
   • Click on slow data point
   • Get trace_id from exemplar
4. View trace
   • See span breakdown
   • Identify slow span: Payment Service (450ms)
5. Query logs with trace_id
   • Find error: "Payment provider timeout"
   • See retry attempts
6. Root cause identified

Alerting

Alert Design Principles

| Principle | Description |
|-----------|-------------|
| Actionable | Every alert requires action |
| Meaningful | Based on user impact, not internal metrics |
| Timely | Not too early (noise) or too late (damage done) |
| Understandable | Clear what’s wrong and what to do |

SLO-Based Alerting

```yaml
# Alert on error budget consumption rate
- alert: ErrorBudgetBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)  # 14.4x burn rate
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning too fast"
    description: "At this rate, the monthly error budget is exhausted in ~2 days"
```

Multi-Window Alerting

```
Fast burn (2% budget in 1 hour):
  Window: 1 hour
  Threshold: 14.4x normal error rate

Slow burn (5% budget in 6 hours):
  Window: 6 hours
  Threshold: 6x normal error rate

Both must fire to avoid false positives
```
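A hypothetical Prometheus rule combining the two windows above with `and`, so the alert fires only when both burn rates are exceeded (thresholds assume a 99.9% SLO, i.e. a 0.001 error budget):

```yaml
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
        / sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "Fast and slow burn-rate windows both exceeded"
```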

Alert Fatigue Prevention

| Strategy | Implementation |
|----------|----------------|
| Aggregate | Group related alerts |
| Deduplicate | Same alert once until resolved |
| Route | Right team, right channel |
| Escalate | Auto-escalate if unacknowledged |
| Review | Regular alert hygiene |
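The deduplicate row can be illustrated with a tiny in-memory sketch: notify once per alert fingerprint and suppress repeats until it resolves (the class and fingerprint format are made up):

```python
class AlertDeduplicator:
    """Suppress repeat notifications for an already-firing alert."""
    def __init__(self):
        self.firing = set()

    def should_notify(self, fingerprint):
        # Notify only on the transition from resolved -> firing
        if fingerprint in self.firing:
            return False
        self.firing.add(fingerprint)
        return True

    def resolve(self, fingerprint):
        self.firing.discard(fingerprint)

dedup = AlertDeduplicator()
dedup.should_notify("latency:/api/orders")  # True: first firing
dedup.should_notify("latency:/api/orders")  # False: still firing
dedup.resolve("latency:/api/orders")
dedup.should_notify("latency:/api/orders")  # True: fired again after resolving
```

Real alert managers (e.g. Alertmanager) apply the same idea, keyed on the alert's label set.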

Dashboards

Dashboard Types

| Type | Purpose | Refresh |
|------|---------|---------|
| Overview | System health at a glance | 1 min |
| Service | Deep dive into single service | 30 sec |
| Incident | Real-time during incidents | 5 sec |
| Business | KPIs for stakeholders | 5 min |

Effective Dashboard Design

```
┌─────────────────────────────────────────────────────────────┐
│                  Order Service Dashboard                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Top Row: Key Metrics (RED)                                 │
│  ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐       │
│  │Requests/s│ │Error Rate│ │P99 Latency│ │SLO Status│       │
│  │  1,234   │ │   0.1%   │ │   145ms   │ │  99.92%  │       │
│  └──────────┘ └──────────┘ └───────────┘ └──────────┘       │
│                                                             │
│  Middle: Time Series                                        │
│  ┌────────────────────────────────────────────────────┐     │
│  │  Request Rate & Errors                             │     │
│  │  ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁                                   │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
│  Bottom: Breakdowns                                         │
│  ┌─────────────────────┐  ┌─────────────────────┐           │
│  │ By Endpoint         │  │ By Status Code      │           │
│  │  /orders     45%    │  │  200    98.5%       │           │
│  │  /users      30%    │  │  400     1.0%       │           │
│  │  /products   25%    │  │  500     0.5%       │           │
│  └─────────────────────┘  └─────────────────────┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Observability Stack

Common Tools

| Category | Tools |
|----------|-------|
| Logs | Elasticsearch, Loki, Splunk, Datadog |
| Metrics | Prometheus, InfluxDB, Datadog, CloudWatch |
| Traces | Jaeger, Zipkin, Tempo, Datadog, X-Ray |
| All-in-one | Datadog, New Relic, Dynatrace, Grafana Cloud |
| Collector | OpenTelemetry Collector, Fluent Bit |
| Visualization | Grafana, Kibana |

OpenTelemetry Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       Application                           │
│  ┌────────────────────────────────────────────────────┐     │
│  │               OpenTelemetry SDK                    │     │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │     │
│  │  │  Traces  │  │ Metrics  │  │   Logs   │          │     │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘          │     │
│  └───────┼─────────────┼─────────────┼────────────────┘     │
│          └─────────────┼─────────────┘                      │
│                        ▼                                    │
│               ┌─────────────────┐                           │
│               │  OTLP Exporter  │                           │
│               └────────┬────────┘                           │
└────────────────────────┼────────────────────────────────────┘
                         │
                ┌────────▼────────┐
                │  OTel Collector │
                │  - Receive      │
                │  - Process      │
                │  - Export       │
                └────────┬────────┘
         ┌───────────────┼───────────────┐
         ▼               ▼               ▼
    ┌─────────┐    ┌──────────┐    ┌─────────┐
    │ Jaeger  │    │Prometheus│    │  Loki   │
    └─────────┘    └──────────┘    └─────────┘
```

Interview Quick Reference

Common Questions

1. “How would you debug a slow API endpoint?”
   • Check metrics: Which endpoint, when did it start?
   • Find slow traces: Identify bottleneck span
   • Correlate logs: Error messages, context
   • Root cause: Database, external service, CPU?

2. “What metrics would you track for a new service?”
   • RED: Rate, Errors, Duration
   • Saturation: Queue depth, thread pool
   • Dependencies: Downstream latency, errors
   • Business: Orders/sec, revenue

3. “How do you handle alert fatigue?”
   • SLO-based alerting (user impact)
   • Multi-window burn rate
   • Aggregate and deduplicate
   • Regular alert review

Observability Checklist

  • Structured logging with trace correlation?
  • Key metrics (RED/USE) instrumented?
  • Distributed tracing enabled?
  • Dashboards for each service?
  • SLO-based alerting?
  • Runbooks for common issues?
  • Log retention policy defined?
  • Trace sampling strategy?