Observability
Observability is the ability to understand the internal state of a system by examining its outputs. This guide covers the three pillars, instrumentation strategies, and debugging production systems.
Three Pillars of Observability
┌─────────────────────────────────────────────────────────────┐
│ Observability Pillars │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Logs │ │ Metrics │ │ Traces │ │
│ │ │ │ │ │ │ │
│ │ What happened│ │ How much/ │ │ Where in │ │
│ │ (events) │ │ how fast │ │ the system │ │
│ │ │ │ (numbers) │ │ (journey) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Debug errors Alert on issues Find bottlenecks │
│ Audit trail Trend analysis Cross-service │
│ Forensics Dashboards debugging │
│ │
└─────────────────────────────────────────────────────────────┘
Logs
Discrete events with context.
Structured Logging
// ❌ Unstructured (hard to parse)
"2024-01-15 10:30:45 ERROR User 123 failed to checkout: payment declined"
// ✅ Structured (queryable)
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "ERROR",
  "service": "checkout",
  "event": "checkout_failed",
  "user_id": "123",
  "order_id": "ord_456",
  "error": "payment_declined",
  "payment_provider": "stripe",
  "trace_id": "abc123"
}
Log Levels
| Level | Usage | Example |
|---|---|---|
| DEBUG | Development details | Variable values, flow |
| INFO | Normal operations | Request received, job started |
| WARN | Unexpected but handled | Retry succeeded, deprecated API |
| ERROR | Failure requiring attention | Payment failed, DB connection lost |
| FATAL | System cannot continue | Out of memory, config missing |
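These levels map directly onto Python's stdlib `logging` module. A minimal structured JSON formatter sketch (the field names and the fixed context-key list are illustrative, not a standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (illustrative field set)."""
    # Hypothetical context keys we promote from `extra=` into the JSON output
    CONTEXT_KEYS = ("user_id", "order_id", "trace_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge any structured context passed via `extra=`
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one queryable JSON line instead of an interpolated string
logger.info("checkout_failed", extra={"user_id": "123", "trace_id": "abc123"})
```

Production setups usually reach for a library (e.g. `structlog` or `python-json-logger`) rather than a hand-rolled formatter, but the shape of the output is the same.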
Logging Best Practices
# ✅ Good logging
logger.info("Order created", extra={
    "order_id": order.id,
    "user_id": user.id,
    "total": order.total,
    "items_count": len(order.items)
})

# ❌ Bad logging
logger.info(f"Order {order.id} created for user {user.id}")  # Not structured
logger.info("Order created")  # No context
logger.debug(f"Processing order: {order.__dict__}")  # Too verbose/sensitive
Log Aggregation Architecture
┌─────────┐ ┌─────────┐ ┌─────────┐
│Service A│ │Service B│ │Service C│
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ Log │ (Fluentd, Logstash, Vector)
│ Shipper │
└──────┬──────┘
│
┌──────▼──────┐
│ Message │ (Kafka, Kinesis)
│ Queue │
└──────┬──────┘
│
┌──────▼──────┐
│ Search │ (Elasticsearch, Loki)
│ Store │
└──────┬──────┘
│
┌──────▼──────┐
│ Dashboard │ (Kibana, Grafana)
│ / Query │
└─────────────┘
Metrics
Numeric measurements over time.
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Cumulative, only increases | Total requests, errors |
| Gauge | Current value, can go up/down | Active connections, queue size |
| Histogram | Distribution of values | Request latency buckets |
| Summary | Calculated percentiles | P50, P90, P99 latency |
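To make the histogram row concrete, here is a hand-rolled sketch of Prometheus-style cumulative buckets (not the Prometheus client, just the idea: each bucket counts every observation at or below its bound):

```python
# Cumulative buckets, Prometheus-style: bucket `le=0.1` counts all observations <= 0.1s
BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]

def observe(counts, value, buckets=BUCKETS):
    """Increment every bucket whose upper bound covers the value."""
    for bound in buckets:
        if value <= bound:
            counts[bound] = counts.get(bound, 0) + 1
    counts["+Inf"] = counts.get("+Inf", 0) + 1  # every observation lands here
    return counts

counts = {}
for latency in [0.004, 0.02, 0.3, 2.0]:  # hypothetical request latencies in seconds
    observe(counts, latency)

# counts["+Inf"] is the total observation count; percentiles are
# estimated later from the bucket counts on the query side.
```

The cumulative shape is why histograms aggregate cheaply across instances: adding two bucket counters is still a valid bucket counter, which a pre-computed summary percentile is not.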
Prometheus Metrics
from prometheus_client import Counter, Histogram, Gauge

# Counter
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram
request_latency = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage
@app.route('/api/users')
def get_users():
    with request_latency.labels(method='GET', endpoint='/api/users').time():
        result = fetch_users()
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
    return result
RED Method (Request-focused)
| Metric | Description |
|---|---|
| Rate | Requests per second |
| Errors | Failed requests per second |
| Duration | Latency distribution |
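The three RED numbers can be sketched as a toy computation over raw request records (the `(status, duration)` data shape is hypothetical; real systems derive these from counters and histograms, not raw lists):

```python
def red_metrics(requests, window_seconds):
    """requests: list of (status_code, duration_seconds) tuples within the window."""
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(d for _, d in requests)
    # Nearest-rank P99 (simplistic; production systems estimate from histogram buckets)
    p99 = durations[min(total - 1, int(total * 0.99))] if durations else 0.0
    return {
        "rate": total / window_seconds,          # requests per second
        "error_rate": errors / window_seconds,   # failed requests per second
        "p99_seconds": p99,                      # duration distribution tail
    }

m = red_metrics([(200, 0.05), (200, 0.08), (500, 0.40), (200, 0.06)],
                window_seconds=2)
```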
USE Method (Resource-focused)
| Metric | Description |
|---|---|
| Utilization | % time resource is busy |
| Saturation | Queue depth, waiting work |
| Errors | Error events |
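For a resource like a worker pool, a USE reading condenses to three numbers; a toy snapshot (hypothetical figures):

```python
def use_snapshot(busy_workers, total_workers, queue_depth, errors):
    """Condense a worker pool's state into USE terms."""
    return {
        "utilization": busy_workers / total_workers,  # fraction of capacity busy
        "saturation": queue_depth,                     # work waiting for a free worker
        "errors": errors,                              # error events observed
    }

s = use_snapshot(busy_workers=9, total_workers=10, queue_depth=42, errors=3)
# High utilization with a deep queue signals saturation before latency alerts fire.
```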
Key Metrics to Track
| Category | Metrics |
|---|---|
| HTTP | Request rate, error rate, latency (P50, P90, P99) |
| Database | Query time, connection pool, slow queries |
| Cache | Hit rate, evictions, memory usage |
| Queue | Depth, processing time, dead letters |
| System | CPU, memory, disk, network |
Distributed Tracing
Follow requests across service boundaries.
Trace Structure
Trace (entire request journey):
├── Span: API Gateway (15ms)
│ ├── Span: Auth Service (5ms)
│ └── Span: Order Service (10ms)
│ ├── Span: Inventory Check (3ms)
│ ├── Span: Payment Service (6ms)
│ │ └── Span: Stripe API (5ms)
│ └── Span: Database Write (1ms)
Trace Context Propagation
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Service A │────▶│ Service B │────▶│ Service C │
│ │ │ │ │ │
│ trace_id:123│ │ trace_id:123│ │ trace_id:123│
│ span_id: a │ │ span_id: b │ │ span_id: c │
│ parent: - │ │ parent: a │ │ parent: b │
└─────────────┘ └─────────────┘ └─────────────┘
HTTP Headers:
traceparent: 00-trace_id-span_id-01
tracestate: vendor=value
OpenTelemetry
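OpenTelemetry propagates context via the W3C `traceparent` header shown above. To make the format concrete, a minimal parser sketch (a hypothetical helper, not part of the OpenTelemetry API; the example IDs are from the W3C spec):

```python
def parse_traceparent(header):
    """Split a W3C traceparent header into its four hex fields.

    Format: version-trace_id-parent_span_id-trace_flags
    """
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,      # same across every service in the request
        "span_id": span_id,        # the caller's span, i.e. our parent
        "sampled": flags == "01",  # sampling decision rides along with the trace
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

In practice the SDK's propagators inject and extract this header for you; you only touch it directly when bridging systems that are not instrumented.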
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Usage
@app.route('/api/orders')
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user_id", user.id)
        with tracer.start_as_current_span("validate_inventory"):
            inventory = check_inventory(items)
        with tracer.start_as_current_span("process_payment"):
            payment = charge_customer(total)
        return order
Tracing Architecture
┌─────────┐ ┌─────────┐ ┌─────────┐
│Service A│ │Service B│ │Service C│
│ +SDK │ │ +SDK │ │ +SDK │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────┼────────────┘
│
┌──────▼──────┐
│ Collector │ (OpenTelemetry Collector)
│ │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Jaeger │ │ Zipkin │ │ Tempo │
└─────────┘ └─────────┘ └─────────┘
Correlating Signals
Connect logs, metrics, and traces.
┌─────────────────────────────────────────────────────────────┐
│ Correlation Strategy │
├─────────────────────────────────────────────────────────────┤
│ │
│ Request comes in with trace_id: abc123 │
│ │
│ Logs: │
│ {"trace_id": "abc123", "event": "order_created", ...} │
│ │
│ Metrics: │
│ http_requests_total{trace_id="abc123", status="200"} │
│ │
│ Traces: │
│ Span[trace_id=abc123, name="create_order"] │
│ │
│ Exemplars (link metrics to traces): │
│ histogram_bucket{le="0.1"} 45 # {trace_id="abc123"} │
│ │
└─────────────────────────────────────────────────────────────┘
Debugging Flow
1. Alert fires: "P99 latency > 500ms"
│
▼
2. Check metrics dashboard
- Which endpoint? /api/orders
- When did it start? 10:30 AM
- Which service? Order Service
│
▼
3. Find exemplar trace
- Click on slow data point
- Get trace_id from exemplar
│
▼
4. View trace
- See span breakdown
- Identify slow span: Payment Service (450ms)
│
▼
5. Query logs with trace_id
- Find error: "Payment provider timeout"
- See retry attempts
│
▼
6. Root cause identified
Alerting
Alert Design Principles
| Principle | Description |
|---|---|
| Actionable | Every alert requires action |
| Meaningful | Based on user impact, not internal metrics |
| Timely | Not too early (noise) or too late (damage done) |
| Understandable | Clear what’s wrong and what to do |
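SLO-based alerting (next section) rests on simple burn-rate arithmetic: burn rate is the observed error ratio divided by the budgeted error ratio. A quick sketch of the numbers behind a 99.9% SLO over a 30-day window:

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than budgeted are we consuming error budget?"""
    budget_ratio = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget_ratio

def hours_to_exhaust(rate, window_days=30):
    """At this burn rate, when does the window's budget run out?"""
    return (window_days * 24) / rate

# A 1.44% error ratio against a 99.9% SLO is a 14.4x burn:
r = burn_rate(error_ratio=0.0144, slo_target=0.999)
# 720 hours of budget / 14.4 = 50 hours, roughly two days to empty
h = hours_to_exhaust(r)
```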
SLO-Based Alerting
# Alert on error budget consumption rate
- alert: ErrorBudgetBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)  # 14.4x burn rate
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning too fast"
    description: "At this rate, the monthly error budget is exhausted in about 2 days"
Multi-Window Alerting
Fast burn (2% of budget in 1 hour):
  Window: 1 hour
  Threshold: 14.4x the budgeted error rate

Slow burn (5% of budget in 6 hours):
  Window: 6 hours
  Threshold: 6x the budgeted error rate

Both must fire to avoid false positives
Alert Fatigue Prevention
| Strategy | Implementation |
|---|---|
| Aggregate | Group related alerts |
| Deduplicate | Same alert once until resolved |
| Route | Right team, right channel |
| Escalate | Auto-escalate if unacknowledged |
| Review | Regular alert hygiene |
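The deduplicate row can be sketched as a tiny in-memory gate (hypothetical; real alert managers persist this state and also handle grouping and routing):

```python
class AlertDeduplicator:
    """Notify once per alert; suppress repeats until it is resolved."""

    def __init__(self):
        self.active = set()  # fingerprints of currently firing alerts

    def should_notify(self, fingerprint):
        if fingerprint in self.active:
            return False  # already firing: suppress the repeat
        self.active.add(fingerprint)
        return True

    def resolve(self, fingerprint):
        self.active.discard(fingerprint)  # next firing notifies again

dedup = AlertDeduplicator()
first = dedup.should_notify("HighLatency/api/orders")   # True: first occurrence
repeat = dedup.should_notify("HighLatency/api/orders")  # False: suppressed
dedup.resolve("HighLatency/api/orders")
again = dedup.should_notify("HighLatency/api/orders")   # True: re-fired after resolve
```

The fingerprint is typically derived from the alert name plus its label set, so distinct endpoints still alert independently.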
Dashboards
Dashboard Types
| Type | Purpose | Refresh |
|---|---|---|
| Overview | System health at a glance | 1 min |
| Service | Deep dive into single service | 30 sec |
| Incident | Real-time during incidents | 5 sec |
| Business | KPIs for stakeholders | 5 min |
Effective Dashboard Design
┌─────────────────────────────────────────────────────────────┐
│ Order Service Dashboard │
├─────────────────────────────────────────────────────────────┤
│ │
│ Top Row: Key Metrics (RED) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Requests/s│ │Error Rate│ │P99 Latency│ │SLO Status│ │
│ │ 1,234 │ │ 0.1% │ │ 145ms │ │ 99.92% │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Middle: Time Series │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Request Rate & Errors │ │
│ │ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁ │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Bottom: Breakdowns │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ By Endpoint │ │ By Status Code │ │
│ │ /orders 45% │ │ 200 98.5% │ │
│ │ /users 30% │ │ 400 1.0% │ │
│ │ /products 25% │ │ 500 0.5% │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Observability Stack
Common Tools
| Category | Tools |
|---|---|
| Logs | Elasticsearch, Loki, Splunk, Datadog |
| Metrics | Prometheus, InfluxDB, Datadog, CloudWatch |
| Traces | Jaeger, Zipkin, Tempo, Datadog, X-Ray |
| All-in-one | Datadog, New Relic, Dynatrace, Grafana Cloud |
| Collector | OpenTelemetry Collector, Fluent Bit |
| Visualization | Grafana, Kibana |
OpenTelemetry Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application │
│ ┌────────────────────────────────────────────────────┐ │
│ │ OpenTelemetry SDK │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Traces │ │ Metrics │ │ Logs │ │ │
│ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │
│ └───────┼────────────┼────────────┼─────────────────┘ │
│ └────────────┼────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ OTLP Exporter │ │
│ └────────┬────────┘ │
└───────────────────────┼─────────────────────────────────────┘
│
▼
┌─────────────────┐
│ OTel Collector │
│ - Receive │
│ - Process │
│ - Export │
└────────┬────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Jaeger │ │Prometheus│ │ Loki │
└─────────┘ └─────────┘ └─────────┘
Interview Quick Reference
Common Questions
- "How would you debug a slow API endpoint?"
  - Check metrics: which endpoint, and when did it start?
  - Find slow traces: identify the bottleneck span
  - Correlate logs: error messages, context
  - Root cause: database, external service, CPU?
- "What metrics would you track for a new service?"
  - RED: Rate, Errors, Duration
  - Saturation: queue depth, thread pool
  - Dependencies: downstream latency, errors
  - Business: orders/sec, revenue
- "How do you handle alert fatigue?"
  - SLO-based alerting (user impact)
  - Multi-window burn rate
  - Aggregate and deduplicate
  - Regular alert review
Observability Checklist
- Structured logging with trace correlation?
- Key metrics (RED/USE) instrumented?
- Distributed tracing enabled?
- Dashboards for each service?
- SLO-based alerting?
- Runbooks for common issues?
- Log retention policy defined?
- Trace sampling strategy?