Having metrics doesn't mean you have observability. Here's how I set up monitoring stacks that help teams actually understand what's happening.
Metrics vs Logs vs Traces
Metrics tell you what's happening (CPU usage, request rate). Logs tell you what happened (error messages, events). Traces tell you where it happened (request flow through services). You need all three.
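To make the distinction concrete, here is a minimal sketch of one request handler emitting all three signals. It uses only the standard library, and the in-memory `metrics`, `logs`, and `spans` sinks and the `handle_request` function are hypothetical stand-ins for real backends, not anything a particular tool provides.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks. A real stack would ship these to a
# metrics store, a log pipeline, and a tracing backend respectively.
metrics = Counter()
logs = []
spans = []


def handle_request(path, fail=False):
    """Handle one fake request and emit a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = 500 if fail else 200

    # Metric: an aggregate count, cheap to store. Answers "what is happening".
    metrics[("http_requests_total", path, status)] += 1

    # Log: a discrete event with detail, written here only on failure.
    # Answers "what happened".
    if fail:
        logs.append(json.dumps({
            "level": "error",
            "msg": "upstream timeout",
            "path": path,
            "trace_id": trace_id,
        }))

    # Trace span: timing for one hop of one request. Answers "where it happened".
    spans.append({
        "trace_id": trace_id,
        "name": f"GET {path}",
        "duration_s": time.perf_counter() - start,
        "status": status,
    })
    return trace_id
```

Putting the same `trace_id` on the log line and the span is what lets you jump from an error message to the request flow that produced it.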
Building Your Stack
I use Prometheus for metrics collection, Grafana for visualization, the ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation, and Jaeger for distributed tracing. Start with metrics, then add logs, then traces.
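If you start with metrics, it helps to know what Prometheus actually collects: it scrapes an HTTP endpoint that returns plain text in its exposition format. The sketch below renders one counter family in that format by hand. The `render_counter` helper is hypothetical and assumes label values contain no quotes, backslashes, or newlines; in practice an official client library would handle this for you.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Assumption: label values need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


samples = {
    (("method", "GET"), ("status", "200")): 42,
    (("method", "GET"), ("status", "500")): 3,
}
text = render_counter("http_requests_total", "Total HTTP requests.", samples)
```

Each sample becomes one line such as `http_requests_total{method="GET",status="200"} 42`, which is the shape Grafana queries end up aggregating over.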
What to Monitor
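The four golden signals are latency, traffic, errors, and saturation: how long requests take, how much demand the service is handling, how many requests fail, and how close the service is to its capacity. Monitor these for every service. Everything else is nice-to-have.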
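As a rough illustration, the four signals can be computed from a window of request records plus a concurrency reading. This is a sketch under assumptions: the `Request` record, the choice of p95 for latency, counting only 5xx responses as errors, and `in_flight / capacity` as the saturation measure are all illustrative choices, not the only reasonable definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile. `values` must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one non-empty window of requests as the four golden signals."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": percentile(durations, 95),  # Latency
        "traffic_rps": len(requests) / window_s,     # Traffic
        "error_ratio": errors / len(requests),       # Errors
        "saturation": in_flight / capacity,          # Saturation
    }


# 18 fast successes and 2 slow failures observed over a 10 second window.
window = [Request(0.1, 200)] * 18 + [Request(2.0, 500)] * 2
signals = golden_signals(window, window_s=10, in_flight=8, capacity=10)
```

A percentile is used for latency instead of the mean because a handful of slow requests can hide behind a healthy-looking average.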
Alerting Done Right
Alert on symptoms, not causes: alert on what users experience (slow responses), not on what you think might be wrong (high CPU). Set thresholds at levels where someone actually needs to act, and test your alerts so you know they fire when they should.
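Here is one way a symptom-based alert and its test might look. The `should_alert` function is a hypothetical sketch: it fires only when a user-facing signal (p95 latency) stays above the threshold for several consecutive windows, loosely similar in spirit to the `for` clause on a Prometheus alerting rule. The threshold and window count are made-up values.

```python
def should_alert(p95_latencies_s, threshold_s, for_windows):
    """Fire only if the last `for_windows` samples all breach the threshold."""
    if for_windows < 1:
        raise ValueError("for_windows must be at least 1")
    if len(p95_latencies_s) < for_windows:
        return False  # Not enough data yet to call it sustained.
    recent = p95_latencies_s[-for_windows:]
    return all(value > threshold_s for value in recent)


# Testing the alert with synthetic data: a single spike should stay quiet,
# while a sustained slowdown should fire.
spike = [0.2, 0.9, 0.2]
sustained = [0.2, 0.9, 0.8, 0.7]
spike_fires = should_alert(spike, threshold_s=0.5, for_windows=3)
sustained_fires = should_alert(sustained, threshold_s=0.5, for_windows=3)
```

Requiring consecutive breaches trades a little detection delay for far fewer pages on momentary blips.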
Observability isn't about collecting all the data. It's about collecting the right data and making it actionable.