Self-Hosted Observability: Community Stacks for Logs, Metrics & Traces

This Hacker News discussion covers the self-hosted open-source stacks engineers use for logs, traces, and metrics, along with their practical experiences and recommendations. Many participants take a pragmatic approach, weighing control and cost savings against the engineering time that self-hosting consumes.

Dominant OSS Stacks and Key Alternatives

The classic Prometheus and Grafana stack, often complemented by Loki for logs and Tempo for traces, remains a go-to choice, particularly for smaller projects. Its well-established ecosystem and community support make it a reliable starting point. Users also stress the growing importance of OpenTelemetry collectors, both for more sophisticated data processing and for maintaining vendor neutrality.
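
As a rough illustration of what a Prometheus scrape target serves, the sketch below renders counters in the Prometheus text exposition format using only the Python standard library. The function name render_metrics and the metric http_requests_total are hypothetical, chosen for illustration; real services would normally use an official client library or an OpenTelemetry SDK rather than hand-rolling this.

```python
def render_metrics(counters: dict[str, tuple[str, float]]) -> str:
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to (help text, current value).
    Hypothetical helper for illustration; not part of any library.
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")  # human-readable description
        lines.append(f"# TYPE {name} counter")      # metric type declaration
        lines.append(f"{name} {value}")             # the sample itself
    return "\n".join(lines) + "\n"
```

Serving the returned string at a /metrics endpoint is what lets Prometheus, or a compatible scraper such as vmagent, collect the values.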

However, VictoriaMetrics emerges as a strong contender, frequently praised for its impressive performance and significantly lower resource consumption (e.g., a reported 7x RAM reduction compared to Prometheus). The suite, including vmagent and vmalert, is described as "rock solid, lean and performant." While highly recommended, some users point out that its documentation could be more comprehensive and that building Debian packages from source might be preferable for stability. VictoriaMetrics also offers VictoriaLogs, positioned as an alternative to Loki, with its CTO actively inviting feedback for usability improvements.

Challenges with Visualization and Management

A notable pain point concerned Grafana, particularly for users operating in air-gapped, Infrastructure-as-Code (IaC) environments. Managing dashboards and plugins in such setups was described as "like pulling teeth." One user described a Grafana container image update (the image was pinned by tag rather than by SHA digest) that broke dashboard and data-source links, prompting them to explore alternatives.
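
The tag-versus-digest distinction can be checked mechanically. The following is a minimal sketch, assuming image references are available as plain strings (for example, extracted from IaC manifests); the helper names and the all-zero digest in the usage note are hypothetical placeholders, not values from the discussion.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to an immutable digest."""
    return bool(_DIGEST_RE.search(image_ref))


def unpinned_images(image_refs):
    """Return the references that rely on a mutable tag (or no tag at all)."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]
```

For example, unpinned_images(["grafana/grafana:latest", "grafana/grafana@sha256:" + "0" * 64]) returns only the first entry. A tag can be re-pointed to a different image by the publisher, whereas a digest identifies one specific image, which is why a check like this could run in CI before deployment.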

Logging Solutions: Beyond the ELK Stack

While the ELK (Elasticsearch, Logstash, Kibana) stack is a known quantity, the discussion highlighted other approaches:

Fluentd + Elasticsearch: Chosen by some for its power, full data control, and cost-effectiveness at scale for self-hosted logs.
Loki: Often paired with Grafana in the Prometheus ecosystem.
Simple Log Management: For solo developers or very small setups, tools like logrotate, systemd, and journalctl | grep are still in use. However, the limitations are clear: lack of centralized logging and data loss if the host machine becomes unavailable. Some mitigate this by storing critical events in a primary database like PostgreSQL.
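
The idea of keeping critical events in a primary database can be sketched with a custom logging handler. This is an illustrative sketch only: it uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without extra dependencies, and the table name critical_events, the logger name app.events, and the class and function names are all hypothetical.

```python
import logging
import sqlite3


class DatabaseEventHandler(logging.Handler):
    """Persist records at ERROR level or above to a database table."""

    def __init__(self, conn, level=logging.ERROR):
        super().__init__(level)
        self.conn = conn
        # Hypothetical schema; adapt column types for PostgreSQL.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS critical_events ("
            "created REAL, level TEXT, logger TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.conn.execute(
                "INSERT INTO critical_events VALUES (?, ?, ?, ?)",
                (record.created, record.levelname, record.name,
                 record.getMessage()),
            )
            self.conn.commit()
        except Exception:
            # Never let a logging failure crash the application.
            self.handleError(record)


def build_logger(conn):
    """Return a logger whose critical events are also written to `conn`."""
    logger = logging.getLogger("app.events")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(DatabaseEventHandler(conn))
    return logger
```

With a logger from build_logger(sqlite3.connect(":memory:")), a call to logger.error(...) inserts a row while logger.info(...) does not, so routine output can stay in journald and only the events worth keeping reach the database. Call build_logger once per process, since each call attaches another handler.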

Hybrid Approaches: The Best of Both Worlds?

Several contributors advocate for a pragmatic, hybrid approach. One company, Markhub, shared their strategy:

Self-host logs (Fluentd/Elasticsearch) for control and cost-effectiveness.
Use a managed SaaS (Datadog) for metrics and monitoring to save engineering time, leveraging its out-of-the-box dashboards and integrations.
Explore OpenTelemetry for traces to ensure vendor neutrality, even while Datadog APM remains in use for now.

This approach underscores a key theme: weighing where engineering time is best spent against the direct control and cost savings that self-hosting offers.

Other Notable Tools and Considerations

LibreNMS with SNMP: Mentioned as a standard way to gather a wide range of metrics.
Shynet: A very lightweight, self-hostable visitor tracking system for websites.
Vendor Lock-in: A recurring concern, with OpenTelemetry often cited as a strategy to mitigate this risk.

For anyone considering or already running a self-hosted observability stack, the discussion offers a snapshot of popular tools, their real-world performance, and the practical trade-offs involved.