Create a Grafana Dashboard for Backstage Metrics on Mimir (OTel → Alloy → Mimir)

This post continues from the previous entry in this series, where we set up the following Backstage metrics flow:

Backstage (OpenTelemetry SDK) → OTLP/HTTP → Alloy → remote_write → Mimir → Grafana

We’ll build a Grafana dashboard using the Mimir datasource.


Lab context

  • Namespace (Backstage): backstage
  • Namespace (LGTM): observability
  • Datasource: Mimir (Prometheus-compatible)

Prerequisites

Before building panels, confirm metrics are arriving: in Grafana → Explore, select the Mimir datasource and run:

count({service_name="homelab-backstage"})

A non-empty result means Backstage series are reaching Mimir.
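If you want a mental model of what that query returns, here is a plain-Python sketch (not a Mimir client; the series below are made-up examples): count({...}) simply counts how many series carry the matching label, across all metric names.

```python
# Made-up series for illustration; each dict stands for one time series.
series = [
    {"__name__": "http_server_duration_milliseconds_count", "service_name": "homelab-backstage"},
    {"__name__": "nodejs_eventloop_utilization_ratio", "service_name": "homelab-backstage"},
    {"__name__": "up", "service_name": "other-app"},
]

# count({service_name="homelab-backstage"}): count label-matched series,
# regardless of metric name.
matching = [s for s in series if s.get("service_name") == "homelab-backstage"]
assert len(matching) == 2  # any non-empty result means metrics are flowing
```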

Notes on OTLP export cadence

If your app exports metrics every ~60s (e.g. exportIntervalMillis: 60000), prefer [5m] windows for rate() and histogram queries: rate() needs at least two samples inside the window, so shorter windows can come up empty or spiky.
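A quick back-of-the-envelope sketch of why, assuming one exported data point every 60 s:

```python
# rate() needs at least two samples inside its window; with one
# export every 60 s, short windows can easily catch too few.

def samples_in_window(window_s: int, export_interval_s: int = 60) -> int:
    """Approximate number of data points a range window contains
    when the app exports one point every export_interval_s."""
    return window_s // export_interval_s

assert samples_in_window(120) == 2  # [2m]: barely enough; gaps if misaligned
assert samples_in_window(300) == 5  # [5m]: enough points for a smooth rate()
```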


  1. Create dashboard variables

Constant variable: svc

  • Type: Constant
  • Name: svc
  • Value: homelab-backstage

Query variables (enable Multi-value and Include All, with the Custom all value set to .*)

All three use Query type = Label values, the Mimir datasource, and the metric http_server_duration_milliseconds_count.

  • route (display label: Route): label = http_route, filter: service_name = $svc
  • method (display label: Method): label = http_method, filter: service_name = $svc
  • status (display label: Status): label = http_status_code, filter: service_name = $svc

  2. Top stats (Stat)
  • Requests/sec (RPS)
sum(rate(http_server_duration_milliseconds_count{service_name="$svc"}[5m]))
  • 5xx error rate (%), zero-safe
100 * (
  (sum(rate(http_server_duration_milliseconds_count{service_name="$svc", http_status_code=~"5.."}[5m])) or vector(0))
  /
  clamp_min((sum(rate(http_server_duration_milliseconds_count{service_name="$svc"}[5m])) or vector(0)), 1e-9)
)
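The zero-safe pattern above can be sketched in plain Python: `or vector(0)` maps an absent series to 0, and clamp_min keeps the denominator positive.

```python
def error_rate_pct(errors_5xx, total, eps=1e-9):
    """Zero-safe 5xx percentage, mirroring the PromQL above."""
    errors_5xx = errors_5xx or 0.0   # absent series -> 0 (like `or vector(0)`)
    total = total or 0.0
    return 100.0 * errors_5xx / max(total, eps)  # like clamp_min(total, 1e-9)

assert error_rate_pct(0.5, 10.0) == 5.0   # 0.5 err/s of 10 req/s -> 5%
assert error_rate_pct(None, None) == 0.0  # no traffic at all -> 0%, not NaN
```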
  • p95 latency (ms)
histogram_quantile(
  0.95,
  sum(rate(http_server_duration_milliseconds_bucket{service_name="$svc"}[5m])) by (le)
)
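For intuition, here is a simplified plain-Python sketch of what histogram_quantile does with the cumulative bucket rates: find the bucket where the cumulative count crosses the target rank, then interpolate linearly. The `le` bounds and counts below are made-up examples.

```python
def histogram_quantile(q, buckets):
    """buckets: list of (le_upper_bound, cumulative_count), sorted by le;
    the last entry is (inf, total). Simplified vs. Prometheus."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return lower_bound  # fall back to the last finite bound
            # linear interpolation inside the bucket
            return lower_bound + (le - lower_bound) * (rank - prev_count) / (count - prev_count)
        lower_bound, prev_count = le, count

buckets = [(50.0, 60.0), (100.0, 90.0), (250.0, 99.0), (float("inf"), 100.0)]
# rank = 0.95 * 100 = 95, which falls in the (100, 250] bucket:
# 100 + 150 * (95 - 90) / (99 - 90) ≈ 183.3 ms
assert abs(histogram_quantile(0.95, buckets) - 183.333) < 0.01
```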

  3. HTTP panels
  • Top routes by requests/min (topk 10)

Chart type: Time series (points or bars)

With sparse traffic you may see periodic spikes; that’s normal. Dividing the 2-minute increase by 2 converts it to requests per minute.

topk(10,
  sum(increase(http_server_duration_milliseconds_count{service_name="$svc", http_route=~".+"}[2m]))
  by (http_route)
) / 2
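A plain-Python sketch of the topk-and-divide arithmetic (route names and values are made up):

```python
def topk(k, series):
    """Mimic PromQL topk(): keep the k series with the largest values."""
    return dict(sorted(series.items(), key=lambda kv: kv[1], reverse=True)[:k])

# increase() over [2m] counts requests in those 2 minutes per route;
# dividing by 2 converts that into requests per minute.
increase_2m = {"/api/catalog": 24.0, "/healthcheck": 120.0, "/api/auth": 6.0}
per_minute = {route: v / 2 for route, v in topk(2, increase_2m).items()}

assert per_minute == {"/healthcheck": 60.0, "/api/catalog": 12.0}
```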
  • p95 latency by route (ms)

Chart type: Time series (lines)

histogram_quantile(
  0.95,
  sum(rate(http_server_duration_milliseconds_bucket{service_name="$svc", http_route=~".+"}[5m]))
  by (le, http_route)
)
  • Status code breakdown (RPS)

Chart type: Time series

sum(rate(http_server_duration_milliseconds_count{service_name="$svc"}[5m]))
by (http_status_code)
  • Filtered p95 latency (route/method/status)

Chart type: Time series

histogram_quantile(
  0.95,
  sum(rate(http_server_duration_milliseconds_bucket{
    service_name="$svc",
    http_route=~"$route",
    http_method=~"$method",
    http_status_code=~"$status"
  }[5m])) by (le)
)

  4. Node.js / V8 runtime panels (Time series)
  • Event loop utilization
avg(nodejs_eventloop_utilization_ratio{service_name="$svc"})
  • Event loop delay p99 (s)
max(nodejs_eventloop_delay_p99_seconds{service_name="$svc"})
  • V8 heap used (bytes)
max(v8js_memory_heap_used_bytes{service_name="$svc"})

  5. Backstage + DB panels (Time series)
  • Catalog entities count (if available)
max(catalog_entities_count{service_name="$svc"})
  • Stitching queue length
max(catalog_stitching_queue_length{service_name="$svc"})
  • Stitching queue delay (avg, s)
sum(rate(catalog_stitching_queue_delay_seconds_sum{service_name="$svc"}[5m]))
/
clamp_min(sum(rate(catalog_stitching_queue_delay_seconds_count{service_name="$svc"}[5m])), 1e-9)
  • Processing queue delay (avg, s)
sum(rate(catalog_processing_queue_delay_seconds_sum{service_name="$svc"}[5m]))
/
clamp_min(sum(rate(catalog_processing_queue_delay_seconds_count{service_name="$svc"}[5m])), 1e-9)
  • DB operation duration by operation (avg, s)
sum(rate(db_client_operation_duration_seconds_sum{service_name="$svc"}[5m])) by (db_operation_name)
/
clamp_min(sum(rate(db_client_operation_duration_seconds_count{service_name="$svc"}[5m])) by (db_operation_name), 1e-9)
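All three average panels follow the same counters pattern: a histogram or summary exports `_sum` and `_count` counters, and rate(_sum) / rate(_count) gives the average value over the window. A plain-Python sketch:

```python
def avg_duration(rate_sum, rate_count, eps=1e-9):
    """Average seconds per operation from per-second counter rates."""
    return rate_sum / max(rate_count, eps)  # clamp_min avoids div-by-zero

# 2.0 s of DB time accrued per second, spread over 40 ops per second:
assert avg_duration(2.0, 40.0) == 0.05  # 50 ms average per operation
assert avg_duration(0.0, 0.0) == 0.0    # idle window -> 0, not NaN
```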
