- Updated: March 22, 2026
- 6 min read
Per‑Tenant Observability in OpenClaw SaaS: Prometheus Metrics, Grafana Dashboards, and Alerting
Per‑tenant observability in OpenClaw SaaS is achieved by instrumenting each tenant with isolated Prometheus metrics, generating tenant‑specific Grafana dashboards, and routing alerts through Alertmanager so that every customer receives precise, actionable insights.
Introduction
Multi‑tenant SaaS platforms such as OpenClaw SaaS must deliver observability that isolates data per customer while remaining cost‑effective at scale. Operators, DevOps engineers, and SREs often ask: How can I collect metrics for hundreds of tenants without mixing data, and still provide each tenant with a personalized dashboard and reliable alerts? This guide answers that question step‑by‑step, covering Prometheus instrumentation, Grafana dashboard templating, Alertmanager routing, and proven scaling techniques.
Per‑Tenant Prometheus Instrumentation
Exporters per tenant
Prometheus gathers data via exporters: small HTTP servers that expose metrics at a /metrics endpoint. In a multi‑tenant environment you have two viable patterns:
- Dedicated exporter per tenant: Deploy a separate exporter instance (or sidecar) for each tenant’s micro‑service. This guarantees strict namespace isolation because each exporter only knows its tenant’s resources.
- Multi‑tenant exporter with label segregation: Use a single exporter that adds a tenant_id label to every metric. This reduces pod count but requires careful RBAC and query hygiene.
For OpenClaw, the dedicated exporter model is recommended for high‑value customers, while the label‑segregated approach works for low‑volume tenants.
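To make the label‑segregated pattern concrete, here is a minimal sketch using Python's prometheus_client library; the metric name mirrors the queries used later in this guide, while the port and tenant IDs are illustrative assumptions:

```python
# Minimal sketch: one exporter process serving many tenants, with every
# sample carrying a tenant_id label. Port 9100 and tenant IDs are assumptions.
import time

from prometheus_client import Counter, start_http_server

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests handled, segregated by tenant",
    ["tenant_id"],
)

def handle_request(tenant_id: str) -> None:
    # Each observation is attributed to exactly one tenant, so any
    # downstream PromQL query can filter with {tenant_id="..."}.
    HTTP_REQUESTS.labels(tenant_id=tenant_id).inc()

if __name__ == "__main__":
    start_http_server(9100)        # exposes /metrics on :9100
    handle_request("tenant-123")   # simulated traffic
    handle_request("tenant-456")
    time.sleep(3600)               # keep the endpoint alive for scraping
```

The dedicated‑exporter variant looks almost identical, except the process serves a single tenant and the label can instead be injected at scrape time, as shown in the next section.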
Scrape configuration
Prometheus’s scrape_configs block defines where to pull metrics. To keep tenant data separate, create a scrape_config per tenant:
```yaml
scrape_configs:
  - job_name: 'openclaw_tenant_{{tenant_id}}'
    static_configs:
      - targets: ['{{tenant_exporter_host}}:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: tenant_id
        replacement: '{{tenant_id}}'
```
Using relabel_configs injects the tenant_id label automatically, ensuring that any downstream query can filter by tenant without additional logic.
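Hand‑maintaining one stanza per tenant does not scale, so in practice these blocks are rendered from a tenant registry. Below is a hedged sketch in Python; the registry contents, exporter hostnames, and the PyYAML dependency are all assumptions:

```python
# Render one scrape_config stanza per tenant from a simple registry.
# The registry source and exporter hostnames are assumptions.
import yaml  # PyYAML

TENANTS = {"123": "exporter-123.openclaw.internal",
           "456": "exporter-456.openclaw.internal"}

def render_scrape_configs(tenants: dict) -> str:
    configs = []
    for tenant_id, host in sorted(tenants.items()):
        configs.append({
            "job_name": f"openclaw_tenant_{tenant_id}",
            "static_configs": [{"targets": [f"{host}:9090"]}],
            # Inject the tenant_id label exactly as in the stanza above.
            "relabel_configs": [{
                "source_labels": ["__address__"],
                "target_label": "tenant_id",
                "replacement": tenant_id,
            }],
        })
    return yaml.safe_dump({"scrape_configs": configs}, sort_keys=False)

if __name__ == "__main__":
    print(render_scrape_configs(TENANTS))
```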
Building Tenant‑Specific Grafana Dashboards
Data sources and variables
Grafana connects to Prometheus via a data source. Create a single Prometheus data source that points to the central Prometheus server, then define a dashboard variable $tenant that lists all tenant IDs:
- Navigate to Settings → Variables → New.
- Set the variable's query to label_values(tenant_id).
- Enable Multi‑value if you need composite views.
Every panel can now reference $tenant in its PromQL query, e.g., sum(rate(http_requests_total{tenant_id="$tenant"}[5m])). Note that if Multi‑value is enabled, the panel must use the regex matcher tenant_id=~"$tenant" so Grafana can expand several selected tenants into one expression.
Dashboard templating
Grafana’s templating engine lets you build a single “master” dashboard that automatically adapts to any tenant. Follow these steps:
- Design the layout with generic panels (CPU, memory, request latency, error rate).
- Replace hard‑coded metric names with variables ($tenant, $instance, etc.).
- Save the dashboard as a JSON model and use the /api/dashboards/db endpoint to clone it per tenant, injecting the tenant ID into the templating section.
When a new tenant is onboarded, a small automation script calls the Grafana API, creates a copy of the master dashboard, and sets the default $tenant value to the new tenant’s ID. This approach eliminates manual UI work and guarantees visual consistency.
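A sketch of such an onboarding script is shown below; the Grafana URL, API token, master dashboard UID, and the variable name tenant are assumptions, while /api/dashboards/uid/:uid and /api/dashboards/db are Grafana's standard dashboard API endpoints:

```python
# Sketch: clone the master dashboard for a new tenant via Grafana's HTTP API.
# The URL, token, and UID below are placeholders, not real credentials.
import copy

import requests

GRAFANA_URL = "https://grafana.example.com"
HEADERS = {"Authorization": "Bearer <API_TOKEN>"}

def clone_dashboard_for_tenant(master_uid: str, tenant_id: str) -> None:
    # Fetch the master dashboard's JSON model.
    resp = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{master_uid}",
        headers=HEADERS, timeout=10,
    )
    resp.raise_for_status()
    model = copy.deepcopy(resp.json()["dashboard"])

    # Reset identity fields so Grafana creates a new dashboard.
    model["id"] = None
    model["uid"] = None
    model["title"] = f"OpenClaw - Tenant {tenant_id}"

    # Pin the $tenant template variable to this tenant's ID.
    for var in model.get("templating", {}).get("list", []):
        if var.get("name") == "tenant":
            var["current"] = {"text": tenant_id, "value": tenant_id}

    payload = {"dashboard": model, "overwrite": False,
               "message": f"Onboarding clone for tenant {tenant_id}"}
    requests.post(f"{GRAFANA_URL}/api/dashboards/db",
                  headers=HEADERS, json=payload, timeout=10).raise_for_status()
```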
Embedding an internal link
For a broader view of how UBOS enables multi‑tenant SaaS platforms, see the UBOS platform overview. The principles of isolation and templating are directly applicable to OpenClaw’s observability stack.
Configuring Alerts for Each Tenant
Alertmanager routing
Prometheus sends alerts to Alertmanager, which then decides where to deliver them. To keep alerts tenant‑aware, configure a receiver per tenant and use matchers on the tenant_id label:
```yaml
route:
  receiver: 'default'
  routes:
    - receiver: 'tenant-{{tenant_id}}-slack'
      matchers:
        - tenant_id="{{tenant_id}}"
    - receiver: 'tenant-{{tenant_id}}-email'
      matchers:
        - tenant_id="{{tenant_id}}"

receivers:
  - name: 'tenant-123-slack'
    slack_configs:
      - channel: '#tenant-123-alerts'
        send_resolved: true
  - name: 'tenant-456-email'
    email_configs:
      - to: 'alerts-456@example.com'
        send_resolved: true
```
Automation can generate these sections whenever a tenant is added, ensuring that alerts never cross tenant boundaries.
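One way to implement that generation is sketched below in Python; the tenant registry, the Slack channel naming convention, and the PyYAML dependency are assumptions:

```python
# Sketch: render per-tenant Alertmanager routes and receivers from a
# tenant registry; channel names follow an assumed naming convention.
import yaml  # PyYAML

def render_alertmanager_config(tenant_ids: list) -> str:
    routes, receivers = [], [{"name": "default"}]
    for t in tenant_ids:
        routes.append({
            "receiver": f"tenant-{t}-slack",
            "matchers": [f'tenant_id="{t}"'],
        })
        receivers.append({
            "name": f"tenant-{t}-slack",
            "slack_configs": [{
                "channel": f"#tenant-{t}-alerts",
                "send_resolved": True,
            }],
        })
    config = {
        "route": {"receiver": "default", "routes": routes},
        "receivers": receivers,
    }
    return yaml.safe_dump(config, sort_keys=False)

if __name__ == "__main__":
    print(render_alertmanager_config(["123", "456"]))
```

Regenerate the file on every tenant change and reload Alertmanager (e.g., via SIGHUP) so the new routes take effect.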
Notification channels
OpenClaw supports multiple notification channels:
- Slack / Microsoft Teams: Ideal for real‑time incident response.
- Email: Works for compliance‑oriented customers.
- Webhook: Allows tenants to forward alerts to their own ticketing system (Jira, ServiceNow, etc.).
When configuring a webhook, make sure the tenant_id label reaches the downstream system: Alertmanager forwards alert labels in its webhook payload automatically, so the receiving side can read tenant_id and tag the incident correctly.
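For illustration, a tenant‑side webhook receiver might look like the following sketch; Flask and the endpoint path are assumptions, while the alerts and commonLabels fields are part of Alertmanager's standard webhook payload:

```python
# Sketch: a downstream webhook handler that tags incoming incidents with
# the tenant_id carried in the alert labels. Flask is an assumption.
from flask import Flask, request

app = Flask(__name__)

@app.post("/alertmanager-webhook")
def receive_alerts():
    payload = request.get_json(force=True)
    # Alertmanager puts labels shared by all alerts in the group under
    # "commonLabels"; tenant_id was injected at scrape time.
    tenant_id = payload.get("commonLabels", {}).get("tenant_id", "unknown")
    for alert in payload.get("alerts", []):
        # Forward to the tenant's ticketing system here; we just log.
        print(f"[tenant {tenant_id}] {alert['labels'].get('alertname')} "
              f"status={alert['status']}")
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=8080)
```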
Scaling Observability
Sharding Prometheus
As the number of tenants grows, a single Prometheus instance can become a bottleneck. Sharding distributes scrape load and storage across multiple instances:
| Shard | Tenant Range | Storage Size | Retention |
|---|---|---|---|
| Shard‑1 | 001‑1000 | 150 GB | 30 days |
| Shard‑2 | 1001‑2000 | 150 GB | 30 days |
| Shard‑N | … | … | … |
Each shard runs its own Prometheus server, but all shards share a common Alertmanager cluster, preserving a unified alerting experience.
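Tenant‑to‑shard assignment should be deterministic so that a tenant's series always land on the same shard. A simple hash‑based scheme is sketched below; the shard count is an assumption, and Prometheus's hashmod relabel action offers a native equivalent:

```python
# Deterministic tenant-to-shard mapping: the same tenant_id always hashes
# to the same shard, so its series stay on one Prometheus instance.
import hashlib

NUM_SHARDS = 4  # assumption; grow this as the tenant count increases

def shard_for_tenant(tenant_id: str) -> int:
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

if __name__ == "__main__":
    for t in ("tenant-123", "tenant-456", "tenant-789"):
        print(t, "->", f"shard-{shard_for_tenant(t) + 1}")
```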
Grafana performance tips
Grafana can become sluggish when dashboards query massive time series. Apply these optimizations:
- Use max_over_time and rate wisely: Reduce the number of raw samples returned.
- Enable query caching: Grafana Enterprise offers built‑in caching; the open‑source edition can sit behind a Redis‑backed caching proxy.
- Limit dashboard time range defaults: Set the default view to “last 1 hour” instead of “last 30 days”.
- Trim large tables: PromQL has no generic limit clause, so use topk() to cap the number of series a table panel pulls.
Alerting scalability
Alertmanager itself can be horizontally scaled by running multiple replicas behind a load balancer. Ensure that each replica's --cluster.peer flags list all of its peers so that silences and notification state are synchronized via gossip. Additionally, consider the following:
- Group alerts by tenant_id to reduce the number of outbound messages (see the route sketch below).
- Throttle webhook deliveries to avoid overwhelming tenant‑owned ticketing APIs.
- Persist Alertmanager state on durable storage (--storage.path) so that silences and the notification log survive restarts; the cluster gossip replicates that state across replicas.
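The grouping and throttling knobs live on the top‑level route; a short sketch follows, with all interval values being assumptions to tune against each tenant's SLA:

```python
# Sketch: a top-level route that batches alerts per tenant and throttles
# repeats; every interval value below is an assumption.
import yaml  # PyYAML

route = {
    "receiver": "default",
    # One notification per tenant/alertname group instead of one per alert.
    "group_by": ["tenant_id", "alertname"],
    "group_wait": "30s",      # wait briefly so related alerts batch up
    "group_interval": "5m",   # minimum gap between updates for a group
    "repeat_interval": "4h",  # throttle re-sends of unresolved alerts
}

print(yaml.safe_dump({"route": route}, sort_keys=False))
```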
Best Practices and Tips
Below is a concise checklist that operators can copy‑paste into their runbooks:
- Assign a unique tenant_id label to every metric at the source.
- Prefer dedicated exporters for high‑value tenants; use label‑segregated exporters for low‑volume tenants.
- Automate Grafana dashboard cloning via the /api/dashboards/db endpoint.
- Generate Alertmanager routing rules programmatically whenever a tenant is added or removed.
- Shard Prometheus after 500‑1000 tenants to keep scrape latency < 2 seconds.
- Enable Grafana query caching and set sensible default time ranges.
- Monitor the health of the observability stack itself (Prometheus scrape failures, Grafana rendering latency, Alertmanager queue length).
Conclusion
Per‑tenant observability in OpenClaw SaaS is not a “nice‑to‑have” feature—it is a prerequisite for delivering trustworthy, compliant services to multiple customers. By instrumenting each tenant with isolated Prometheus exporters, leveraging Grafana’s templating engine for dynamic dashboards, and configuring Alertmanager to route alerts based on the tenant_id label, operators can achieve granular visibility without sacrificing scalability.
Remember to shard Prometheus as tenant count grows, apply Grafana performance best practices, and keep Alertmanager in a highly‑available configuration. When these patterns are combined, you get a robust, cost‑effective observability stack that scales with your SaaS business.
For the latest announcement on OpenClaw SaaS’s roadmap, see the OpenClaw SaaS announcement.