Telemetry and metrics

The ToolHive Registry Server provides comprehensive observability through OpenTelemetry (OTel), supporting both distributed tracing and metrics collection via OTLP exporters. This enables you to monitor system behavior, diagnose issues, and improve performance.

Architecture overview

The Registry Server exports telemetry data (traces and metrics) via OTLP HTTP to an OpenTelemetry Collector, which can then forward it to various backends.

Configuration

Add telemetry configuration to your Registry Server configuration file:

config.yaml
telemetry:
  enabled: true
  serviceName: thv-registry-api
  serviceVersion: '1.0.0'
  endpoint: otel-collector:4318
  insecure: true
  tracing:
    enabled: true
    sampling: 0.05
  metrics:
    enabled: true

Configuration options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable or disable all telemetry |
| serviceName | string | thv-registry-api | Service name in telemetry data |
| serviceVersion | string | "unknown" | Service version in telemetry data |
| endpoint | string | localhost:4318 | OTLP HTTP endpoint (host:port) |
| insecure | bool | false | Use insecure connection (no TLS) |
| tracing.enabled | bool | false | Enable distributed tracing |
| tracing.sampling | float | 0.05 | Trace sampling ratio (0.0 to 1.0) |
| metrics.enabled | bool | false | Enable metrics collection |

Note: The endpoint is provided as a hostname and optional port, without a scheme or path (e.g., use api.honeycomb.io or api.honeycomb.io:443, not https://api.honeycomb.io). The server automatically uses HTTPS unless insecure: true is specified.
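
To make the defaults and the endpoint rule concrete, the following Python sketch overlays a partial configuration on the documented defaults and then derives an exporter URL from the host:port form. It is an illustration only: `effective_telemetry` and `otlp_url` are hypothetical helpers, and the `/v1/traces` and `/v1/metrics` paths are the standard OTLP/HTTP defaults, which the server is assumed (not confirmed) to use.

```python
from copy import deepcopy

# Defaults copied from the configuration options table.
DEFAULTS = {
    "enabled": False,
    "serviceName": "thv-registry-api",
    "serviceVersion": "unknown",
    "endpoint": "localhost:4318",
    "insecure": False,
    "tracing": {"enabled": False, "sampling": 0.05},
    "metrics": {"enabled": False},
}


def effective_telemetry(user: dict) -> dict:
    """Overlay user-supplied values on the documented defaults (hypothetical helper)."""
    merged = deepcopy(DEFAULTS)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # merge nested sections such as tracing
        else:
            merged[key] = value
    sampling = merged["tracing"]["sampling"]
    if not 0.0 <= sampling <= 1.0:
        raise ValueError(f"tracing.sampling must be in [0.0, 1.0], got {sampling}")
    return merged


def otlp_url(endpoint: str, insecure: bool = False, signal: str = "traces") -> str:
    """Build an OTLP/HTTP URL from a scheme-less endpoint (assumed path layout)."""
    if "/" in endpoint:
        raise ValueError("endpoint must be host or host:port, without scheme or path")
    scheme = "http" if insecure else "https"  # HTTPS unless insecure is set
    return f"{scheme}://{endpoint}/v1/{signal}"


cfg = effective_telemetry({"enabled": True, "endpoint": "otel-collector:4318", "insecure": True})
print(otlp_url(cfg["endpoint"], cfg["insecure"], "traces"))
```

Rejecting any endpoint that contains a slash catches both a pasted scheme (`https://...`) and a trailing path, the two mistakes the note warns about.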

Metrics

The Registry Server exposes metrics prefixed with thv_reg_srv_ for easy identification.

Available metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| thv_reg_srv_http_request_duration_seconds | Histogram | method, route, status_code | Duration of HTTP requests |
| thv_reg_srv_http_requests_total | Counter | method, route, status_code | Total number of HTTP requests |
| thv_reg_srv_http_active_requests | UpDownCounter | - | Number of in-flight requests |
| thv_reg_srv_servers_total | Gauge | registry | Number of servers per registry |
| thv_reg_srv_sync_duration_seconds | Histogram | registry, success | Duration of sync operations |

Histogram buckets

The metrics use the following histogram bucket boundaries:

  • HTTP request duration: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds
  • Sync duration: 0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300 seconds
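
As a rough illustration of how these boundaries are used, the Python sketch below finds the bucket an observation lands in. It assumes upper-inclusive boundaries, as in OpenTelemetry explicit-bucket histograms; `bucket_upper_bound` is a hypothetical helper, not part of the Registry Server.

```python
from bisect import bisect_left

# Boundaries copied from the list above, in seconds.
HTTP_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
SYNC_BUCKETS = [0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300]


def bucket_upper_bound(seconds: float, bounds: list) -> float:
    """Return the upper bound of the bucket containing `seconds`.

    A value equal to a boundary belongs to that boundary's bucket;
    values above the last boundary fall into the overflow (+Inf) bucket.
    """
    index = bisect_left(bounds, seconds)
    return bounds[index] if index < len(bounds) else float("inf")


print(bucket_upper_bound(0.3, HTTP_BUCKETS))  # a 300 ms request
print(bucket_upper_bound(45, SYNC_BUCKETS))   # a 45 s sync
```

A 300 ms request lands in the 0.5 s bucket, so latency percentiles computed from these histograms are only as precise as the gap between neighboring boundaries.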

Distributed tracing

The Registry Server implements distributed tracing across two layers: HTTP requests and service operations.

Trace hierarchy

Traces follow a parent-child hierarchy that shows the complete request flow:

HTTP Request Span (root)
└── Service Span (child)
    └── Database operations with db.system=postgresql

Note: Background sync operations are monitored through metrics (see the thv_reg_srv_sync_duration_seconds metric above) rather than distributed traces, as they are internal operations without incoming request context.

HTTP layer spans

All HTTP requests (except health and readiness endpoints) are traced with the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| http.request.method | string | HTTP method (GET, POST, etc.) |
| http.route | string | Route pattern (e.g., /v0.1/servers/{name}) |
| url.path | string | Actual URL path |
| user_agent.original | string | Client user agent (truncated to 256 chars) |
| http.response.status_code | int | Response status code |
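
The attribute set above can be pictured as a plain mapping. The Python sketch below is illustrative only: `http_span_attributes` is a hypothetical helper, and the sample path and user agent are made up. It mainly shows the 256-character truncation of the user agent.

```python
MAX_USER_AGENT_CHARS = 256  # truncation limit stated in the table


def http_span_attributes(method: str, route: str, path: str,
                         user_agent: str, status_code: int) -> dict:
    """Assemble the documented HTTP span attributes (hypothetical helper)."""
    return {
        "http.request.method": method,
        "http.route": route,        # route pattern, not the concrete path
        "url.path": path,           # concrete path of this request
        "user_agent.original": user_agent[:MAX_USER_AGENT_CHARS],
        "http.response.status_code": status_code,
    }


attrs = http_span_attributes(
    "GET",
    "/v0.1/servers/{name}",
    "/v0.1/servers/example-server",  # made-up server name
    "example-client/1.0 " + "x" * 400,  # deliberately oversized user agent
    200,
)
print(len(attrs["user_agent.original"]))
```

Keeping the route pattern separate from the concrete path lets a tracing backend group spans by endpoint without one group per server name.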

Service layer spans

Database service operations include these attributes:

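| Attribute | Type | Description |
| --- | --- | --- |
| registry.name | string | Name of the registry |
| server.name | string | Name of the server |
| server.version | string | Version of the server |
| pagination.limit | int | Page size limit |
| pagination.has_cursor | bool | Whether pagination cursor is used |
| result.count | int | Number of results returned |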
result.countintNumber of results returned

Context propagation

The Registry Server supports W3C Trace Context propagation. Incoming requests with traceparent headers have their trace context extracted and used as the parent for all child spans, enabling distributed tracing across multiple services.
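
For readers unfamiliar with the header, the Python sketch below parses a `traceparent` value in the W3C Trace Context version 00 layout (`version-traceid-parentid-flags`). It is a simplified illustration, not the server's implementation: `parse_traceparent` and `ParentContext` are hypothetical names, and only version 00 is accepted.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the parent context, or None if the header is unusable."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version != "00":  # simplification: only the version 00 layout is handled
        return None
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the specification
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

When parsing succeeds, new spans reuse the extracted trace ID and take the extracted span ID as their parent, which is how a request is stitched across services.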

Sampling strategies

Adjust sampling rates based on your environment and traffic volume:

| Environment | Sampling rate | Use case |
| --- | --- | --- |
| Development | 1.0 | Capture all traces for debugging |
| Staging | 0.1 | 10% sampling for testing |
| Production | 0.01 - 0.05 | 1-5% sampling to balance cost and visibility |

Tip: Start with a higher sampling rate and reduce it as you understand your traffic patterns. For high-traffic production environments, even 1% sampling usually provides enough data to identify issues.
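
A quick back-of-the-envelope estimate can help when choosing a rate. The Python sketch below assumes one trace per request and a steady request rate; the 200 requests per second figure is a made-up example, and `sampled_traces_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def sampled_traces_per_day(requests_per_second: float, sampling: float) -> int:
    """Estimate how many traces are kept per day at a given sampling ratio."""
    if not 0.0 <= sampling <= 1.0:
        raise ValueError("sampling must be between 0.0 and 1.0")
    return round(requests_per_second * SECONDS_PER_DAY * sampling)


for rate in (1.0, 0.1, 0.05, 0.01):
    print(rate, sampled_traces_per_day(200, rate))
```

At this hypothetical load, a ratio of 0.05 keeps about 864,000 traces per day and 0.01 keeps about 172,800.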

Excluded endpoints

The /health and /readiness endpoints are intentionally excluded from tracing.

These endpoints generate a high volume of nearly identical spans that provide minimal diagnostic value while significantly increasing storage costs. HTTP metrics still capture latency and error rates for these endpoints.
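
The exclusion amounts to a simple filter applied before a span is started. The Python sketch below assumes exact path matching, which the text above does not confirm; `should_trace` is a hypothetical helper.

```python
# Paths excluded from tracing, per the section above.
UNTRACED_PATHS = frozenset({"/health", "/readiness"})


def should_trace(path: str) -> bool:
    """Return False for probe endpoints (exact match is an assumption)."""
    return path not in UNTRACED_PATHS


print(should_trace("/health"))        # probe endpoint
print(should_trace("/v0.1/servers"))  # regular API request
```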

Next steps