hive: add metrics for hive jobs (total, errors, duration)
As part of ongoing modularization, much of the old daemon initialization logic
has been redesigned as Hive lifecycle hooks or Hive jobs (for the asynchronous parts). As a result, parts
of the agent bootstrap metrics (cilium_agent_bootstrap_seconds) are gradually
being lost.
Therefore, this commit adds metrics to hive jobs.
- cilium_hive_jobs_runs_total (counter): Total number of runs
- cilium_hive_jobs_runs_failed (counter): Number of failed runs (returned error)
- cilium_hive_jobs_oneshot_last_run_duration_seconds (gauge): Duration of the last finished run of a oneshot job in seconds (whether it succeeded or returned an error)
- cilium_hive_jobs_observer_last_run_duration_seconds (gauge): Duration of the last run of an observer job in seconds
- cilium_hive_jobs_timer_last_run_duration_seconds (gauge): Duration of the last run of a timer job in seconds
- cilium_hive_jobs_observer_run_duration_seconds (histogram): Duration of a run of an observer job in seconds
- cilium_hive_jobs_timer_run_duration_seconds (histogram): Duration of a run of a timer job in seconds
No histogram is added for oneshot jobs: IMO it would not make much sense, even if retries are configured.
The metrics contain the labels module_id (hive cell) and job_name.
Example:
root@kind-worker:/home/cilium# cilium-dbg shell metrics hive_jobs
Metric Labels Value
cilium_hive_jobs_observer_last_run_duration_seconds job_name=auth-gc-identity-events module_id=auth 0.000010
cilium_hive_jobs_observer_last_run_duration_seconds job_name=default-gateway-route-change-tracker module_id=bgp-control-plane 0.000000
cilium_hive_jobs_observer_last_run_duration_seconds job_name=device-change-device-change-tracker module_id=bgp-control-plane 0.000062
cilium_hive_jobs_observer_last_run_duration_seconds job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy 0.000045
cilium_hive_jobs_observer_last_run_duration_seconds job_name=nat-map-next4 module_id=ct-nat-map-gc 0.000008
cilium_hive_jobs_observer_last_run_duration_seconds job_name=nat-map-next6 module_id=ct-nat-map-gc 0.000011
cilium_hive_jobs_observer_run_duration_seconds job_name=auth-gc-identity-events module_id=auth 250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds job_name=default-gateway-route-change-tracker module_id=bgp-control-plane 250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds job_name=device-change-device-change-tracker module_id=bgp-control-plane 250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy 250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds job_name=nat-map-next4 module_id=ct-nat-map-gc 250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds job_name=nat-map-next6 module_id=ct-nat-map-gc 250µs / 450µs / 495µs
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=certloader-server-tls module_id=hubble 0.000886
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=cleanup module_id=maps-cleanup 8.345183
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=clustermesh-nodemanager-notifier module_id=clustermesh 0.000001
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=cni-deletion-queue module_id=endpoint-api 4.245811
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=enable-gc module_id=ct-nat-map-gc 8.473397
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=endpoint-cleanup module_id=stale-endpoint-cleanup 8.347281
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=hubble module_id=hubble 0.001395
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=ipset-init-finalizer module_id=ipset 0.007239
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=legacy-start module_id=daemon 4.125838
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=per-endpoint-route-initializer module_id=loader 8.514638
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=proxy-bootstrapper module_id=dns-proxy 1.651428
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=proxy-ports-restore module_id=l7-proxy 0.000208
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=release-local-identities module_id=identity-restoration 38.462400
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=start-reconciler module_id=loadbalancer-reconciler 0.501552
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=unlock-lockfile module_id=endpoint-api 4.119883
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=update-config-metric module_id=enabled-features 0.000158
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog 8.347350
cilium_hive_jobs_oneshot_last_run_duration_seconds job_name=wait-for-endpoint-restore module_id=namemanager 8.520708
cilium_hive_jobs_runs_total job_name=auth-gc-identity-events module_id=auth 4.000000
cilium_hive_jobs_runs_total job_name=certloader-server-tls module_id=hubble 1.000000
cilium_hive_jobs_runs_total job_name=cleanup module_id=maps-cleanup 1.000000
cilium_hive_jobs_runs_total job_name=clustermesh-nodemanager-notifier module_id=clustermesh 1.000000
cilium_hive_jobs_runs_total job_name=cni-deletion-queue module_id=endpoint-api 1.000000
cilium_hive_jobs_runs_total job_name=default-gateway-route-change-tracker module_id=bgp-control-plane 42.000000
cilium_hive_jobs_runs_total job_name=device-change-device-change-tracker module_id=bgp-control-plane 8.000000
cilium_hive_jobs_runs_total job_name=enable-gc module_id=ct-nat-map-gc 1.000000
cilium_hive_jobs_runs_total job_name=endpoint-cleanup module_id=stale-endpoint-cleanup 1.000000
cilium_hive_jobs_runs_total job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog 1.000000
cilium_hive_jobs_runs_total job_name=hubble module_id=hubble 1.000000
cilium_hive_jobs_runs_total job_name=ipset-init-finalizer module_id=ipset 1.000000
cilium_hive_jobs_runs_total job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy 1.000000
cilium_hive_jobs_runs_total job_name=legacy-start module_id=daemon 1.000000
cilium_hive_jobs_runs_total job_name=nat-map-next4 module_id=ct-nat-map-gc 117.000000
cilium_hive_jobs_runs_total job_name=nat-map-next6 module_id=ct-nat-map-gc 26.000000
cilium_hive_jobs_runs_total job_name=nat-stats module_id=nat-stats 2.000000
cilium_hive_jobs_runs_total job_name=per-endpoint-route-initializer module_id=loader 1.000000
cilium_hive_jobs_runs_total job_name=pressure-metric-throttle module_id=bwmap 1.000000
cilium_hive_jobs_runs_total job_name=proxy-bootstrapper module_id=dns-proxy 1.000000
cilium_hive_jobs_runs_total job_name=proxy-ports-checkpoint module_id=l7-proxy 1.000000
cilium_hive_jobs_runs_total job_name=proxy-ports-restore module_id=l7-proxy 1.000000
cilium_hive_jobs_runs_total job_name=release-local-identities module_id=identity-restoration 1.000000
cilium_hive_jobs_runs_total job_name=start-reconciler module_id=loadbalancer-reconciler 1.000000
cilium_hive_jobs_runs_total job_name=sync module_id=link-cache 2.000000
cilium_hive_jobs_runs_total job_name=sync-userspace-and-datapath module_id=utime 1.000000
cilium_hive_jobs_runs_total job_name=unlock-lockfile module_id=endpoint-api 1.000000
cilium_hive_jobs_runs_total job_name=update-config-metric module_id=enabled-features 1.000000
cilium_hive_jobs_runs_total job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog 1.000000
cilium_hive_jobs_runs_total job_name=wait-for-endpoint-restore module_id=namemanager 1.000000
cilium_hive_jobs_timer_last_run_duration_seconds job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog 0.000570
cilium_hive_jobs_timer_last_run_duration_seconds job_name=nat-stats module_id=nat-stats 0.002096
cilium_hive_jobs_timer_last_run_duration_seconds job_name=pressure-metric-throttle module_id=bwmap 0.000002
cilium_hive_jobs_timer_last_run_duration_seconds job_name=proxy-ports-checkpoint module_id=l7-proxy 0.000721
cilium_hive_jobs_timer_last_run_duration_seconds job_name=sync module_id=link-cache 0.000352
cilium_hive_jobs_timer_last_run_duration_seconds job_name=sync-userspace-and-datapath module_id=utime 0.000201
cilium_hive_jobs_timer_run_duration_seconds job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog 750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds job_name=nat-stats module_id=nat-stats 1.75ms / 2.35ms / 2.485ms
cilium_hive_jobs_timer_run_duration_seconds job_name=pressure-metric-throttle module_id=bwmap 250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds job_name=proxy-ports-checkpoint module_id=l7-proxy 750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds job_name=sync module_id=link-cache 250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds job_name=sync-userspace-and-datapath module_id=utime 250µs / 450µs / 495µs
Note: this doesn't cover jobs that are created from an injected job.Registry. Covering them would require decorating the registry, which currently isn't possible because the API uses unexported types.