hive: add metrics for hive jobs (total, errors, duration)

As part of ongoing modularization, much of the old daemon initialization logic has been redesigned as Hive lifecycle hooks or Hive jobs (for the asynchronous parts). As a result, parts of the agent bootstrap metrics (cilium_agent_bootstrap_seconds) are gradually being lost.

Therefore, this commit adds metrics to hive jobs.

  • cilium_hive_jobs_runs_total (counter): Total number of runs
  • cilium_hive_jobs_runs_failed (counter): Number of failed runs (returned error)
  • cilium_hive_jobs_oneshot_last_run_duration_seconds (gauge): Duration in seconds of the last finished run of a oneshot job (whether it succeeded or returned an error)
  • cilium_hive_jobs_observer_last_run_duration_seconds (gauge): Duration of the last run of an observer job in seconds
  • cilium_hive_jobs_timer_last_run_duration_seconds (gauge): Duration of the last run of a timer job in seconds
  • cilium_hive_jobs_observer_run_duration_seconds (histogram): Duration of a run of an observer job in seconds
  • cilium_hive_jobs_timer_run_duration_seconds (histogram): Duration of a run of a timer job in seconds

IMO a histogram does not make much sense for oneshot jobs, even when retries are configured.

The metrics contain the labels module_id (hive cell) and job_name.

Example output:

root@kind-worker:/home/cilium# cilium-dbg shell metrics hive_jobs
Metric                                                Labels                                                                      Value
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=auth-gc-identity-events module_id=auth                             0.000010
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   0.000000
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=device-change-device-change-tracker module_id=bgp-control-plane    0.000062
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   0.000045
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=nat-map-next4 module_id=ct-nat-map-gc                              0.000008
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=nat-map-next6 module_id=ct-nat-map-gc                              0.000011
cilium_hive_jobs_observer_run_duration_seconds        job_name=auth-gc-identity-events module_id=auth                             250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds        job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds        job_name=device-change-device-change-tracker module_id=bgp-control-plane    250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds        job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds        job_name=nat-map-next4 module_id=ct-nat-map-gc                              250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds        job_name=nat-map-next6 module_id=ct-nat-map-gc                              250µs / 450µs / 495µs
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=certloader-server-tls module_id=hubble                             0.000886
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=cleanup module_id=maps-cleanup                                     8.345183
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=clustermesh-nodemanager-notifier module_id=clustermesh             0.000001
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=cni-deletion-queue module_id=endpoint-api                          4.245811
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=enable-gc module_id=ct-nat-map-gc                                  8.473397
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=endpoint-cleanup module_id=stale-endpoint-cleanup                  8.347281
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=hubble module_id=hubble                                            0.001395
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=ipset-init-finalizer module_id=ipset                               0.007239
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=legacy-start module_id=daemon                                      4.125838
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=per-endpoint-route-initializer module_id=loader                    8.514638
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=proxy-bootstrapper module_id=dns-proxy                             1.651428
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=proxy-ports-restore module_id=l7-proxy                             0.000208
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=release-local-identities module_id=identity-restoration            38.462400
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=start-reconciler module_id=loadbalancer-reconciler                 0.501552
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=unlock-lockfile module_id=endpoint-api                             4.119883
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=update-config-metric module_id=enabled-features                    0.000158
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog           8.347350
cilium_hive_jobs_oneshot_last_run_duration_seconds    job_name=wait-for-endpoint-restore module_id=namemanager                    8.520708
cilium_hive_jobs_runs_total                           job_name=auth-gc-identity-events module_id=auth                             4.000000
cilium_hive_jobs_runs_total                           job_name=certloader-server-tls module_id=hubble                             1.000000
cilium_hive_jobs_runs_total                           job_name=cleanup module_id=maps-cleanup                                     1.000000
cilium_hive_jobs_runs_total                           job_name=clustermesh-nodemanager-notifier module_id=clustermesh             1.000000
cilium_hive_jobs_runs_total                           job_name=cni-deletion-queue module_id=endpoint-api                          1.000000
cilium_hive_jobs_runs_total                           job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   42.000000
cilium_hive_jobs_runs_total                           job_name=device-change-device-change-tracker module_id=bgp-control-plane    8.000000
cilium_hive_jobs_runs_total                           job_name=enable-gc module_id=ct-nat-map-gc                                  1.000000
cilium_hive_jobs_runs_total                           job_name=endpoint-cleanup module_id=stale-endpoint-cleanup                  1.000000
cilium_hive_jobs_runs_total                           job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog                1.000000
cilium_hive_jobs_runs_total                           job_name=hubble module_id=hubble                                            1.000000
cilium_hive_jobs_runs_total                           job_name=ipset-init-finalizer module_id=ipset                               1.000000
cilium_hive_jobs_runs_total                           job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   1.000000
cilium_hive_jobs_runs_total                           job_name=legacy-start module_id=daemon                                      1.000000
cilium_hive_jobs_runs_total                           job_name=nat-map-next4 module_id=ct-nat-map-gc                              117.000000
cilium_hive_jobs_runs_total                           job_name=nat-map-next6 module_id=ct-nat-map-gc                              26.000000
cilium_hive_jobs_runs_total                           job_name=nat-stats module_id=nat-stats                                      2.000000
cilium_hive_jobs_runs_total                           job_name=per-endpoint-route-initializer module_id=loader                    1.000000
cilium_hive_jobs_runs_total                           job_name=pressure-metric-throttle module_id=bwmap                           1.000000
cilium_hive_jobs_runs_total                           job_name=proxy-bootstrapper module_id=dns-proxy                             1.000000
cilium_hive_jobs_runs_total                           job_name=proxy-ports-checkpoint module_id=l7-proxy                          1.000000
cilium_hive_jobs_runs_total                           job_name=proxy-ports-restore module_id=l7-proxy                             1.000000
cilium_hive_jobs_runs_total                           job_name=release-local-identities module_id=identity-restoration            1.000000
cilium_hive_jobs_runs_total                           job_name=start-reconciler module_id=loadbalancer-reconciler                 1.000000
cilium_hive_jobs_runs_total                           job_name=sync module_id=link-cache                                          2.000000
cilium_hive_jobs_runs_total                           job_name=sync-userspace-and-datapath module_id=utime                        1.000000
cilium_hive_jobs_runs_total                           job_name=unlock-lockfile module_id=endpoint-api                             1.000000
cilium_hive_jobs_runs_total                           job_name=update-config-metric module_id=enabled-features                    1.000000
cilium_hive_jobs_runs_total                           job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog           1.000000
cilium_hive_jobs_runs_total                           job_name=wait-for-endpoint-restore module_id=namemanager                    1.000000
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog                0.000570
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=nat-stats module_id=nat-stats                                      0.002096
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=pressure-metric-throttle module_id=bwmap                           0.000002
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=proxy-ports-checkpoint module_id=l7-proxy                          0.000721
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=sync module_id=link-cache                                          0.000352
cilium_hive_jobs_timer_last_run_duration_seconds      job_name=sync-userspace-and-datapath module_id=utime                        0.000201
cilium_hive_jobs_timer_run_duration_seconds           job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog                750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds           job_name=nat-stats module_id=nat-stats                                      1.75ms / 2.35ms / 2.485ms
cilium_hive_jobs_timer_run_duration_seconds           job_name=pressure-metric-throttle module_id=bwmap                           250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds           job_name=proxy-ports-checkpoint module_id=l7-proxy                          750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds           job_name=sync module_id=link-cache                                          250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds           job_name=sync-userspace-and-datapath module_id=utime                        250µs / 450µs / 495µs

Note: this doesn't cover jobs that are created from an injected job.Registry. Covering them would require decorating the registry, which currently isn't possible because the API uses unexported types.
