
CFP: Configurable Jitter for IPsec Key Rotation

Cilium Feature Proposal

Is your proposed feature related to a problem?

Yes. In large production clusters (e.g., with hundreds of nodes) using Cilium with IPsec encryption enabled, weekly or periodic IPsec key rotations trigger a "thundering herd" effect. All Cilium agents detect the key file change (/etc/ipsec/keys) simultaneously via the key file watcher. The process looks like:

  • The agent's keyfileWatcher method detects changes to /etc/ipsec/keys

  • New keys are loaded by loadIPSecKeysFile

  • SetIPsecKeyIdentity() is then called, which gets the latest key ID

  • Then the agent calls NodeHandler's AllNodeValidateImplementation method

  • Which triggers a nodeUpdate for the local node, which sends events to nodeDiscovery

  • updateCiliumNodeResource() method updates the CiliumNode resource and logs:

      n.logger.Info(
      	"Creating or updating CiliumNode resource",
      	logfields.Node, nodeTypes.GetName(),
      )
  • This update to the CiliumNode resource is consumed by the K8sCiliumNodeWatcher and its onCiliumNodeUpdate() method is called, which in turn calls onCiliumNodeInsert, which finally calls the NodeUpdated() method for each node.

  • This is the method that logs the "Node updated" event for each node.

      func (m *manager) NodeUpdated(n nodeTypes.Node) {
      	m.logger.Info(
      		"Node updated",
      		logfields.ClusterName, n.Cluster,
      		logfields.NodeName, n.Name,
      		logfields.SPI, n.EncryptionKey,
      	)
      	// ...
      }

This results in a massive spike of "Node updated" events (amplified by the number of control plane Cilium pods), overwhelming the Kubernetes API server and causing significant CPU spikes on control plane nodes. The issue is exacerbated in scaled deployments, where the number of nodes amplifies the event volume, potentially leading to a degraded control plane during rotations if the control plane nodes are not scaled appropriately. While key rotations are necessary for security, the fact that every agent's watcher fires simultaneously makes rotations hard to manage in large-scale production environments.

Describe the feature you'd like?

Introduce a configurable jitter in the IPsec key file watcher to stagger key reloading across Cilium agents, reducing the peak load on the API server during rotations. The jitter should introduce a random delay (between 0 and a configurable maximum) after detecting a key file change, before proceeding with key loading and CiliumNode updates. This would spread out the "Node updated" events over time, mitigating CPU spikes and API overload in large clusters. Key requirements:

  • The maximum jitter duration should be configurable via Helm values or Cilium config (e.g., encryption.ipsec.jitterMaxDuration), with a sensible default (e.g., 50% of encryption.ipsec.keyRotationDuration to align with existing rotation windows).

  • Jitter should be per-agent (randomized independently) to ensure even distribution without requiring cluster-wide coordination.

  • Preserve existing behavior for small clusters or when jitter is disabled (e.g., via a flag like encryption.ipsec.enableJitter: false).

  • Ensure the jitter does not interfere with security guarantees, such as timely adoption of new keys within the keyRotationDuration window.

Brief description of proposed solution

The implementation could modify the keyfileWatcher function in pkg/datapath/linux/ipsec/ipsec_linux.go to insert a jitter-based sleep after detecting a file change (Create or Write event) but before loading the keys via loadIPSecKeysFile. This keeps the change localized to the watcher loop without altering the core key loading or node update logic. Example pseudocode for the addition:

func (a *Agent) keyfileWatcher(ctx context.Context, watcher *fswatcher.Watcher, keyfilePath string, nodeHandler types.NodeHandler, health cell.Health) error {
    for {
        select {
        case event := <-watcher.Events:
            if event.Op&(fswatcher.Create|fswatcher.Write) == 0 {
                continue
            }
            // Existing pre-jitter logic (e.g., validation)...

            // Introduce jitter: random duration between 0 and maxJitter.
            maxJitter := a.GetIPSecJitterMaxDuration() // fetched from config, e.g., 50% of keyRotationDuration
            if maxJitter > 0 { // rand.Int63n panics on n <= 0; zero also means "jitter disabled"
                jitter := time.Duration(rand.Int63n(int64(maxJitter)))
                select {
                case <-time.After(jitter): // interruptible wait instead of time.Sleep
                case <-ctx.Done():
                    return ctx.Err()
                }
            }

            // Existing post-jitter logic.
            _, spi, err := a.loadIPSecKeysFile(keyfilePath)
            if err != nil {
                // Handle error...
            }
            // Proceed with SetIPsecKeyIdentity(spi), node updates, etc.
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}
• Configuration: Add jitterMaxDuration to the IPsec config struct, defaulting to keyRotationDuration / 2. Expose it in Helm charts under encryption.ipsec.
• Rationale for jitter placement: Delaying after detection but before loading ensures agents still pick up the new key promptly, while spreading the downstream CiliumNode PATCH requests and "Node Updated" logs.
• Edge cases:
	○ If jitter exceeds keyRotationDuration, cap it or log a warning.
	○ Test in large-scale simulations to verify even distribution.
	○ Compatibility: Backwards-compatible with existing Cilium versions; no impact on non-IPsec setups.
• Alternatives considered: Disabling the key watcher requires manual restarts, which is disruptive in production. Spreading rotations at the secret level isn't feasible for shared symmetric keys.

This approach draws from existing jitter usage in Cilium to avoid introducing novel patterns. I'd be happy to prototype this or discuss further.

Notify relevant community channels

Notify the members of any relevant code owners below from the [teams] list in the following form: @cilium/sig-k8s @cilium/ipsec