
Plan status should reflect errors from upgrade jobs

Is your feature request related to a problem? Please describe. When one (or more) upgrade jobs fail, the Plan status does not reflect the failure in the upgrade process.

Describe the solution you'd like The Plan status should indicate that a failure occurred in the upgrade process, at a minimum by updating the status.conditions.type and status.conditions.reason fields. This would make it easier to track down a failure in the node upgrade process.

Describe alternatives you've considered Of course, inspecting the status of the jobs themselves is one way to get this information, but given that the Plan drives these jobs, surfacing it in the Plan status as well would be a nice addition.
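
For reference, the failed job itself already exposes this in its status, so the information exists today; it is just not propagated to the Plan. A minimal sketch of the relevant Job status fields (the condition shape comes from the standard batch/v1 Job API; the values shown are illustrative):

status:
  conditions:
  - type: Failed
    status: "True"
    reason: BackoffLimitExceeded
    message: Job has reached the specified backoff limit
  failed: 3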

Additional context To reproduce, simply apply the following Plan, which forces the upgrade job to fail (also set the SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT env var in the controller's ConfigMap to a low value like 2 to avoid waiting forever; see the sketch after the Plan):

---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: test
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  prepare:
    command:
      - sh
      - "-c"
      - "/bin/true"
    image: rancher/rke2-upgrade
  serviceAccountName: default
  cordon: true
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
    command:
      - sh
      - "-c"
      - "/bin/false"
  version: v1.25.5+rke2r2
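
For completeness, a minimal sketch of the backoff override mentioned above, assuming the controller's default manifest (the ConfigMap name default-controller-env and the system-upgrade namespace are assumptions and may differ in your deployment):

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-controller-env # assumed name from the default manifest
  namespace: system-upgrade    # assumed namespace
data:
  SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT: "2"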

Once the retries are exhausted, the job ends up in a failed state, while the Plan status shows the following:

status:
  applying:
  - my-agent-node
  conditions:
  - lastUpdateTime: "2023-01-20T10:46:31Z"
    reason: Version
    status: "True"
    type: LatestResolved
  latestHash: 1f7d06e28d958431705fcce82a14c95ac4cc4b695cfb0cdf3e088bfb
  latestVersion: v1.25.5-rke2r2

There is no clear indication of a failure whatsoever. Setting the reason and type fields (and perhaps other custom fields, such as the failing job's name) to reflect the failure would be nice.
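
For illustration only, the added condition could look something like this (the type, reason, and message values here are hypothetical, not an existing API):

conditions:
- lastUpdateTime: "2023-01-20T10:48:02Z"
  type: Failed      # hypothetical condition type
  status: "True"
  reason: JobFailed # hypothetical reason
  message: upgrade job for node my-agent-node failed # hypothetical message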

Thanks