Plan status should reflect errors from upgrade jobs
**Is your feature request related to a problem? Please describe.**
When one (or more) upgrade jobs fail, the Plan status does not reflect the failure in the upgrade process.
**Describe the solution you'd like**
The Plan status should indicate that a failure occurred in the upgrade process, at least by updating the `status.conditions.type` and `status.conditions.reason` fields. This would make it easier to track down a failure in the node upgrade process.
**Describe alternatives you've considered**
Of course, querying the status of the jobs themselves is a way to get this information, but since the Plan drives these jobs, surfacing the failure in its own status as well would be a nice addition.
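For reference, this is roughly what the workaround looks like today (assuming the controller runs in the `system-upgrade` namespace and labels its jobs with `upgrade.cattle.io/plan=<plan-name>`; adjust for your installation):

```shell
# List the jobs spawned for the "test" plan and their completion counters
kubectl -n system-upgrade get jobs -l upgrade.cattle.io/plan=test \
  -o custom-columns=NAME:.metadata.name,FAILED:.status.failed,SUCCEEDED:.status.succeeded
```

This works, but it requires knowing where the jobs live and correlating them back to the Plan by hand.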
**Additional context**
To reproduce, simply apply the following Plan, which forces the job to fail (also, set the `SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT` env var in the ConfigMap to a low value like 2 to avoid waiting forever):
```yaml
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: test
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
    - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  prepare:
    command:
    - sh
    - "-c"
    - "/bin/true"
    image: rancher/rke2-upgrade
  serviceAccountName: default
  cordon: true
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
    command:
    - sh
    - "-c"
    - "/bin/false"
  version: v1.25.5+rke2r2
```
After the retries, the job ends up in a failed state while the Plan status shows the following:
```yaml
status:
  applying:
  - my-agent-node
  conditions:
  - lastUpdateTime: "2023-01-20T10:46:31Z"
    reason: Version
    status: "True"
    type: LatestResolved
  latestHash: 1f7d06e28d958431705fcce82a14c95ac4cc4b695cfb0cdf3e088bfb
  latestVersion: v1.25.5-rke2r2
```
There is no clear indication of a failure whatsoever. Setting the `reason` and `type` fields (and maybe other custom fields, such as the name of the failing job) to reflect the failure would be nice.
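Something along these lines, for example (purely illustrative: the `Complete` condition type, the `JobFailed` reason, and the job name are made-up values, not an existing API):

```yaml
conditions:
- lastUpdateTime: "2023-01-20T10:50:00Z"
  type: Complete        # hypothetical condition type
  status: "False"
  reason: JobFailed     # hypothetical reason
  message: 'job "apply-test-on-my-agent-node" failed'  # hypothetical job name
```

This would also make the failure easy to spot with `kubectl get plan test -o yaml` or a `kubectl wait --for=condition=Complete` style check.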
Thanks