Can't upgrade RKE2: errors stating "error syncing" ...
Version Release 0.10.0 (https://github.com/rancher/system-upgrade-controller/releases/tag/v0.10.0)
Platform/Architecture RKE2 cluster
Describe the bug
I'm using the system-upgrade-controller to bump the RKE2 version from v1.23.10+rke2r1 to v1.24.10+rke2r1 (https://github.com/rancher/rke2/releases/tag/v1.24.10+rke2r1), following the RKE2 documentation, and everything works smoothly when using the official images. However, it doesn't work in a disconnected environment where we have no direct access to public repos. What we have instead is a proxy cache implemented with Harbor, which lets us pull images from the public repos. The Job that spins up does cordon the node, but then the Job gets terminated, probably by the controller.
If I check the logs from system-upgrade-controller, it constantly complains:
Error logs:
time="2023-02-23T17:01:29Z" level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:29Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-check11@79747, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:28Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:29Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-check11@79765, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:29Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-check11@79765, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:29Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-check11@79782, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:30Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:31Z" level=debug msg="DesiredSet - Delete batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:31Z" level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
So basically, it says level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing"
Jobs are constantly recreated, and so are their pods:
❯ kubectl get pods
NAME                                                           READY   STATUS        RESTARTS   AGE
apply-server-check11-on-node-with-8cb523f17d0363daa544-h6vc6   0/1     Terminating   0          3s
apply-server-check11-on-node-with-8cb523f17d0363daa544-zjk2k   0/1     Pending       0          1s
system-upgrade-controller-957888bb5-vvv28                      1/1     Running       0          39m
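The create/delete cycle can be read directly out of the controller logs above. The following is only a minimal sketch for summarising it: the three sample lines are copied from the log output in this report (with the trailing func/file fields trimmed), while the regular expression and the `job_events` helper are hypothetical names introduced for illustration, not part of system-upgrade-controller.

```python
import re

# Three controller log lines copied from this report (func/file fields trimmed).
LOG_LINES = [
    'time="2023-02-23T17:01:30Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11"',
    'time="2023-02-23T17:01:31Z" level=debug msg="DesiredSet - Delete batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11"',
    'time="2023-02-23T17:01:31Z" level=error msg="error syncing \'system-upgrade/server-check11\': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing"',
]

# Hypothetical pattern: timestamp, DesiredSet action, and the Job it refers to.
PATTERN = re.compile(
    r'time="(?P<time>[^"]+)".*?DesiredSet - (?P<action>Created|Delete|Replace Wait) '
    r'batch/v1, Kind=Job (?P<job>\S+)'
)


def job_events(lines):
    """Return (time, action, job) tuples for DesiredSet Job events found in the lines."""
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            events.append((match["time"], match["action"], match["job"]))
    return events


for time, action, job in job_events(LOG_LINES):
    print(time, action, job)
```

Run against the sample lines, this shows the same Job being created at 17:01:30 and deleted one second later, followed by the "Replace Wait" requeue, which matches the Terminating/Pending pods listed above.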
To Reproduce Our plan:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-check11
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
  # drain:
  #   force: true
  upgrade:
    image: our.private.registry/docker.io/rancher/rke2-upgrade
  version: v1.24.10+rke2r1
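One detail worth keeping in mind when checking what the proxy cache has to serve: the plan pins version v1.24.10+rke2r1, but the controller logs report LatestVersion:v1.24.10-rke2r1, and image tags cannot contain a "+". The sketch below assumes the image reference is built as image:tag with "+" replaced by "-"; the function name `upgrade_image_ref` is hypothetical and only illustrates that assumption.

```python
def upgrade_image_ref(image: str, version: str) -> str:
    """Build an image reference, assuming '+' in the version becomes '-' in the tag."""
    tag = version.replace("+", "-")  # assumption, consistent with LatestVersion in the logs
    return f"{image}:{tag}"


# Values taken from the plan in this report.
print(upgrade_image_ref(
    "our.private.registry/docker.io/rancher/rke2-upgrade",
    "v1.24.10+rke2r1",
))
# prints: our.private.registry/docker.io/rancher/rke2-upgrade:v1.24.10-rke2r1
```

Under that assumption, this is the reference that should be pullable through the Harbor proxy cache from the node being upgraded.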
Expected behavior It should work the same way it does with the official images, since the only change we make is pulling through the proxy cache.
Actual behavior The init container works, as it cordons the node:
❯ kubectl get node
NAME   STATUS                     ROLES                              AGE    VERSION
node   Ready,SchedulingDisabled   control-plane,etcd,master,worker   177m   v1.23.10+rke2r1
However, the main container is never even started. We don't even have time to check its logs, but our guess is that there might be an issue with SHAs, as system-upgrade-controller might perform a check internally based on SYSTEM_UPGRADE_PLAN_LATEST_HASH.
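As a small sanity check on the SHA guess, the LatestHash value from the logs can be compared against common digest lengths. This is only a sketch: the value is copied from the log output above, the `hex_length` helper is a hypothetical name, and matching on length says nothing about how the controller actually computes the hash or which inputs it uses.

```python
import hashlib

# LatestHash value copied from the controller logs in this report.
LATEST_HASH = "8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b"
# Hash fragment that appears in the Job and pod names in this report.
JOB_NAME_FRAGMENT = "8cb523f17d0363daa544"


def hex_length(algorithm: str) -> int:
    """Number of hex characters in a digest produced by the given hashlib algorithm."""
    return hashlib.new(algorithm).digest_size * 2


# Which of a few common algorithms produce a digest of the same length?
matching = [
    name for name in ("sha1", "sha224", "sha256", "sha512")
    if hex_length(name) == len(LATEST_HASH)
]

print(len(LATEST_HASH), matching)
print(LATEST_HASH.startswith(JOB_NAME_FRAGMENT))
```

The value is 56 hex characters long, which matches SHA-224 rather than the 64 characters of the sha256 digests normally used for container images, and the Job names embed its first 20 characters. That suggests LatestHash is not itself an image digest, although it does not rule out the image being one of its inputs.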