
Can't upgrade RKE2 errors stating "error syncing" ...

Version Release 0.10.0 (https://github.com/rancher/system-upgrade-controller/releases/tag/v0.10.0)

Platform/Architecture RKE2 cluster

Describe the bug I'm using the system-upgrade-controller to upgrade RKE2 from v1.23.10+rke2r1 to v1.24.10+rke2r1 (https://github.com/rancher/rke2/releases/tag/v1.24.10+rke2r1), following the RKE2 documentation, and everything works smoothly with the official images. However, it does not work in a disconnected environment where we have no direct access to public registries. There we use a proxy cache, implemented with Harbor, to pull images from public registries. The Job that spins up cordons the node, but the Job is then terminated, probably by the controller.

If I check the logs from system-upgrade-controller, it is constantly complaining:

Error logs:

time="2023-02-23T17:01:29Z" level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:29Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-check11@79747, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:28Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:29Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-check11@79765, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:29Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="DesiredSet - Created batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="PLAN STATUS HANDLER: plan=system-upgrade/server-check11@79765, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:29Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:30Z" level=debug msg="PLAN GENERATING HANDLER: plan=system-upgrade/server-check11@79782, status={Conditions:[{Type:LatestResolved Status:True LastUpdateTime:2023-02-23T17:01:30Z LastTransitionTime: Reason:Version Message:}] LatestVersion:v1.24.10-rke2r1 LatestHash:8cb523f17d0363daa5446f1aa3363b6a220e0e050435b4d3d40e253b Applying:[node]}" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:31Z" level=debug msg="DesiredSet - Delete batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"
time="2023-02-23T17:01:31Z" level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing" func="github.com/sirupsen/logrus.(*Entry).Logf" file="/go/pkg/mod/github.com/sirupsen/logrus@v1.4.2/entry.go:314"

So basically, it says level=error msg="error syncing 'system-upgrade/server-check11': handler system-upgrade-controller: DesiredSet - Replace Wait batch/v1, Kind=Job system-upgrade/apply-server-check11-on-node-with-8cb523f17d0363daa544-1c8f9 for system-upgrade-controller system-upgrade/server-check11, requeuing"
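
To make the create/delete loop easier to see in a noisy log, the lines can be reduced to just the DesiredSet actions. This is a minimal sketch, not part of the controller: `desiredset_actions` is a hypothetical helper name, and it assumes the log lines keep the `time="..." ... DesiredSet - <action> batch/v1` shape shown above.

```shell
# Hypothetical helper: reduce controller log lines read from stdin to
# "<timestamp> <DesiredSet action>" pairs (e.g. Created, Delete, Replace Wait).
# Lines without a DesiredSet action on a batch/v1 object are dropped.
desiredset_actions() {
  sed -n 's/^time="\([^"]*\)".*DesiredSet - \([A-Za-z ]*\) batch\/v1.*/\1 \2/p'
}

# Example usage against a live cluster (assumes the controller runs as the
# Deployment "system-upgrade-controller" in the "system-upgrade" namespace):
#   kubectl -n system-upgrade logs deploy/system-upgrade-controller | desiredset_actions
```

Applied to the log excerpt above, this should show the same Job going through Replace Wait, Created, Delete and Replace Wait again within about two seconds.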

Jobs are continually recreated, and so are their pods:

❯ kubectl get pods
NAME                                                              READY   STATUS        RESTARTS   AGE
apply-server-check11-on-node-with-8cb523f17d0363daa544-h6vc6   0/1     Terminating   0          3s
apply-server-check11-on-node-with-8cb523f17d0363daa544-zjk2k   0/1     Pending       0          1s
system-upgrade-controller-957888bb5-vvv28                         1/1     Running       0          39m

To Reproduce Our plan:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-check11
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
       - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  serviceAccountName: system-upgrade
  cordon: true
#  drain:
#    force: true
  upgrade:
    image: our.private.registry/docker.io/rancher/rke2-upgrade
  version: v1.24.10+rke2r1
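
For reference, the image the upgrade Job would need to pull through the proxy cache can be derived from this plan. This is a minimal sketch under one assumption: that the tag is the plan version with `+` replaced by `-`, as the `LatestVersion:v1.24.10-rke2r1` value in the controller logs suggests (`+` is not a valid character in an image tag).

```shell
# Values copied from the Plan above.
image="our.private.registry/docker.io/rancher/rke2-upgrade"
version="v1.24.10+rke2r1"

# Assumption: the tag is the version with '+' replaced by '-',
# matching the LatestVersion reported in the controller logs.
tag="$(printf '%s' "$version" | tr '+' '-')"

# Full image reference the Job's main container would be expected to use.
image_ref="${image}:${tag}"
echo "$image_ref"
```

Pulling that reference manually on the node (for example with `crictl pull`) is one way to check whether the proxy cache serves this tag at all.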

Expected behavior It should work the same way it does with the official images, since the only change we make is pulling through the proxy cache.

Actual behavior The init container works, as it cordons the node:

❯ kubectl get node
NAME      STATUS                     ROLES                              AGE    VERSION
node   Ready,SchedulingDisabled   control-plane,etcd,master,worker   177m   v1.23.10+rke2r1

However, the main container is never even started. We don't have time to check its logs before the pod is terminated, but our guess is that there is an issue with SHAs: the system-upgrade-controller might perform a check internally based on SYSTEM_UPGRADE_PLAN_LATEST_HASH.