Proper handling of reboot scenarios: drain/cordon at most one node at a time
Is your feature request related to a problem? Please describe.
We need to patch and reboot the nodes in the cluster sequentially, ensuring that at most one master and one worker are "drained / cordoned / rebooted" in parallel. So we have to guarantee that at most one node is unavailable during the process.
Looking at the example https://github.com/rancher/system-upgrade-controller/blob/master/examples/ubuntu/bionic/linux-kernel-virtual-hwe-18.04.yaml (as well as others that reboot the node), the problem is that SUC starts draining the next node before the first one has rejoined the cluster.
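For reference, an abridged sketch of such a plan (reconstructed from memory, so details may differ from the linked file). Even with `concurrency: 1`, the controller only serializes the upgrade jobs; nothing makes it wait for the rebooted node to rejoin and report Ready:

```yaml
# Abridged sketch of a reboot plan like the linked example; values are
# illustrative, not copied from the real file.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: linux-kernel-virtual-hwe-18.04
  namespace: system-upgrade
spec:
  concurrency: 1            # at most one upgrade job runs at a time
  serviceAccountName: system-upgrade
  version: bionic
  nodeSelector:
    matchExpressions:
      - key: plan.upgrade.cattle.io/linux-kernel-virtual-hwe-18.04
        operator: Exists
  drain:
    force: true             # the node is drained/cordoned before the job runs
  upgrade:
    image: ubuntu:bionic
    command: ["chroot", "/host"]
    args: ["sh", "-c", "apt-get update && apt-get install -y linux-virtual-hwe-18.04 && reboot"]
```

As soon as this job is marked complete, the next node is drained, even though the previous node may still be rebooting.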
Describe the solution you'd like
We need an option in the drain / cordon process to wait until the last updated/rebooted node is back and healthy. This also needs to take into account that a node may still show "Ready" even though it has just been rebooted, because Kubernetes can take some time to notice that a node is down/unavailable.
It would be great if we could specify in the "drain" options (see the hypothetical sketch below): how long to wait before running the job on the next node once the previous one has completed, plus conditions such as "wait with the next drain until at least 90% of the nodes are Ready and not cordoned" and "no more than one node unavailable / cordoned".
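A purely hypothetical sketch of what this could look like in the Plan spec — none of these fields exist in SUC today; the names are invented only to illustrate the request:

```yaml
# Hypothetical drain options (invented field names, not part of the
# current Plan CRD) expressing the behavior requested above.
drain:
  force: true
  # Wait this long after the previous node's job completes before
  # starting the next one, bridging the window in which a rebooting
  # node still shows "Ready".
  settleTimeout: 5m
  # Block the next drain until at least this share of nodes is Ready
  # and not cordoned.
  minAvailable: "90%"
  # Never allow more than one node to be unavailable / cordoned.
  maxUnavailable: 1
```

Combined with separate plans per pool (one matching masters, one matching workers), this would give the "at most one master and one worker in parallel" behavior described above.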
Describe alternatives you've considered
Using kured instead of the system upgrade controller.