
After a failed master node in a cluster recovers, the role labels are inconsistent and the node is unable to rejoin the cluster #1234

Open
trynocoding opened this issue Feb 6, 2025 · 1 comment
Labels: bug (Something isn't working)


@trynocoding

Operator log:


{"level":"info","ts":"2025-02-06T16:48:11+08:00","msg":"Number of Redis nodes match desired","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"5869bd27-c41b-4e00-ba18-c1813e484303"}
{"level":"error","ts":"2025-02-06T16:49:55+08:00","msg":"Could not execute command","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"5869bd27-c41b-4e00-ba18-c1813e484303","Command":["redis-cli","--cluster","rebalance","redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379","--cluster-use-empty-masters"],"Output":">>> Performing Cluster Check (using node redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379)\n[OK] All nodes agree about slots configuration.\n>>> Check for open slots...\n>>> Check slots coverage...\n[ERR] Not all 16384 slots are covered by nodes.\n\n*** Please fix your cluster problems before rebalancing\n","error":"execute command with error: command terminated with exit code 1, stderr: ","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.executeCommand\n\t/home/workspace/redis-operator/pkg/k8sutils/redis.go:463\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.RebalanceRedisClusterEmptyMasters\n\t/home/workspace/redis-operator/pkg/k8sutils/cluster-scaling.go:164\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.CheckIfEmptyMasters\n\t/home/workspace/redis-operator/pkg/k8sutils/cluster-scaling.go:182\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/controllers/rediscluster.(*Reconciler).Reconcile\n\t/home/workspace/redis-operator/pkg/controllers/rediscluster/rediscluster_controller.go:243\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

{"level":"info","ts":"2025-02-06T16:55:38+08:00","msg":"Number of Redis nodes match desired","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"559a9a27-b19a-4753-ab63-452767e0af87"}
{"level":"error","ts":"2025-02-06T16:55:38+08:00","msg":"Could not execute command","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"559a9a27-b19a-4753-ab63-452767e0af87","Command":["redis-cli","--cluster","rebalance","redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379","--cluster-use-empty-masters"],"Output":">>> Performing Cluster Check (using node redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379)\n[OK] All nodes agree about slots configuration.\n>>> Check for open slots...\n>>> Check slots coverage...\n[ERR] Not all 16384 slots are covered by nodes.\n\n*** Please fix your cluster problems before rebalancing\n","error":"execute command with error: command terminated with exit code 1, stderr: ","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.executeCommand\n\t/home/workspace/redis-operator/pkg/k8sutils/redis.go:463\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.RebalanceRedisClusterEmptyMasters\n\t/home/workspace/redis-operator/pkg/k8sutils/cluster-scaling.go:164\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.CheckIfEmptyMasters\n\t/home/workspace/redis-operator/pkg/k8sutils/cluster-scaling.go:182\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/controllers/rediscluster.(*Reconciler).Reconcile\n\t/home/workspace/redis-operator/pkg/controllers/rediscluster/rediscluster_controller.go:243\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

redis-operator version: v0.19.0

What operating system and processor architecture are you using (kubectl version)?

[root@master ~]# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.7", GitCommit:"07a61d861519c45ef5c89bc22dda289328f29343", GitTreeState:"clean", BuildDate:"2023-10-18T11:42:32Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.7", GitCommit:"07a61d861519c45ef5c89bc22dda289328f29343", GitTreeState:"clean", BuildDate:"2023-10-18T11:33:23Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
[root@master ~]# 

What did you do?

[root@master redis]# cat cluster.yaml 
---
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: redis-cluster
spec:
  clusterSize: 3
  clusterVersion: v7
  podSecurityContext:
    runAsUser: 1000
    fsGroup: 1000
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.2.7
    imagePullPolicy: IfNotPresent
  persistenceEnabled: true
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: local-path
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

  1. kubectl apply -f cluster.yaml
  2. Wait for the cluster to be created and working.
  3. Run watch -n1 'kubectl delete po redis-cluster-leader-1 --force' and wait for redis-cluster-follower-1 to be promoted to master.
  4. Stop step 3. (A scripted form of these steps is sketched below.)
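The reproduction can be scripted roughly as follows (a sketch, assuming the manifest above is saved as cluster.yaml and everything runs in the default namespace):

```bash
# 1. Create the cluster and wait for all pods to become Ready.
kubectl apply -f cluster.yaml
kubectl wait --for=condition=Ready pod -l app=redis-cluster-leader --timeout=300s
kubectl wait --for=condition=Ready pod -l app=redis-cluster-follower --timeout=300s

# 2-3. Repeatedly force-delete the master pod until redis-cluster-follower-1
#      is promoted to master, then interrupt with Ctrl-C (step 4).
watch -n1 'kubectl delete po redis-cluster-leader-1 --force'
```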

What did you expect to see?

  1. While redis-cluster-leader-1 is down, the pod labels correctly reflect each node's master/replica role
  2. After redis-cluster-leader-1 recovers from the failure, the pod labels correctly reflect each node's master/replica role
  3. After redis-cluster-leader-1 recovers from the failure, it rejoins the cluster and works properly

What did you see instead?

  1. While redis-cluster-leader-1 is down, the role label on redis-cluster-follower-1 is still follower, but its actual role in the cluster is master:
[root@master ~]# kubectl get po -owide --show-labels
NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE      NOMINATED NODE   READINESS GATES   LABELS
redis-cluster-follower-0             1/1     Running   0          4m29s   10.0.2.76    worker2   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-0
redis-cluster-follower-1             1/1     Running   0          4m23s   10.0.1.60    worker1   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-1
redis-cluster-follower-2             1/1     Running   0          4m18s   10.0.2.207   worker2   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-2
redis-cluster-leader-0               1/1     Running   0          4m45s   10.0.2.159   worker2   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-0
redis-cluster-leader-1               0/1     Pending   0          1s      <none>       worker1   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-1
redis-cluster-leader-2               1/1     Running   0          4m34s   10.0.1.241   worker1   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-2

[root@master ~]# kubectl exec -it redis-cluster-leader-0 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
redis-cluster-leader-0:/data$ redis-cli -p 6379
127.0.0.1:6379> cluster nodes
198ede5448cc947d334af6c9abd3ae6526109e00 10.0.2.159:6379@16379,redis-cluster-leader-0 myself,master - 0 1738833249000 1 connected 0-5460
d1cccef8544f831d6c15251a79de3322ed33d8b5 10.0.1.241:6379@16379,redis-cluster-leader-2 master - 0 1738833250000 3 connected 10923-16383
d0a59648381396fcc85a34dc100ab514d0fb07dd 10.0.2.207:6379@16379,redis-cluster-follower-2 slave d1cccef8544f831d6c15251a79de3322ed33d8b5 0 1738833250146 3 connected
fd9c98638e8181f8383ae24bf9afa17ebee14d20 10.0.1.60:6379@16379,redis-cluster-follower-1 master - 0 1738833249042 4 connected 5461-10922
d8a87a96d4f47b84567a1bdfb86a8d3dbb5e1f0f 10.0.1.2:6379@16379,redis-cluster-leader-1 master,fail - 1738833210998 1738833208586 2 connected
35a8faa7324a392faef8432e4bfe3eb15f886bdd 10.0.2.76:6379@16379,redis-cluster-follower-0 slave 198ede5448cc947d334af6c9abd3ae6526109e00 0 1738833251148 1 connected
127.0.0.1:6379> 
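The mismatch can be confirmed by comparing the pod label with what Redis itself reports (a small check, not part of the original commands above):

```bash
# Label managed by the operator still says "follower" ...
kubectl get pod redis-cluster-follower-1 -o jsonpath='{.metadata.labels.role}{"\n"}'
# ... while Redis reports that the node is now a master.
kubectl exec redis-cluster-follower-1 -- redis-cli role
```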
  2. After redis-cluster-leader-1 recovers, the role label on redis-cluster-follower-1 is still follower, even though its actual role is master:
[root@master ~]# kubectl get po -owide --show-labels
NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE      NOMINATED NODE   READINESS GATES   LABELS
redis-cluster-follower-0             1/1     Running   0          7m16s   10.0.2.76    worker2   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-0
redis-cluster-follower-1             1/1     Running   0          7m10s   10.0.1.60    worker1   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-1
redis-cluster-follower-2             1/1     Running   0          7m5s    10.0.2.207   worker2   <none>           <none>            app=redis-cluster-follower,controller-revision-hash=redis-cluster-follower-555d687c77,redis_setup_type=cluster,role=follower,statefulset.kubernetes.io/pod-name=redis-cluster-follower-2
redis-cluster-leader-0               1/1     Running   0          7m32s   10.0.2.159   worker2   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-0
redis-cluster-leader-1               1/1     Running   0          32s     10.0.1.50    worker1   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-1
redis-cluster-leader-2               1/1     Running   0          7m21s   10.0.1.241   worker1   <none>           <none>            app=redis-cluster-leader,controller-revision-hash=redis-cluster-leader-749c78499c,redis_setup_type=cluster,role=leader,statefulset.kubernetes.io/pod-name=redis-cluster-leader-2

  3. The recovered redis-cluster-leader-1 is orphaned and is not added back to the cluster:
[root@master ~]# kubectl exec -it redis-cluster-leader-1 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
redis-cluster-leader-1:/data$ redis-cli -p 6379
127.0.0.1:6379> cluster info
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:1
cluster_size:0
cluster_current_epoch:0
cluster_my_epoch:0
cluster_stats_messages_sent:0
cluster_stats_messages_received:0
total_cluster_links_buffer_limit_exceeded:0
127.0.0.1:6379> cluster nodes
8bd3699418392435250c7084eca34228fb6f798c 10.0.1.50:6379@16379,redis-cluster-leader-1 myself,master - 0 0 0 connected
127.0.0.1:6379> 

[root@master ~]# kubectl exec -it redis-cluster-leader-0 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
redis-cluster-leader-0:/data$ redis-cli -p 6379
127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:4
cluster_my_epoch:1
cluster_stats_messages_ping_sent:1052
cluster_stats_messages_pong_sent:1078
cluster_stats_messages_auth-ack_sent:1
cluster_stats_messages_sent:2131
cluster_stats_messages_ping_received:1075
cluster_stats_messages_pong_received:1052
cluster_stats_messages_meet_received:3
cluster_stats_messages_fail_received:1
cluster_stats_messages_auth-req_received:1
cluster_stats_messages_received:2132
total_cluster_links_buffer_limit_exceeded:0
127.0.0.1:6379> cluster nodes
198ede5448cc947d334af6c9abd3ae6526109e00 10.0.2.159:6379@16379,redis-cluster-leader-0 myself,master - 0 1738833678000 1 connected 0-5460
d1cccef8544f831d6c15251a79de3322ed33d8b5 10.0.1.241:6379@16379,redis-cluster-leader-2 master - 0 1738833679584 3 connected 10923-16383
d0a59648381396fcc85a34dc100ab514d0fb07dd 10.0.2.207:6379@16379,redis-cluster-follower-2 slave d1cccef8544f831d6c15251a79de3322ed33d8b5 0 1738833680787 3 connected
fd9c98638e8181f8383ae24bf9afa17ebee14d20 10.0.1.60:6379@16379,redis-cluster-follower-1 master - 0 1738833679000 4 connected 5461-10922
d8a87a96d4f47b84567a1bdfb86a8d3dbb5e1f0f 10.0.1.2:6379@16379,redis-cluster-leader-1 master,fail - 1738833210998 1738833208586 2 connected
35a8faa7324a392faef8432e4bfe3eb15f886bdd 10.0.2.76:6379@16379,redis-cluster-follower-0 slave 198ede5448cc947d334af6c9abd3ae6526109e00 0 1738833679784 1 connected
127.0.0.1:6379>
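From this state the cluster can be repaired by hand, which the operator currently does not do (a sketch of a manual workaround, assuming the node ID and pod IP shown above):

```bash
# Drop the stale, failed entry for the old redis-cluster-leader-1 instance.
# CLUSTER FORGET only affects the node that receives it, so it has to be sent
# to every remaining member of the cluster.
kubectl exec redis-cluster-leader-0 -- redis-cli CLUSTER FORGET d8a87a96d4f47b84567a1bdfb86a8d3dbb5e1f0f

# Introduce the recovered (now empty) redis-cluster-leader-1 to the cluster.
kubectl exec redis-cluster-leader-0 -- redis-cli CLUSTER MEET 10.0.1.50 6379

# Slots then still have to be moved onto the empty master, e.g. with the
# `redis-cli --cluster rebalance ... --cluster-use-empty-masters` command the operator attempts.
```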

Additional
Reproducing the same steps with redis-cluster-leader-0 instead of redis-cluster-leader-1 produces a different error:

  1. Operator log:
{"level":"info","ts":"2025-02-06T17:28:43+08:00","msg":"Creating redis cluster by executing cluster creation commands","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"3d61dafc-4653-4ea5-813c-f25bb10acf85"}
{"level":"info","ts":"2025-02-06T17:28:43+08:00","msg":"Not all leader are part of the cluster...","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"3d61dafc-4653-4ea5-813c-f25bb10acf85","Leaders.Count":1,"Instance.Size":3}
{"level":"info","ts":"2025-02-06T17:28:46+08:00","msg":"Creating redis cluster by executing cluster creation commands","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"3b1df7f2-7142-4e4f-9a65-77acc6834260"}
{"level":"info","ts":"2025-02-06T17:28:46+08:00","msg":"All leader are part of the cluster, adding follower/replicas","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"3b1df7f2-7142-4e4f-9a65-77acc6834260","Leaders.Count":3,"Instance.Size":3,"Follower.Replicas":3}
{"level":"info","ts":"2025-02-06T17:30:46+08:00","msg":"Creating redis cluster by executing cluster creation commands","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"304372c7-07a8-4712-8ed3-9c4567e12efc"}
{"level":"info","ts":"2025-02-06T17:30:46+08:00","msg":"Not all leader are part of the cluster...","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"304372c7-07a8-4712-8ed3-9c4567e12efc","Leaders.Count":1,"Instance.Size":3}
{"level":"error","ts":"2025-02-06T17:30:46+08:00","msg":"Could not execute command","controller":"rediscluster","controllerGroup":"redis.redis.opstreelabs.in","controllerKind":"RedisCluster","RedisCluster":{"name":"redis-cluster","namespace":"default"},"namespace":"default","name":"redis-cluster","reconcileID":"304372c7-07a8-4712-8ed3-9c4567e12efc","Command":["redis-cli","--cluster","create","redis-cluster-leader-0.redis-cluster-leader-headless.default.svc:6379","redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379","redis-cluster-leader-2.redis-cluster-leader-headless.default.svc:6379","--cluster-yes"],"Output":"[ERR] Node redis-cluster-leader-1.redis-cluster-leader-headless.default.svc:6379 is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some key in database 0.\n","error":"execute command with error: command terminated with exit code 1, stderr: ","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.executeCommand\n\t/home/workspace/redis-operator/pkg/k8sutils/redis.go:463\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/k8sutils.ExecuteRedisClusterCommand\n\t/home/workspace/redis-operator/pkg/k8sutils/redis.go:180\ngithub.com/OT-CONTAINER-KIT/redis-operator/pkg/controllers/rediscluster.(*Reconciler).Reconcile\n\t/home/workspace/redis-operator/pkg/controllers/rediscluster/rediscluster_controller.go:183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
[root@master ~]# kubectl exec -it redis-cluster-leader-0 bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
redis-cluster-leader-0:/data$ redis-cli -p 6379
127.0.0.1:6379> cluster nodes
fc676626a93b542eedfefc01b021308b36672c7a 10.0.2.227:6379@16379,redis-cluster-leader-0 myself,master - 0 0 0 connected
127.0.0.1:6379> 
  2. I have not verified what happens when a replica (slave) node is deleted.
@alv91

alv91 commented Feb 12, 2025

same here
