
Data Loss After Restarting All Leaders and Followers in Redis Cluster #1164

Open · captainpro-eng opened this issue Dec 12, 2024 · 11 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@captainpro-eng commented Dec 12, 2024

We are currently running a Redis cluster with the following versions:
Redis Operator Helm Chart: 18.0.5
Redis Operator Image: 18.0.1
Redis Image: 7.0.12

We have tested multiple Redis failover scenarios, and in most cases, the cluster state is marked as "OK" and data is preserved after restarts. However, we encountered a scenario where restarting all the masters and slaves results in the cluster state being "OK", but all data is lost. Below is a summary of the test cases:

Tested Scenarios:

  • Restarted 3 leaders only: after all leaders came back up, the cluster state became "OK" and all data was preserved.

  • Restarted all the leaders: after all leaders came back up, the cluster state became "OK" and all data was intact.

  • Restarted all the followers: after all followers came back up, the cluster state became "OK" and all data was intact.

  • Restarted 3 leaders and 3 followers: after all leaders and followers came back up, the cluster state became "OK" and all data was intact.

  • Restarted all the leaders and all the followers: after all masters and slaves came back up, the cluster state was "OK", but all data was lost (a reproduction sketch follows below).
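
For reference, a minimal reproduction sketch of the failing scenario, assuming the default names and labels the operator generates for a cluster named redis-cluster (pods redis-cluster-leader-* / redis-cluster-follower-*, label app=<name>) and the auth password in $REDIS_PASSWORD; adjust to your deployment:

```bash
# Seed a probe key through any leader (-c follows cluster redirects).
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" -c SET probe-key probe-value

# The failing scenario: delete every leader and follower pod at once.
kubectl delete pod -l app=redis-cluster-leader &
kubectl delete pod -l app=redis-cluster-follower &
wait

# Once all pods are Ready again, check the state and whether the key survived.
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" CLUSTER INFO | grep cluster_state
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" -c GET probe-key   # (nil) reproduces the data loss
```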

captainpro-eng added the bug (Something isn't working) label Dec 12, 2024
drivebyer added the help wanted (Extra attention is needed) label Dec 12, 2024
@captainpro-eng (Author)

@drivebyer have you faced this?

@drivebyer (Collaborator)

> @drivebyer have you faced this?

No. Did you use RDB or AOF?
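
For reference, the persistence mode can be read back from a running node; a minimal sketch, assuming a leader pod named redis-cluster-leader-0 and password auth:

```bash
# "appendonly yes" means AOF is in use; a non-empty "save" means RDB snapshots.
kubectl exec redis-cluster-leader-0 -- redis-cli -a "$REDIS_PASSWORD" CONFIG GET appendonly
kubectl exec redis-cluster-leader-0 -- redis-cli -a "$REDIS_PASSWORD" CONFIG GET save
```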

@captainpro-eng (Author) commented Dec 24, 2024 via email

@captainpro-eng (Author)

@drivebyer have you checked this?

@MuhammadQadora (Contributor)

@drivebyer @captainpro-eng Hi, I am facing a similar issue. I have set storageSpec.keepAfterDelete: true to persist the PVCs.
After a helm uninstall and reinstall, the data is lost. I can tell the data is lost when the sync happens between master and replica (before the cluster state changes to OK).
I am using the latest version of both the operator and the RedisCluster.
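
For reference, one way to confirm that the retained PVCs, and the files on them, actually survive the uninstall/reinstall cycle; a sketch, where the /data mount path is an assumption based on the operator's default image:

```bash
# The PVCs should still be listed (and Bound) after `helm uninstall`.
kubectl get pvc | grep redis-cluster

# After the reinstall, inspect the persisted RDB/AOF files on one node.
kubectl exec redis-cluster-leader-0 -- ls -l /data
```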

@captainpro-eng (Author) commented Feb 3, 2025 via email

@MuhammadQadora (Contributor)

Sure thing!

---
redisCluster:
  name: "redis-cluster"
  clusterSize: 3
  clusterVersion: v7
  persistenceEnabled: true
  image: op-test
  tag: latest
  imagePullPolicy: IfNotPresent
  imagePullSecrets: {}
    # - name:  Secret with Registry credentials
  redisSecret:
    secretName: "redis-password"
    secretKey: "password"
  resources:
    requests:
      cpu: 400m
      memory: 1Gi
    limits:
      cpu: 400m
      memory: 2Gi
  minReadySeconds: 0
  # -- Some fields of statefulset are immutable, such as volumeClaimTemplates.
  # When set to true, the operator will delete the statefulset and recreate it. Default is false.
  recreateStatefulSetOnUpdateInvalid: false
  leader:
    replicas: 3
    serviceType: ClusterIP
    affinity: {}
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #     - matchExpressions:
      #       - key: disktype
      #         operator: In
      #         values:
      #         - ssd
    tolerations: []
      # - key: "key"
      #   operator: "Equal"
      #   value: "value"
      #   effect: "NoSchedule"
    nodeSelector: null
      # memory: medium
    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1
    
  follower:
    replicas: 3
    serviceType: ClusterIP
    affinity: null
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #     - matchExpressions:
      #       - key: disktype
      #         operator: In
      #         values:
      #         - ssd
    tolerations: []
      # - key: "key"
      #   operator: "Equal"
      #   value: "value"
      #   effect: "NoSchedule"
    nodeSelector: null
      # memory: medium
    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1

labels: {}
#   foo: bar
#   test: echo


externalConfig:
  enabled: true
  data: |
    loadmodule /FalkorDB/bin/src/falkordb.so

externalService:
  enabled: false
  # annotations:
  #   foo: bar
  serviceType: LoadBalancer
  port: 6379

serviceMonitor:
  enabled: false
  interval: 30s
  scrapeTimeout: 10s
  namespace: monitoring
  # -- extraLabels are added to the servicemonitor when enabled set to true
  extraLabels: {}
    # foo: bar
    # team: devops

redisExporter:
  enabled: false
  image: quay.io/opstree/redis-exporter
  tag: "v1.44.0"
  imagePullPolicy: IfNotPresent
  resources: {}
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
  env: []
    # - name: VAR_NAME
    #   value: "value1"

sidecars:
  name: ""
  image: ""
  imagePullPolicy: "IfNotPresent"
  resources:
    limits:
      cpu: "100m"
      memory: "128Mi"
    requests:
      cpu: "50m"
      memory: "64Mi"
  env: {}
    # - name: MY_ENV_VAR
    #   value: "my-env-var-value"

initContainer:
  enabled: false
  image: ""
  imagePullPolicy: "IfNotPresent"
  resources: {}
    # requests:
    #   memory: "64Mi"
    #   cpu: "250m"
    # limits:
    #   memory: "128Mi"
    #   cpu: "500m"
  env: []
  command: []
  args: []

priorityClassName: ""

storageSpec:
  keepAfterDelete: true
  volumeClaimTemplate:
    spec:
      # storageClassName: standard
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  nodeConfVolume: true
  nodeConfVolumeClaimTemplate:
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  #   selector: {}

podSecurityContext:
  runAsUser: 1000
  fsGroup: 1000


# serviceAccountName: redis-sa

TLS:
  ca: ca.crt
  cert: tls.crt
  key: tls.key
  secret:
    secretName: ""

acl:
  secret:
    secretName: ""

env: []
  # - name: VAR_NAME
  #   value: "value1"

serviceAccountName: ""

@captainpro-eng (Author) commented Feb 3, 2025 via email

@MuhammadQadora (Contributor)

@captainpro-eng I am not sure I understand what the announce option is.
I think v6 is too old for the newest operator version. I also get these warnings:
W0203 17:08:42.586251 68863 warnings.go:70] unknown field "spec.kubernetesConfig.serviceType"
W0203 17:08:42.586263 68863 warnings.go:70] unknown field "spec.redisFollower.serviceType"
W0203 17:08:42.586265 68863 warnings.go:70] unknown field "spec.redisLeader.serviceType"

@MuhammadQadora (Contributor)

@drivebyer I have an update on this: it seems the Redis operator issues a FLUSHALL in k8sutils/redis.go when the CLUSTER RESET command fails (when masters have keys and are not empty), and this also appends a FLUSHALL at the end of the appendonly.aof.1.incr.aof file.
#1069
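
For reference, one way to verify this from a persisted volume: the incremental AOF is plain RESP text, so an injected FLUSHALL is visible directly. A sketch, assuming the Redis 7 default appendonlydir layout under /data:

```bash
# Count FLUSHALL commands recorded in the incremental AOF.
kubectl exec redis-cluster-leader-0 -- \
  sh -c 'grep -c FLUSHALL /data/appendonlydir/*.incr.aof'

# Show the last commands written around the restart.
kubectl exec redis-cluster-leader-0 -- \
  sh -c 'tail -n 20 /data/appendonlydir/*.incr.aof'
```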

@captainpro-eng (Author) commented Mar 4, 2025

> @captainpro-eng I am not sure I understand what the announce option is. I think v6 is too old for the newest operator version. I also get these warnings:
> W0203 17:08:42.586251 68863 warnings.go:70] unknown field "spec.kubernetesConfig.serviceType"
> W0203 17:08:42.586263 68863 warnings.go:70] unknown field "spec.redisFollower.serviceType"
> W0203 17:08:42.586265 68863 warnings.go:70] unknown field "spec.redisLeader.serviceType"

Hi @MuhammadQadora,
I'm using cluster version v6 because in v7 a restarted pod doesn't rejoin the cluster: its IP changes, and the default entrypoint in v7 doesn't include the cluster-announce-ip and cluster-announce-hostname options a Redis pod needs to announce itself correctly to the cluster.

Since v7 does not add these options automatically, the pod fails to join the cluster after a restart because the Redis node never announces its new IP. To work around this, I have to use v6 and enable cluster announce so the pod can rejoin through the host.

I'm using the redis:v7.0.12 image, where the cluster-announce functionality is not included by default.

Here is the entrypoint.sh in the master branch:

start_redis() {
    if [[ "${SETUP_MODE}" == "cluster" ]]; then
        echo "Starting redis service in cluster mode....."
        if [[ "${NODEPORT}" == "true" ]]; then
            CLUSTER_ANNOUNCE_IP_VAR="HOST_IP"
            CLUSTER_ANNOUNCE_IP="${!CLUSTER_ANNOUNCE_IP_VAR}"
        else
            CLUSTER_ANNOUNCE_IP="${POD_IP}"
        fi
        
        if [[ "${REDIS_MAJOR_VERSION}" != "v7" ]]; then
          exec redis-server /etc/redis/redis.conf \
          --cluster-announce-ip "${CLUSTER_ANNOUNCE_IP}"
        else
          {
            echo cluster-announce-ip "${CLUSTER_ANNOUNCE_IP}"
            echo cluster-announce-hostname "${POD_HOSTNAME}"
          } >> /etc/redis/redis.conf

          exec redis-server /etc/redis/redis.conf
        fi

    else
        echo "Starting redis service in standalone mode....."
        exec redis-server /etc/redis/redis.conf
    fi
}

For Redis version 7.0.12, the entry point doesn't automatically add cluster-announce-ip and cluster-announce-hostname, which is why the Redis node fails to announce its IP to the cluster after a restart. Here's the entry point logic for v7.0.12:

start_redis() {
    if [[ "${SETUP_MODE}" == "cluster" ]]; then
        echo "Starting redis service in cluster mode....."
        if [[ "${REDIS_MAJOR_VERSION}" != "v7" ]]; then
          redis-server /etc/redis/redis.conf \
          --cluster-announce-ip "${POD_IP}" \
          --cluster-announce-hostname "${POD_HOSTNAME}"
        else
          redis-server /etc/redis/redis.conf
        fi
    else
        echo "Starting redis service in standalone mode....."
        redis-server /etc/redis/redis.conf
    fi
}

Here's the RedisCluster definition I'm using with Redis Operator version 0.18.5:

apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: tp-redis-cluster
spec:
  clusterSize: 3
  clusterVersion: v7
  kubernetesConfig:
    image: harbor.smartping.io/thirdparty/redis:v7.0.12
    imagePullPolicy: IfNotPresent
    service:
      serviceType: NodePort
    redisSecret:
      name: redis-secret
      key: password
