
Data Loss After Restarting All Leaders and Followers in Redis Cluster #1164

Open · captainpro-eng opened this issue Dec 12, 2024 · 11 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@captainpro-eng commented Dec 12, 2024

We are currently running a Redis cluster with the following versions:
Redis Operator Helm Chart: 18.0.5
Redis Operator Image: 18.0.1
Redis Image: 7.0.12

We have tested multiple Redis failover scenarios, and in most cases, the cluster state is marked as "OK" and data is preserved after restarts. However, we encountered a scenario where restarting all the masters and slaves results in the cluster state being "OK", but all data is lost. Below is a summary of the test cases:

Tested Scenarios:

  • Restarted 3 leaders only: after all leaders came back up, the cluster state became "OK" and all data was preserved.

  • Restarted all the leaders: after all leaders came back up, the cluster state became "OK" and all data was intact.

  • Restarted all the followers: after all followers came back up, the cluster state became "OK" and all data was intact.

  • Restarted 3 leaders and 3 followers: after all leaders and followers came back up, the cluster state became "OK" and all data was intact.

  • Restarted all the leaders and all the followers: after all masters and slaves came back up, the cluster state was "OK", but all data was lost (a reproduction sketch follows below).
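
For reference, a minimal reproduction sketch of the failing scenario, assuming the default names and labels the operator generates for a cluster named redis-cluster (pods redis-cluster-leader-* / redis-cluster-follower-*, label app=<name>) and the auth password in $REDIS_PASSWORD; adjust to your deployment:

```bash
# Seed a probe key through any leader (-c follows cluster redirects).
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" -c SET probe-key probe-value

# The failing scenario: delete every leader and follower pod at once.
kubectl delete pod -l app=redis-cluster-leader &
kubectl delete pod -l app=redis-cluster-follower &
wait

# Once all pods are Ready again, check the state and whether the key survived.
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" CLUSTER INFO | grep cluster_state
kubectl exec redis-cluster-leader-0 -- \
  redis-cli -a "$REDIS_PASSWORD" -c GET probe-key   # (nil) reproduces the data loss
```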

captainpro-eng added the bug (Something isn't working) label Dec 12, 2024
drivebyer added the help wanted (Extra attention is needed) label Dec 12, 2024
@captainpro-eng (Author)

@drivebyer have you faced this?

@drivebyer (Collaborator)

> @drivebyer have you faced this?

No. Did you use RDB or AOF?
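
For reference, the persistence mode can be read back from a running node; a minimal sketch, assuming a leader pod named redis-cluster-leader-0 and password auth:

```bash
# "appendonly yes" means AOF is in use; a non-empty "save" means RDB snapshots.
kubectl exec redis-cluster-leader-0 -- redis-cli -a "$REDIS_PASSWORD" CONFIG GET appendonly
kubectl exec redis-cluster-leader-0 -- redis-cli -a "$REDIS_PASSWORD" CONFIG GET save
```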

@captainpro-eng (Author) commented Dec 24, 2024 via email

@captainpro-eng (Author)

@drivebyer have you checked this?

@MuhammadQadora (Contributor)

@drivebyer @captainpro-eng Hi, I am facing a similar issue. I have set storageSpec.keepAfterDelete: true to persist the PVCs.
After a helm uninstall and reinstall, the data is lost. I can tell the data is lost when the sync happens between master and replica (before the cluster state changes to OK).
I am using the latest version of both the operator and the RedisCluster.
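
For reference, one way to confirm that the retained PVCs, and the files on them, actually survive the uninstall/reinstall cycle; a sketch, where the /data mount path is an assumption based on the operator's default image:

```bash
# The PVCs should still be listed (and Bound) after `helm uninstall`.
kubectl get pvc | grep redis-cluster

# After the reinstall, inspect the persisted RDB/AOF files on one node.
kubectl exec redis-cluster-leader-0 -- ls -l /data
```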

@captainpro-eng (Author) commented Feb 3, 2025 via email

@MuhammadQadora (Contributor)

Sure thing!

---
redisCluster:
  name: "redis-cluster"
  clusterSize: 3
  clusterVersion: v7
  persistenceEnabled: true
  image: op-test
  tag: latest
  imagePullPolicy: IfNotPresent
  imagePullSecrets: {}
    # - name:  Secret with Registry credentials
  redisSecret:
    secretName: "redis-password"
    secretKey: "password"
  resources:
    requests:
      cpu: 400m
      memory: 1Gi
    limits:
      cpu: 400m
      memory: 2Gi
  minReadySeconds: 0
  # -- Some fields of statefulset are immutable, such as volumeClaimTemplates.
  # When set to true, the operator will delete the statefulset and recreate it. Default is false.
  recreateStatefulSetOnUpdateInvalid: false
  leader:
    replicas: 3
    serviceType: ClusterIP
    affinity: {}
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #     - matchExpressions:
      #       - key: disktype
      #         operator: In
      #         values:
      #         - ssd
    tolerations: []
      # - key: "key"
      #   operator: "Equal"
      #   value: "value"
      #   effect: "NoSchedule"
    nodeSelector: null
      # memory: medium
    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1
    
  follower:
    replicas: 3
    serviceType: ClusterIP
    affinity: null
      # nodeAffinity:
      #   requiredDuringSchedulingIgnoredDuringExecution:
      #     nodeSelectorTerms:
      #     - matchExpressions:
      #       - key: disktype
      #         operator: In
      #         values:
      #         - ssd
    tolerations: []
      # - key: "key"
      #   operator: "Equal"
      #   value: "value"
      #   effect: "NoSchedule"
    nodeSelector: null
      # memory: medium
    securityContext: {}
    pdb:
      enabled: false
      maxUnavailable: 1
      minAvailable: 1

labels: {}
#   foo: bar
#   test: echo


externalConfig:
  enabled: true
  data: |
    loadmodule /FalkorDB/bin/src/falkordb.so

externalService:
  enabled: false
  # annotations:
  #   foo: bar
  serviceType: LoadBalancer
  port: 6379

serviceMonitor:
  enabled: false
  interval: 30s
  scrapeTimeout: 10s
  namespace: monitoring
  # -- extraLabels are added to the servicemonitor when enabled set to true
  extraLabels: {}
    # foo: bar
    # team: devops

redisExporter:
  enabled: false
  image: quay.io/opstree/redis-exporter
  tag: "v1.44.0"
  imagePullPolicy: IfNotPresent
  resources: {}
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
  env: []
    # - name: VAR_NAME
    #   value: "value1"

sidecars:
  name: ""
  image: ""
  imagePullPolicy: "IfNotPresent"
  resources:
    limits:
      cpu: "100m"
      memory: "128Mi"
    requests:
      cpu: "50m"
      memory: "64Mi"
  env: {}
    # - name: MY_ENV_VAR
    #   value: "my-env-var-value"

initContainer:
  enabled: false
  image: ""
  imagePullPolicy: "IfNotPresent"
  resources: {}
    # requests:
    #   memory: "64Mi"
    #   cpu: "250m"
    # limits:
    #   memory: "128Mi"
    #   cpu: "500m"
  env: []
  command: []
  args: []

priorityClassName: ""

storageSpec:
  keepAfterDelete: true
  volumeClaimTemplate:
    spec:
      # storageClassName: standard
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  nodeConfVolume: true
  nodeConfVolumeClaimTemplate:
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  #   selector: {}

podSecurityContext:
  runAsUser: 1000
  fsGroup: 1000


# serviceAccountName: redis-sa

TLS:
  ca: ca.crt
  cert: tls.crt
  key: tls.key
  secret:
    secretName: ""

acl:
  secret:
    secretName: ""

env: []
  # - name: VAR_NAME
  #   value: "value1"

serviceAccountName: ""

@captainpro-eng (Author) commented Feb 3, 2025 via email

@MuhammadQadora (Contributor)

@captainpro-eng I am not sure I understand what the announce option is.
I think v6 is too old for the newest operator version. I also get these warnings:
W0203 17:08:42.586251 68863 warnings.go:70] unknown field "spec.kubernetesConfig.serviceType"
W0203 17:08:42.586263 68863 warnings.go:70] unknown field "spec.redisFollower.serviceType"
W0203 17:08:42.586265 68863 warnings.go:70] unknown field "spec.redisLeader.serviceType"

@MuhammadQadora (Contributor)

@drivebyer I have an update on this: it seems the Redis operator issues a FLUSHALL in k8sutils/redis.go when the CLUSTER RESET command fails (when masters have keys and are not empty), and this also appends a FLUSHALL at the end of the appendonly.aof.1.incr.aof file.
#1069
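
For reference, one way to verify this from a persisted volume: the incremental AOF is plain RESP text, so an injected FLUSHALL is visible directly. A sketch, assuming the Redis 7 default appendonlydir layout under /data:

```bash
# Count FLUSHALL commands recorded in the incremental AOF.
kubectl exec redis-cluster-leader-0 -- \
  sh -c 'grep -c FLUSHALL /data/appendonlydir/*.incr.aof'

# Show the last commands written around the restart.
kubectl exec redis-cluster-leader-0 -- \
  sh -c 'tail -n 20 /data/appendonlydir/*.incr.aof'
```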

@captainpro-eng (Author) commented Mar 4, 2025

> @captainpro-eng I am not sure I understand what the announce option is. I think v6 is too old for the newest operator version. I also get these warnings:
> W0203 17:08:42.586251 68863 warnings.go:70] unknown field "spec.kubernetesConfig.serviceType"
> W0203 17:08:42.586263 68863 warnings.go:70] unknown field "spec.redisFollower.serviceType"
> W0203 17:08:42.586265 68863 warnings.go:70] unknown field "spec.redisLeader.serviceType"

Hi @MuhammadQadora,
I'm using cluster version v6 because in v7 a restarted pod doesn't rejoin the cluster: its IP changes, and the default entrypoint in v7 doesn't include the cluster-announce-ip and cluster-announce-hostname options a Redis pod needs to announce itself correctly to the cluster.

Since v7 does not add these options automatically, the pod fails to join the cluster after a restart because the Redis node never announces its new IP. To work around this, I have to use v6 and enable cluster announce so the pod can rejoin through the host.

I'm using the redis:v7.0.12 image, where the cluster-announce functionality is not included by default.

Here is the entrypoint.sh in the master branch:

start_redis() {
    if [[ "${SETUP_MODE}" == "cluster" ]]; then
        echo "Starting redis service in cluster mode....."
        if [[ "${NODEPORT}" == "true" ]]; then
            CLUSTER_ANNOUNCE_IP_VAR="HOST_IP"
            CLUSTER_ANNOUNCE_IP="${!CLUSTER_ANNOUNCE_IP_VAR}"
        else
            CLUSTER_ANNOUNCE_IP="${POD_IP}"
        fi
        
        if [[ "${REDIS_MAJOR_VERSION}" != "v7" ]]; then
          exec redis-server /etc/redis/redis.conf \
          --cluster-announce-ip "${CLUSTER_ANNOUNCE_IP}"
        else
          {
            echo cluster-announce-ip "${CLUSTER_ANNOUNCE_IP}"
            echo cluster-announce-hostname "${POD_HOSTNAME}"
          } >> /etc/redis/redis.conf

          exec redis-server /etc/redis/redis.conf
        fi

    else
        echo "Starting redis service in standalone mode....."
        exec redis-server /etc/redis/redis.conf
    fi
}

For Redis version 7.0.12, the entry point doesn't automatically add cluster-announce-ip and cluster-announce-hostname, which is why the Redis node fails to announce its IP to the cluster after a restart. Here's the entry point logic for v7.0.12:

start_redis() {
    if [[ "${SETUP_MODE}" == "cluster" ]]; then
        echo "Starting redis service in cluster mode....."
        if [[ "${REDIS_MAJOR_VERSION}" != "v7" ]]; then
          redis-server /etc/redis/redis.conf \
          --cluster-announce-ip "${POD_IP}" \
          --cluster-announce-hostname "${POD_HOSTNAME}"
        else
          redis-server /etc/redis/redis.conf
        fi
    else
        echo "Starting redis service in standalone mode....."
        redis-server /etc/redis/redis.conf
    fi
}

Here's the RedisCluster definition I'm using with Redis Operator version 0.18.5:

apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: tp-redis-cluster
spec:
  clusterSize: 3
  clusterVersion: v7
  kubernetesConfig:
    image: harbor.smartping.io/thirdparty/redis:v7.0.12
    imagePullPolicy: IfNotPresent
    service:
      serviceType: NodePort
    redisSecret:
      name: redis-secret
      key: password
