crashloopbackoff when using RRSet region based and another record already exists #3945

FernandoMiguel · 2023-09-20T10:18:12Z

What happened:
We are performing a blue/green replacement of an EKS cluster.
Our workloads tend to use either weighted or region-based AWS Route53 RRSets.
In one of our Green clusters, we saw eDNS go into CrashLoopBackOff when a workload was deployed that contained a region-based RRSet that the Blue cluster had previously created.

What you expected to happen:
For eDNS to fail to create the record, log the event, and continue to work.
Once the Blue cluster workload and its eDNS-owned records are removed, Green eDNS should create new records pointing to the Green LBs.
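
The expected behaviour above can be illustrated with a small simulation. This is only a sketch of the "log and keep going" pattern, not external-dns code: `FakeProvider`, `ChangeBatchError`, `reconcile_once`, and `run` are hypothetical names, and record ownership is reduced to a simple name-to-owner map.

```python
import logging
from dataclasses import dataclass, field

log = logging.getLogger("reconcile-sketch")


class ChangeBatchError(Exception):
    """Hypothetical stand-in for a provider rejecting a change batch."""


@dataclass
class FakeProvider:
    """Hypothetical provider: a record name can only be held by one owner."""
    records: dict = field(default_factory=dict)  # record name -> owner id

    def apply(self, owner, names):
        # A name already held by a different owner makes the whole batch fail.
        conflicts = [n for n in names if self.records.get(n, owner) != owner]
        if conflicts:
            raise ChangeBatchError(f"already exists: {conflicts}")
        for name in names:
            self.records[name] = owner


def reconcile_once(provider, owner, desired):
    """Apply the desired records; on failure, log and report instead of exiting."""
    try:
        provider.apply(owner, desired)
        return True
    except ChangeBatchError as err:
        log.error("change batch failed, retrying next interval: %s", err)
        return False


def run(provider, owner, desired, on_tick, ticks):
    """Run a fixed number of iterations; a failed one never stops the loop."""
    results = []
    for tick in range(ticks):
        on_tick(tick, provider)  # lets the caller simulate outside changes
        results.append(reconcile_once(provider, owner, desired))
    return results


if __name__ == "__main__":
    name = "helloworld.example.com."
    provider = FakeProvider(records={name: "blue"})

    def blue_cleanup(tick, prov):
        # Simulate Blue removing its record before the third iteration.
        if tick == 2:
            prov.records.pop(name, None)

    print(run(provider, "green", [name], blue_cleanup, ticks=4))
```

In this simulation Green fails while Blue still holds the record and succeeds on the first iteration after Blue's cleanup, without the process ever exiting.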

How to reproduce it (as minimally and precisely as possible):

kind: Deployment
apiVersion: apps/v1
metadata:
  name: external-dns-ue1-yw-cluster-domain
  namespace: sre-external-dns-system
  uid: 1f51cc24-aea1-44b2-9f91-8e2b3cebae7e
  resourceVersion: "11574637"
  generation: 1
  creationTimestamp: "2023-09-12T12:09:03Z"
  labels:
    app.kubernetes.io/instance: external-dns-ue1-yw
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cluster-domain
    app.kubernetes.io/version: 0.13.6
    argocd.argoproj.io/instance: external-dns-ue1-yw
    helm.sh/chart: cluster-domain-1.13.1
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: external-dns-ue1-yw
      app.kubernetes.io/name: cluster-domain
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: external-dns-ue1-yw
        app.kubernetes.io/name: cluster-domain
    spec:
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.13.6
        args:
        - --log-level=debug
        - --log-format=text
        - --interval=1m
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --txt-owner-id=ue1-yw
        - --txt-prefix=_externaldns.
        - --provider=aws
        ports:
        - name: http
          containerPort: 7979
          protocol: TCP
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 128Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 2
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 6
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: IfNotPresent
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsUser: 65534
          runAsNonRoot: true
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
        kubernetes.io/arch: amd64
        kubernetes.io/os: linux
      serviceAccountName: edns
      serviceAccount: edns
      securityContext:
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      schedulerName: default-scheduler
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
        effect: NoSchedule
  strategy:
    type: Recreate
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  conditions:
  - type: Progressing
    status: "True"
    lastUpdateTime: "2023-09-12T12:09:16Z"
    lastTransitionTime: "2023-09-12T12:09:04Z"
    reason: NewReplicaSetAvailable
    message: ReplicaSet "external-dns-ue1-yw-cluster-domain-74dd4bb964" has successfully progressed.
  - type: Available
    status: "True"
    lastUpdateTime: "2023-09-19T14:35:18Z"
    lastTransitionTime: "2023-09-19T14:35:18Z"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.

kind: Ingress
apiVersion: networking.k8s.io/v1
metadata:
  name: helloworld
  namespace: helloworld
  uid: 40c95bf0-1a07-4b6a-a635-dde825e547a0
  resourceVersion: "11700042"
  generation: 1
  creationTimestamp: "2023-09-19T16:28:34Z"
  labels:
    app.kubernetes.io/instance: ue1-yw-helloworld
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: helloworld
    argocd.argoproj.io/instance: ue1-yw-helloworld
    helm.sh/chart: enverus-deployment-1.3.0
  annotations:
    external-dns.alpha.kubernetes.io/aws-region: "us-east-1"
    external-dns.alpha.kubernetes.io/set-identifier: internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com
    external-dns.alpha.kubernetes.io/target: internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com
    external-dns.alpha.kubernetes.io/ttl: "200"
    internal.lbc.kyverno.example.com/clusterDomain: "true"
    internal.lbc.kyverno.example.com/target: "true"
    policies.kyverno.io/last-applied-patches: |
      add-internal-ingress-cluster-domain.add-configmap-values-to-ingress-config-cp.kyverno.io: replaced
        /spec/rules/0/host
spec:
  ingressClassName: nginx-internal
  rules:
  - host: helloworld.int.enverus.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: http-helloworld
            port:
              number: 80
status:
  loadBalancer:
    ingress:
    - ip: 172.20.248.196
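
The `aws-region` and `set-identifier` annotations above make this a region-based (latency) record. As far as I understand Route53, only one latency record may exist per name, type, and region, so a record from another cluster with a different set identifier blocks creation. A minimal pre-check sketch, with hypothetical names (`LatencyRecord`, `find_latency_conflict`) and set identifiers shortened for illustration:

```python
from typing import Iterable, NamedTuple, Optional


class LatencyRecord(NamedTuple):
    name: str
    rtype: str
    region: str
    set_identifier: str


def find_latency_conflict(
    existing: Iterable[LatencyRecord], desired: LatencyRecord
) -> Optional[LatencyRecord]:
    """Return the existing record that occupies the desired name/type/region slot.

    A record with the same set identifier is the desired record itself and is
    not treated as a conflict.
    """
    for rec in existing:
        same_slot = (rec.name, rec.rtype, rec.region) == (
            desired.name,
            desired.rtype,
            desired.region,
        )
        if same_slot and rec.set_identifier != desired.set_identifier:
            return rec
    return None


if __name__ == "__main__":
    blue = LatencyRecord("helloworld.example.com.", "A", "us-east-1", "lb-A")
    green = LatencyRecord("helloworld.example.com.", "A", "us-east-1", "lb-B")
    print(find_latency_conflict([blue], green))
```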

Anything else we need to know?:

Environment:

External-DNS version (use external-dns --version): 0.13.6
DNS provider: aws
Others: eks 1.27

FernandoMiguel · 2023-09-20T10:23:27Z

error logs

level=error msg="Failure in zone example.com. [Id: /hostedzone/ZXXXXXX] when submitting change batch: InvalidChangeBatch: 
[
RRSet with DNS name _externaldns.helloworld.example.com., type TXT, SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.,
RRSet with DNS name _externaldns.cname-helloworld.example.com., type TXT, SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.,
RRSet with DNS name helloworld.example.com., type A, SetIdentifier internal-ue1-yw-int-lb-B 5.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.
]
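
To see at a glance which records are blocking the batch, the message can be parsed. A rough sketch, assuming the wording shown in the log above (Route53 may phrase other failures differently); `Conflict` and `parse_conflicts` are hypothetical helper names:

```python
import re
from typing import List, NamedTuple


class Conflict(NamedTuple):
    name: str
    rtype: str
    set_identifier: str
    region: str


# Pattern derived from the log lines above; the set identifier is matched
# lazily because it may contain dots and spaces.
_PATTERN = re.compile(
    r"RRSet with DNS name (?P<name>[^,\s]+), type (?P<rtype>[A-Z]+), "
    r"SetIdentifier (?P<set_identifier>.+?), and Region Name=(?P<region>[a-z0-9-]+) "
    r"cannot be created"
)


def parse_conflicts(message: str) -> List[Conflict]:
    """Extract every rejected RRSet from an InvalidChangeBatch message."""
    return [Conflict(**m.groupdict()) for m in _PATTERN.finditer(message)]


if __name__ == "__main__":
    sample = (
        "InvalidChangeBatch: [RRSet with DNS name helloworld.example.com., "
        "type A, SetIdentifier internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com, "
        "and Region Name=us-east-1 cannot be created because a latency RRSet "
        "with the same name, type and region already exists.]"
    )
    for conflict in parse_conflicts(sample):
        print(conflict)
```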

dmcdii · 2023-10-03T13:12:28Z

We started seeing the same types of issues after upgrading to external-dns:v0.13.6. To work around it for now, we reverted to external-dns:v0.13.4.
It appears to be due to the change in #3009.

dmitry-mightydevops · 2023-10-05T00:59:57Z

In my case I had 2 services with the same hostname annotation:

➜ kgsvc teleport  -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: tele.aaa.net
...

➜ kgsvc teleport-auth -o yaml         
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: tele.aaa.net
    external-dns.alpha.kubernetes.io/ttl: "120"
...

which resulted in that error.
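
A quick way to spot this situation is to scan Service manifests for hostnames claimed more than once. The sketch below works on plain dicts, for example the `items` list from `kubectl get svc -A -o json`; `duplicate_hostnames` is a hypothetical helper, and splitting the annotation on commas is an assumption about how multiple hostnames are written.

```python
from collections import defaultdict

HOSTNAME_ANNOTATION = "external-dns.alpha.kubernetes.io/hostname"


def duplicate_hostnames(services):
    """Map each hostname claimed by more than one Service to the Service names."""
    owners = defaultdict(list)
    for svc in services:
        meta = svc.get("metadata") or {}
        value = (meta.get("annotations") or {}).get(HOSTNAME_ANNOTATION, "")
        for raw in value.split(","):
            # Normalise case and any trailing dot so equivalent names compare equal.
            host = raw.strip().rstrip(".").lower()
            if host:
                owners[host].append(meta.get("name", "<unnamed>"))
    return {host: names for host, names in owners.items() if len(names) > 1}


if __name__ == "__main__":
    services = [
        {"metadata": {"name": "teleport",
                      "annotations": {HOSTNAME_ANNOTATION: "tele.aaa.net"}}},
        {"metadata": {"name": "teleport-auth",
                      "annotations": {HOSTNAME_ANNOTATION: "tele.aaa.net"}}},
    ]
    print(duplicate_hostnames(services))
```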

k8s-triage-robot · 2024-01-29T18:12:48Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FernandoMiguel · 2024-01-29T18:38:18Z

/remove-lifecycle stale

k8s-triage-robot · 2024-04-28T18:55:50Z

/lifecycle stale

FernandoMiguel · 2024-04-29T05:44:04Z

/remove-lifecycle stale

k8s-triage-robot · 2024-07-28T06:10:36Z

/lifecycle stale

k8s-triage-robot · 2024-08-27T06:39:36Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-09-26T07:27:21Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-09-26T07:27:27Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

FernandoMiguel added the kind/bug label Sep 20, 2023

k8s-ci-robot added the lifecycle/stale label Jan 29, 2024

k8s-ci-robot removed the lifecycle/stale label Jan 29, 2024

k8s-ci-robot added the lifecycle/stale label Apr 28, 2024

k8s-ci-robot removed the lifecycle/stale label Apr 29, 2024

k8s-ci-robot added the lifecycle/stale label Jul 28, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 27, 2024

k8s-ci-robot closed this as not planned Sep 26, 2024
