crashloopbackoff when using RRSet region based and another record already exists #3945

Closed
FernandoMiguel opened this issue Sep 20, 2023 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@FernandoMiguel

What happened:
We are performing a blue/green replacement of an EKS cluster.
Our workloads tend to use either weight-based or region-based AWS Route53 RRSets.
In one of our Green clusters, eDNS went into CrashLoopBackOff when a workload was deployed that contained a region-based RRSet that the Blue cluster had previously created.
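
The conflict described here can be illustrated with a small check. This is a minimal sketch, not external-dns code: it assumes the rule stated in the error log posted later in this thread (Route53 rejects a latency RRSet when one with the same name, type and region already exists). `find_latency_conflict` is a hypothetical helper, the dictionary keys follow the Route53 `ResourceRecordSet` field names, and the record values are illustrative.

```python
from typing import Dict, List, Optional


def find_latency_conflict(existing: List[Dict], desired: Dict) -> Optional[Dict]:
    """Return the existing record that would block `desired`, or None.

    Assumption (taken from the error text in this issue): only one latency
    record may exist per (Name, Type, Region), whatever its SetIdentifier.
    """
    for record in existing:
        same_slot = (
            record.get("Name") == desired.get("Name")
            and record.get("Type") == desired.get("Type")
            and record.get("Region") is not None
            and record.get("Region") == desired.get("Region")
        )
        # A record with the same SetIdentifier is the record itself, not a conflict.
        if same_slot and record.get("SetIdentifier") != desired.get("SetIdentifier"):
            return record
    return None


# Illustrative values only: one record already in the zone, one about to be created.
existing_record = {
    "Name": "helloworld.example.com.",
    "Type": "A",
    "Region": "us-east-1",
    "SetIdentifier": "internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com",
}
desired_record = {
    "Name": "helloworld.example.com.",
    "Type": "A",
    "Region": "us-east-1",
    "SetIdentifier": "internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com",
}

conflict = find_latency_conflict([existing_record], desired_record)
```

Here `conflict` is the existing record, because both records target the same name, type and region but carry different set identifiers.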

What you expected to happen:
For eDNS to fail to create the record, log the event, and continue to work.
Once the Blue cluster workload and its eDNS-owned records were removed, Green eDNS would create new records pointing to the Green LBs.
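
The expected behaviour can be sketched as a reconcile step that logs a rejected batch and moves on. This is an illustration only, under stated assumptions: `apply_batches`, the `submit` callable and the `ChangeBatchError` exception are all hypothetical names, and nothing here shows how external-dns is actually implemented.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("dns-reconciler")


class ChangeBatchError(Exception):
    """Hypothetical error raised when the DNS provider rejects a change batch."""


def apply_batches(batches: Iterable[list], submit: Callable[[list], None]) -> List[list]:
    """Submit each batch; log rejected batches and keep going.

    Returns the batches that failed so a later sync interval can retry them.
    """
    failed = []
    for batch in batches:
        try:
            submit(batch)
        except ChangeBatchError as err:
            # Log and continue instead of letting the error end the process.
            log.error("change batch rejected, will retry later: %s", err)
            failed.append(batch)
    return failed


# Demo with a fake provider that rejects one record name.
applied = []


def fake_submit(batch: list) -> None:
    if "helloworld.example.com." in batch:
        raise ChangeBatchError("latency RRSet already exists")
    applied.append(batch)


failed = apply_batches(
    [["helloworld.example.com."], ["other.example.com."]],
    fake_submit,
)
```

In this demo the rejected batch ends up in `failed` while the second batch is still applied, which is the "log the event and continue" behaviour asked for above.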

How to reproduce it (as minimally and precisely as possible):

kind: Deployment
apiVersion: apps/v1
metadata:
  name: external-dns-ue1-yw-cluster-domain
  namespace: sre-external-dns-system
  uid: 1f51cc24-aea1-44b2-9f91-8e2b3cebae7e
  resourceVersion: "11574637"
  generation: 1
  creationTimestamp: "2023-09-12T12:09:03Z"
  labels:
    app.kubernetes.io/instance: external-dns-ue1-yw
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cluster-domain
    app.kubernetes.io/version: 0.13.6
    argocd.argoproj.io/instance: external-dns-ue1-yw
    helm.sh/chart: cluster-domain-1.13.1
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: external-dns-ue1-yw
      app.kubernetes.io/name: cluster-domain
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: external-dns-ue1-yw
        app.kubernetes.io/name: cluster-domain
    spec:
      containers:
      - name: external-dns
        image: registry.k8s.io/external-dns/external-dns:v0.13.6
        args:
        - --log-level=debug
        - --log-format=text
        - --interval=1m
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --txt-owner-id=ue1-yw
        - --txt-prefix=_externaldns.
        - --provider=aws
        ports:
        - name: http
          containerPort: 7979
          protocol: TCP
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 128Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          initialDelaySeconds: 10
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 2
        readinessProbe:
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 6
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: IfNotPresent
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsUser: 65534
          runAsNonRoot: true
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
        kubernetes.io/arch: amd64
        kubernetes.io/os: linux
      serviceAccountName: edns
      serviceAccount: edns
      securityContext:
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      schedulerName: default-scheduler
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
        effect: NoSchedule
  strategy:
    type: Recreate
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 1
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  conditions:
  - type: Progressing
    status: "True"
    lastUpdateTime: "2023-09-12T12:09:16Z"
    lastTransitionTime: "2023-09-12T12:09:04Z"
    reason: NewReplicaSetAvailable
    message: ReplicaSet "external-dns-ue1-yw-cluster-domain-74dd4bb964" has successfully progressed.
  - type: Available
    status: "True"
    lastUpdateTime: "2023-09-19T14:35:18Z"
    lastTransitionTime: "2023-09-19T14:35:18Z"
    reason: MinimumReplicasAvailable
    message: Deployment has minimum availability.
---
kind: Ingress
apiVersion: networking.k8s.io/v1
metadata:
  name: helloworld
  namespace: helloworld
  uid: 40c95bf0-1a07-4b6a-a635-dde825e547a0
  resourceVersion: "11700042"
  generation: 1
  creationTimestamp: "2023-09-19T16:28:34Z"
  labels:
    app.kubernetes.io/instance: ue1-yw-helloworld
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: helloworld
    argocd.argoproj.io/instance: ue1-yw-helloworld
    helm.sh/chart: enverus-deployment-1.3.0
  annotations:
    external-dns.alpha.kubernetes.io/aws-region: "us-east-1"
    external-dns.alpha.kubernetes.io/set-identifier: internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com
    external-dns.alpha.kubernetes.io/target: internal-ue1-yw-int-lb-B.us-east-1.elb.amazonaws.com
    external-dns.alpha.kubernetes.io/ttl: "200"
    internal.lbc.kyverno.example.com/clusterDomain: "true"
    internal.lbc.kyverno.example.com/target: "true"
    policies.kyverno.io/last-applied-patches: |
      add-internal-ingress-cluster-domain.add-configmap-values-to-ingress-config-cp.kyverno.io: replaced
        /spec/rules/0/host
spec:
  ingressClassName: nginx-internal
  rules:
  - host: helloworld.int.enverus.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: http-helloworld
            port:
              number: 80
status:
  loadBalancer:
    ingress:
    - ip: 172.20.248.196

Anything else we need to know?:

Environment:

  • External-DNS version (use external-dns --version): 0.13.6
  • DNS provider: aws
  • Others: eks 1.27
@FernandoMiguel FernandoMiguel added the kind/bug Categorizes issue or PR as related to a bug. label Sep 20, 2023
@FernandoMiguel
Author

Error logs:

level=error msg="Failure in zone example.com. [Id: /hostedzone/ZXXXXXX] when submitting change batch: InvalidChangeBatch: 
[
RRSet with DNS name _externaldns.helloworld.example.com., type TXT, SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.,
RRSet with DNS name _externaldns.cname-helloworld.example.com., type TXT, SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.,
RRSet with DNS name helloworld.example.com., type A, SetIdentifier internal-ue1-yw-int-lb-B 5.us-east-1.elb.amazonaws.com, and Region Name=us-east-1 cannot be created because a latency RRSet with the same name, type and region already exists.
]
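
When triaging a failure like this, it can help to pull the conflicting records out of the message. The following is a rough sketch that assumes the message keeps the exact wording shown in the log above; `parse_conflicts` and `RRSET_PATTERN` are hypothetical names, and the sample text is copied from that log.

```python
import re
from typing import Dict, List

# Matches one "RRSet with DNS name ..." entry of the InvalidChangeBatch message.
RRSET_PATTERN = re.compile(
    r"RRSet with DNS name (?P<name>\S+), type (?P<type>\w+), "
    r"SetIdentifier (?P<set_identifier>.+?), and Region Name=(?P<region>[\w-]+) "
    r"cannot be created"
)


def parse_conflicts(message: str) -> List[Dict[str, str]]:
    """Return one dict per rejected RRSet found in the error message."""
    return [match.groupdict() for match in RRSET_PATTERN.finditer(message)]


sample = (
    "RRSet with DNS name _externaldns.helloworld.example.com., type TXT, "
    "SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, "
    "and Region Name=us-east-1 cannot be created because a latency RRSet "
    "with the same name, type and region already exists.,\n"
    "RRSet with DNS name _externaldns.cname-helloworld.example.com., type TXT, "
    "SetIdentifier internal-ue1-yw-int-lb-A.us-east-1.elb.amazonaws.com, "
    "and Region Name=us-east-1 cannot be created because a latency RRSet "
    "with the same name, type and region already exists."
)

conflicts = parse_conflicts(sample)
```

Each entry in `conflicts` carries the `name`, `type`, `set_identifier` and `region` of one rejected record, which can then be compared against what already exists in the hosted zone.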

@dmcdii

dmcdii commented Oct 3, 2023

We started seeing the same types of issues after upgrading to external-dns:v0.13.6. To work around it for now, we reverted to external-dns:v0.13.4.
This appears to be due to the change in #3009.

@dmitry-mightydevops

In my case, I had two services with the same hostname annotation:

➜ kgsvc teleport  -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: tele.aaa.net
...

➜ kgsvc teleport-auth -o yaml         
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: tele.aaa.net
    external-dns.alpha.kubernetes.io/ttl: "120"
...

which resulted in that error.
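
A quick way to spot this situation is to group Services by their hostname annotation. The sketch below is only an illustration: `duplicate_hostnames` is a hypothetical helper, it works on already-parsed Service objects (for example the `items` of `kubectl get svc -A -o json`), and it assumes the annotation value may hold a comma-separated list of hostnames.

```python
from collections import defaultdict
from typing import Dict, List

HOSTNAME_ANNOTATION = "external-dns.alpha.kubernetes.io/hostname"


def duplicate_hostnames(services: List[Dict]) -> Dict[str, List[str]]:
    """Map each hostname claimed by more than one Service to the claimants."""
    owners = defaultdict(list)
    for svc in services:
        meta = svc.get("metadata", {})
        value = (meta.get("annotations") or {}).get(HOSTNAME_ANNOTATION, "")
        # Prefix the namespace only when the object carries one.
        owner = "/".join(filter(None, [meta.get("namespace"), meta.get("name")]))
        # Assumption: the value may be a comma-separated list of hostnames.
        for host in filter(None, (h.strip() for h in value.split(","))):
            owners[host].append(owner)
    return {host: names for host, names in owners.items() if len(names) > 1}


# The two Services from the comment above, reduced to the fields that matter.
services = [
    {
        "metadata": {
            "name": "teleport",
            "annotations": {HOSTNAME_ANNOTATION: "tele.aaa.net"},
        }
    },
    {
        "metadata": {
            "name": "teleport-auth",
            "annotations": {
                HOSTNAME_ANNOTATION: "tele.aaa.net",
                "external-dns.alpha.kubernetes.io/ttl": "120",
            },
        }
    },
]

dupes = duplicate_hostnames(services)
```

With the two Services above, `dupes` reports `tele.aaa.net` as claimed by both `teleport` and `teleport-auth`.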

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@FernandoMiguel
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2024
@FernandoMiguel
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2024
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 26, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

