
TXT records created for aliases in AWS Route 53 have wrong record type prefix #2903

Open
seh opened this issue Jul 21, 2022 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@seh
Contributor

seh commented Jul 21, 2022

What happened:

Using the "aws" provider to create DNS records for hostnames that point at AWS ELBs (such as for endpoints extracted from a Kubernetes Service or Ingress), ExternalDNS decides that the endpoints warrant a record of type CNAME, since the target hostnames don't parse as IP addresses. Because the target hostname discovered from the Ingress's status sits within a canonical hosted zone, ExternalDNS further decides that the record should be an alias to the target ELB's DNS record. Later, when composing the changes to send to the Route 53 service, ExternalDNS changes its mind and decides to use an A record instead. At that point, it leaves the endpoint.Endpoint's "RecordType" field set to the original endpoint.RecordTypeCNAME ("CNAME").

That sets us up to create an A record for an endpoint.Endpoint that still represents a CNAME record. ExternalDNS then goes on to add the TXT ownership records to the change batch and consults the endpoint.Endpoint's "RecordType" field, finding it to be "CNAME." This yields a TXT record prefix of "cname-" even though it should probably be "a-" instead, if the goal is for the TXT records to indicate which of several primary records they describe.
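
To make the mismatch concrete, here is a minimal, self-contained sketch; it is illustrative only, not the actual ExternalDNS code, and the type and function names are invented for this example.

```go
// Illustrative sketch of the mismatch described above (not external-dns code):
// the TXT ownership name is derived from the endpoint's RecordType ("CNAME"),
// while the AWS provider ultimately writes an A alias record.
package main

import (
	"fmt"
	"strings"
)

// endpointRecord stands in for endpoint.Endpoint, greatly simplified.
type endpointRecord struct {
	DNSName    string
	RecordType string // stays "CNAME" even when the provider uses an A alias
	IsAlias    bool   // stands in for the ProviderSpecific "alias=true" attribute
}

// txtOwnershipName mimics the new-format registry naming: "<type>-<name>".
func txtOwnershipName(ep endpointRecord) string {
	return strings.ToLower(ep.RecordType) + "-" + ep.DNSName
}

// route53Type mimics the provider's late decision: aliases become A records.
func route53Type(ep endpointRecord) string {
	if ep.IsAlias {
		return "A"
	}
	return ep.RecordType
}

func main() {
	ep := endpointRecord{DNSName: "api.example.com", RecordType: "CNAME", IsAlias: true}
	fmt.Println(route53Type(ep))      // "A": what Route 53 actually gets
	fmt.Println(txtOwnershipName(ep)) // "cname-api.example.com": how the registry names the TXT record
}
```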

What you expected to happen:

ExternalDNS should create a TXT record with a prefix matching the primary record type that the TXT record describes. In this case, since the primary record type created in Route 53 turns out to be A, I expect the TXT record's prefix to be "a-" instead of "cname-."

How to reproduce it (as minimally and precisely as possible):

In a Kubernetes cluster running within AWS EC2, create a Service of type "LoadBalancer," and allow ExternalDNS to discover the endpoint and its target by using either the "service" or "ingress" source.

Inspect the Route 53 service to see that ExternalDNS creates a primary record of type A, as an alias to the target AWS-hosted load balancer. Note too that ExternalDNS creates a TXT record with a prefix of "cname-" instead of "a-."

Anything else we need to know?:

To align the record type between these primary and TXT records, we need to make the TXT registry portion of ExternalDNS aware of the late decision that the AWS provider makes to use an A record instead. I am not sure whether other providers make similar overriding decisions when composing changes.
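
One loosely sketched direction, shown below, would be to let the registry ask the provider which record type it will actually create before composing the TXT ownership name. The interface and function names here are hypothetical and not part of ExternalDNS today; this is a sketch of the idea, not a proposed patch.

```go
// Hypothetical sketch only: neither aliasAwareProvider nor ownershipPrefix
// exists in external-dns; they illustrate letting the registry consult the
// provider's effective record type before naming the TXT ownership record.
package main

import (
	"fmt"
	"strings"
)

// aliasAwareProvider is a hypothetical interface a provider could implement.
type aliasAwareProvider interface {
	// EffectiveRecordType returns the type the provider will really create
	// for the given endpoint (e.g. "A" for an alias to an ELB).
	EffectiveRecordType(dnsName, declaredType string) string
}

// fakeAWS models the behaviour described above: alias CNAMEs become A records.
type fakeAWS struct{ aliases map[string]bool }

func (p fakeAWS) EffectiveRecordType(dnsName, declaredType string) string {
	if declaredType == "CNAME" && p.aliases[dnsName] {
		return "A"
	}
	return declaredType
}

// ownershipPrefix prefers the provider's answer when available, falling back
// to the endpoint's declared record type otherwise.
func ownershipPrefix(p interface{}, dnsName, declaredType string) string {
	if r, ok := p.(aliasAwareProvider); ok {
		return strings.ToLower(r.EffectiveRecordType(dnsName, declaredType)) + "-"
	}
	return strings.ToLower(declaredType) + "-"
}

func main() {
	p := fakeAWS{aliases: map[string]bool{"api.example.com": true}}
	fmt.Println(ownershipPrefix(p, "api.example.com", "CNAME")) // prints "a-"
}
```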

Environment:

  • External-DNS version: 0.12.0
  • DNS provider: aws (AWS Route 53)
  • Others: Source is Kubernetes Ingress
@seh seh added the kind/bug Categorizes issue or PR as related to a bug. label Jul 21, 2022
@doctornkz

Also facing this issue. Thank you @seh for the report.

@chonton

chonton commented Sep 1, 2022

What I see is two guard records being produced: one with the same name as the 'A' record and one with a 'cname-' prefix.

@seh
Contributor Author

seh commented Sep 1, 2022

That's odd. Does your "A" record's name happen to begin with "a-," inducing false aliasing?

@chonton

chonton commented Sep 1, 2022

Version
v0.12.2

Args

      containers:
      - args:
        - --log-level=info
        - --namespace=mis-feature
        - --publish-host-ip
        - --aws-batch-change-size=20
        - --domain-filter=mis.example.com
        - --interval=2m
        - --policy=upsert-only
        - --provider=aws
        - --source=ingress
        - --source=service
        - --registry=txt
        - --txt-owner-id=use-feature

Redacted Kubernetes Resources

---
apiVersion: v1
kind: Service
metadata:
  name: unified-theatre
  annotations:
    external-dns.alpha.kubernetes.io/alias: "true"
    external-dns.alpha.kubernetes.io/hostname: us.example.com
    external-dns.alpha.kubernetes.io/ingress-hostname-source: annotation-only
    external-dns.alpha.kubernetes.io/aws-weight: "255"
    external-dns.alpha.kubernetes.io/set-identifier: us-east-1
spec:
  type: ExternalName
  externalName: use.example.com
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: unified-region
  annotations:
    external-dns.alpha.kubernetes.io/alias: "true"
    external-dns.alpha.kubernetes.io/hostname: use.example.com
    external-dns.alpha.kubernetes.io/ingress-hostname-source: annotation-only

Redacted Route53 records

Record name                       | Type | Policy   | Weight | Value/Route traffic to
----------------------------------|------|----------|--------|-----------------------
cname-us-feature.mis.example.com  | TXT  | Weighted | 255    | "heritage=external-dns,external-dns/owner=use-feature,external-dns/resource=service/mis-feature/unified-theatre"
cname-use-feature.mis.example.com | TXT  | Simple   | -      | "heritage=external-dns,external-dns/owner=use-feature,external-dns/resource=ingress/mis-feature/unified-region"
use.feature.mis.example.com       | A    | Simple   | -      | 10.93.177.118
us-feature.mis.example.com        | A    | Weighted | 255    | use-feature.mis.example.com.
us-feature.mis.example.com        | TXT  | Weighted | 255    | "heritage=external-dns,external-dns/owner=use-feature,external-dns/resource=service/mis-feature/unified-theatre"
use-feature.mis.example.com       | A    | Simple   | -      | internal-k8s-misfeatu-unifiedr-201835383a-1018808261.us-east-1.elb.amazonaws.com.
use-feature.mis.example.com       | TXT  | Simple   | -      | "heritage=external-dns,external-dns/owner=use-feature,external-dns/resource=ingress/mis-feature/unified-region"

@jwilf

jwilf commented Sep 27, 2022

I'm also seeing this, and an additional problem is that when the k8s resource is deleted, the TXT record with prefix "cname-" is not deleted from route53. We have a zone with a large churn of resources and this resulted in reaching the limit on the number of records in the zone.

@nicocout

nicocout commented Oct 14, 2022

I have the same problem. TXT records with prefix "cname-" are not deleted and cause an issue when I try to recreate k8s resources.

@erikdeweerdt

We're seeing similar, but subtly different behavior: external-dns tries to delete cname- prefixed TXT records that were never created, failing the entire change batch and preventing all future updates until we manually intervene (by creating the record it wants to delete).

@dalvarezquiroga

+1 with the same problem in AWS. ExternalDNS created a lot of TXT entries in Route 53 whose names start with cname-{{name}}.local.

@Gladdstone

Having recently come across this issue, it appears that part of the problem with the creation of erroneous cname- prefixed TXT records has to do with the construction of the plan struct and how it is then passed to the registry and onward to the cloud provider.
The plan consists of create/update/delete arrays, so the actual records have no association with each other as far as the registry or the provider is aware. The changes to the endpoints are read and executed in order, resulting in AWS (or another supporting cloud provider) correctly recognizing a record identified as CNAME as an alias, but still creating the cname- TXT record that was generated by the TXTRegistry in the prior stage of execution.
The registry is aware of the provider, because it has to call the provider's ApplyChanges function as part of its own. Barring a total overhaul of how aliases are handled, I wonder whether it would be possible to call a function from the registry level down to the provider to check for an alias, e.g. the AWS provider's useAlias function?
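
For readers unfamiliar with the internals, the change set described above has roughly the shape sketched below. This is a simplification for illustration; the real plan package's type and field names may differ by version. The point is that each slice is a flat list of endpoints, so nothing ties a primary record to the TXT ownership record derived from it.

```go
// Simplified sketch of the change set handed to the registry and provider;
// not the exact external-dns definitions, just the shape relevant here.
package main

import "fmt"

// endpointRecord stands in for endpoint.Endpoint.
type endpointRecord struct {
	DNSName    string
	RecordType string
}

// changeSet stands in for the plan's changes: independent slices, with no
// link between a primary record and its TXT ownership record.
type changeSet struct {
	Create    []endpointRecord
	UpdateOld []endpointRecord
	UpdateNew []endpointRecord
	Delete    []endpointRecord
}

func main() {
	// The TXT name was derived from RecordType ("CNAME"), so it keeps the
	// "cname-" prefix even though the provider writes the first entry as an
	// A alias record.
	cs := changeSet{Create: []endpointRecord{
		{DNSName: "api.example.com", RecordType: "CNAME"},
		{DNSName: "cname-api.example.com", RecordType: "TXT"},
	}}
	fmt.Printf("%+v\n", cs)
}
```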

@vitali-federau-fivestars

+1, experiencing the same issue: when creating A (alias) records, the TXT record uses an incorrect prefix (cname- instead of a-).

@jbilliau-rcd

Same, highly annoying. I'm having to delete Route53 records on a daily basis for dozens of clusters in order for the controller to properly create all the relevant records and go healthy with "all records are up to date".

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2023
@aaroniscode

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2023
@johngmyers
Contributor

External-dns represents ALIAS records of type A to the planner as Endpoints of type CNAME with a ProviderSpecific attribute with key alias and value true. So it is an expected quirk that the new-format txt registry ownership records have a prefix of cname-. As the installed base already has such ownership records, this would take an unreasonable amount of effort to change.

Problems with deletion would be separate bugs.
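
To make the representation described above concrete, here is a minimal sketch. The types are simplified stand-ins, not the actual endpoint.Endpoint and ProviderSpecific definitions from the external-dns source.

```go
// Simplified stand-ins for external-dns types, for illustration only.
package main

import "fmt"

type providerSpecificProperty struct {
	Name  string
	Value string
}

type endpointRecord struct {
	DNSName          string
	Targets          []string
	RecordType       string
	ProviderSpecific []providerSpecificProperty
}

func main() {
	// The planner sees a CNAME endpoint carrying alias=true; the AWS provider
	// writes it to Route 53 as an A alias, while the new-format TXT registry
	// keys the ownership name off RecordType and so emits a "cname-" prefix.
	ep := endpointRecord{
		DNSName:    "api.example.com",
		Targets:    []string{"my-elb-1234567890.us-east-1.elb.amazonaws.com"},
		RecordType: "CNAME",
		ProviderSpecific: []providerSpecificProperty{
			{Name: "alias", Value: "true"},
		},
	}
	fmt.Printf("%+v\n", ep)
}
```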

@stefkkkk

stefkkkk commented Dec 9, 2023

So, will there be any fix for this behaviour? I need to pin the image tag version (0.11.1-debian-10-r27) because of this.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 7, 2024
@jcogilvie

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 8, 2024
@stefkkkk

stefkkkk commented Apr 9, 2024

any updates?!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2024
@stefkkkk

stefkkkk commented Jul 8, 2024

> The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
>
> This bot triages un-triaged issues according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Mark this issue as fresh with /remove-lifecycle stale
>   • Close this issue with /close
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /lifecycle stale

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 6, 2024
@stefan-korchahin

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 7, 2024
@bwmills

bwmills commented Oct 7, 2024

Experiencing this issue as well.

  • EKS running Kubernetes 1.24
  • Ingresses serviced by both Nginx and Kong
    • Records point to Nginx ingress controller (NLB)
    • Records point to ALB (Kong)
  • external-dns deployed with Helm chart
    • Image: v0.15.0
    • Chart: v0.15.0

Default chart config (no overrides) looks like this (snippet) when describing the pod:

spec:
  containers:
  - args:
    - --log-level=info
    - --log-format=text
    - --interval=1m
    - --source=service
    - --source=ingress
    - --policy=upsert-only
    - --registry=txt
    - --provider=aws
    env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: us-east-1
    - name: AWS_REGION
      value: us-east-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::xxxxxxxxxxxx:role/xxxxx
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: registry.k8s.io/external-dns/external-dns:v0.15.0
    imagePullPolicy: IfNotPresent
...

All records are created and managed successfully through automation, but each host: value in a given Ingress manifest, e.g.:

spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1/...
            pathType: Prefix
            backend:
              service:
                name: microservice-api-svc
                port:
                  number: 1234

results in the creation of these three records in Route 53:

Record name           | Type  | Routing | Alias | Value/Route traffic to | ...
----------------------------------------------------------------------------------------------
api.example.com       | A     | Simple  | Yes   | nginx-ingress-contoller-xxxxx.elb.us-east-1.amazonaws.com.
api.example.com       | TXT   | Simple  | No    | "heritage=external-dns,external-dns/owner=default,...
cname-api.example.com | TXT   | Simple  | No    | "heritage=external-dns,external-dns/owner=default,...

@jaroslav-muller-jamf

I think this issue is the root cause of #3977 (at least in some cases).

As an example consider the 3 records from previous comment:

  • api.example.com A
  • api.example.com TXT
  • cname-api.example.com TXT

Let's assume you have the domain filter set to api.example.com. Then external-dns will skip the attempt to create cname-api.example.com, because it falls outside the domain filter (the separator is -, not .). As a result, the next time external-dns synchronizes the records, it doesn't find cname-api.example.com and upserts all the records.

That is at least the behaviour we observe in our clusters.
As a workaround, you can change the domain filter to example.com (if you have control over that zone).

In my opinion, the - after the cname is the problem, and I didn't find a way to change it. Setting a txt prefix won't change that: if the txt prefix were txt., the record would be txt.cname-api.example.com.
A potential fix would be to change the logic so that it creates cname-txt.api.example.com instead.
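
To illustrate the filter mismatch, here is a minimal sketch of a subdomain check; it is not the actual external-dns DomainFilter implementation, just the boundary rule that matters. A name joined with "-" is not a subdomain of the primary name, while the cname-txt. style proposed above would be.

```go
// Minimal subdomain check (illustrative; not the external-dns DomainFilter).
package main

import (
	"fmt"
	"strings"
)

// inDomain reports whether name equals domain or is a subdomain of it.
func inDomain(name, domain string) bool {
	return name == domain || strings.HasSuffix(name, "."+domain)
}

func main() {
	fmt.Println(inDomain("api.example.com", "api.example.com"))           // true
	fmt.Println(inDomain("cname-api.example.com", "api.example.com"))     // false: "-" breaks the label boundary
	fmt.Println(inDomain("cname-api.example.com", "example.com"))         // true: the workaround above
	fmt.Println(inDomain("cname-txt.api.example.com", "api.example.com")) // true: the proposed naming would match
}
```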

@ErcinDedeoglu

This is frustrating. After more than 12 hours of effort, I discovered that this issue has been open for ages. I tried hard to use open-source, ready-to-use solutions, but you guys don't allow us to enjoy a seamless experience. Now, I have to write my own container to manage it smoothly. Wasted.

@nipr-jdoenges

Similar to #3868 #4618
