Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/encoding: don't interpret file as UTF8 with binary encoding #3740

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nichtsundniemand
Copy link

Using, for example, @embed(type=binary) CUE would interpret the file's contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be interpretable. Bytes sequences that fail to be properly interpreted will not raise an evaluation error but instead yield erroneus bytes to the evaluator.

This change makes it so the file's contents will be passed to the evaluator verbatim.

@mvdan
Copy link
Member

mvdan commented Feb 11, 2025

SGTM but this would need a test; see my longer response at #3741.

@mvdan mvdan self-assigned this Feb 11, 2025
@nichtsundniemand nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch 5 times, most recently from e8361e6 to 3571f1f Compare February 14, 2025 11:47
@nichtsundniemand
Copy link
Author

nichtsundniemand commented Feb 14, 2025

@mvdan OK, so I took you idea from #3741 (comment)

There appears to be some issue with testscript, maybe, because I was trying strconv-unquote "Hello, world!" and I would get some syntax-errors. If the argument-strings don't contain any spaces the errors disappear :/

In my case it's not so important since I don't need to put spaces, just something I noticed.

The valid-UTF8 file contains a BOM at the start since the UTF8 decoder also does BOM stripping and I wanted to show that type=text will correctly strip that while type=binary preserves it.

The non-UTF8-file just contains some invalid byte sequence(s). Running that test against master without the fix produces a byteslice with 8 of those Unicode-Replacement chars.

@mvdan
Copy link
Member

mvdan commented Feb 21, 2025

There appears to be some issue with testscript, maybe, because I was trying strconv-unquote "Hello, world!" and I would get some syntax-errors. If the argument-strings don't contain any spaces the errors disappear :/

Ah, that is because testscript is not POSIX shell. It only has single quotes, not double quotes, and they behave differently than POSIX shell too. See https://pkg.go.dev/github.com/rogpeppe/go-internal/testscript.

So you can either do strconv-unquote "foo" or strconv-unquote '"foo bar baz"'. My apologies that my original example in #3741 (comment) omitted to note this pitfall.

Copy link
Member

@mvdan mvdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -4,6 +4,13 @@ env CUE_EXPERIMENT=embed=0
cmp stderr out/noembed
env CUE_EXPERIMENT=

# Create test files for type=binary. txtar can only hold valid UTF8 and strips any BOM - we need to test arbitrary bytes
# as input so we use quoted strings for storage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please document what each of the two files contains, i.e. the first one is utf8 with a BOM, and the second one isn't valid utf8 at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

As I was mistaken about BOM-stripping by txtar, that part of the comment has been removed. Although it's technically possible to losslessly pack a text-file with leading BOM in a txtar the "valid-UTF8-with-BOM" testfile is still created using strconv-unquote.

I believe it is much easier to miss a non-printing-character, like the BOM, in a git diff or while editing the testdata. Writing the BOM-bytes out makes it more visible.

@mvdan
Copy link
Member

mvdan commented Feb 21, 2025

Do you want to merge this as one commit, or as three?

If you want to merge as one commit, that's easy; please squash all three commits into one, making sure that the resulting commit has a good commit message, probably combining the three you already wrote.

If you want two or three commits, then the tests must pass at each commit for us to be able to merge this. Right now, if I checkout the second commit, the tests fail, as they expect the fix from the next commit.

Two commits rather than one is actually nice, where the first commit shows the wrong behavior in the test, and the fix changes the test to expect the fixed behavior. That way, both commits pass the tests, and you can see the effect of the fix - and also verify that the test is actually testing what it's meant to.

@nichtsundniemand
Copy link
Author

Two commits rather than one is actually nice, where the first commit shows the wrong behavior in the test, and the fix changes the test to expect the fixed behavior. That way, both commits pass the tests, and you can see the effect of the fix - and also verify that the test is actually testing what it's meant to.

Okay, i will rework the test like this because I was thinking a similar thing about showing that the fix works.

Since txtar can only store files with valid UTF8 content it is not possible to easily
store arbitraty bytes in one of the test archives.

This command makes it possible to generate some non-UTF8 testdata on the fly by unquoting
a set of go-quoted strings.

Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
@nichtsundniemand nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch from 3571f1f to 3c3696f Compare February 22, 2025 17:40
These new tests prove that CUE will treat input bytes as UTF-8 when embedding a file with `type=binary`.

Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
Using, for example, `@embed(type=binary)` CUE would interpret the file's
contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be
interpretable. Bytes sequences that fail to be properly interpreted will not
raise an evaluation error but instead yield erroneus bytes to the evaluator.

This change makes it so the file's contents will be passed to the evaluator
verbatim by skipping the UTF8 decoder and reading directly from the filereader.

The test-cases introduced in the previous commit have been fixed to reflect the
now correct behaviour.

Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
@nichtsundniemand nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch from 3c3696f to b722be5 Compare February 22, 2025 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants