internal/encoding: don't interpret file as UTF8 with binary encoding #3740

nichtsundniemand · 2025-02-10T20:45:03Z

Using, for example, @embed(type=binary) CUE would interpret the file's contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be interpretable. Bytes sequences that fail to be properly interpreted will not raise an evaluation error but instead yield erroneus bytes to the evaluator.

This change makes it so the file's contents will be passed to the evaluator verbatim.

mvdan · 2025-02-11T15:33:15Z

SGTM but this would need a test; see my longer response at #3741.

nichtsundniemand · 2025-02-14T12:11:37Z

@mvdan OK, so I took you idea from #3741 (comment)

There appears to be some issue with testscript, maybe, because I was trying strconv-unquote "Hello, world!" and I would get some syntax-errors. If the argument-strings don't contain any spaces the errors disappear :/

In my case it's not so important since I don't need to put spaces, just something I noticed.

The valid-UTF8 file contains a BOM at the start since the UTF8 decoder also does BOM stripping and I wanted to show that type=text will correctly strip that while type=binary preserves it.

The non-UTF8-file just contains some invalid byte sequence(s). Running that test against master without the fix produces a byteslice with 8 of those Unicode-Replacement chars.

mvdan · 2025-02-21T13:43:44Z

There appears to be some issue with testscript, maybe, because I was trying strconv-unquote "Hello, world!" and I would get some syntax-errors. If the argument-strings don't contain any spaces the errors disappear :/

Ah, that is because testscript is not POSIX shell. It only has single quotes, not double quotes, and they behave differently than POSIX shell too. See https://pkg.go.dev/github.com/rogpeppe/go-internal/testscript.

So you can either do strconv-unquote "foo" or strconv-unquote '"foo bar baz"'. My apologies that my original example in #3741 (comment) omitted to note this pitfall.

mvdan

Thanks!

mvdan · 2025-02-21T13:46:10Z

cmd/cue/cmd/testdata/script/embed.txtar

@@ -4,6 +4,13 @@ env CUE_EXPERIMENT=embed=0
 cmp stderr out/noembed
 env CUE_EXPERIMENT=

+# Create test files for type=binary. txtar can only hold valid UTF8 and strips any BOM - we need to test arbitrary bytes
+# as input so we use quoted strings for storage


please document what each of the two files contains, i.e. the first one is utf8 with a BOM, and the second one isn't valid utf8 at all.

Done

As I was mistaken about BOM-stripping by txtar, that part of the comment has been removed. Although it's technically possible to losslessly pack a text-file with leading BOM in a txtar the "valid-UTF8-with-BOM" testfile is still created using strconv-unquote.

I believe it is much easier to miss a non-printing-character, like the BOM, in a git diff or while editing the testdata. Writing the BOM-bytes out makes it more visible.

mvdan · 2025-02-21T13:48:33Z

Do you want to merge this as one commit, or as three?

If you want to merge as one commit, that's easy; please squash all three commits into one, making sure that the resulting commit has a good commit message, probably combining the three you already wrote.

If you want two or three commits, then the tests must pass at each commit for us to be able to merge this. Right now, if I checkout the second commit, the tests fail, as they expect the fix from the next commit.

Two commits rather than one is actually nice, where the first commit shows the wrong behavior in the test, and the fix changes the test to expect the fixed behavior. That way, both commits pass the tests, and you can see the effect of the fix - and also verify that the test is actually testing what it's meant to.

nichtsundniemand · 2025-02-21T14:14:36Z

Two commits rather than one is actually nice, where the first commit shows the wrong behavior in the test, and the fix changes the test to expect the fixed behavior. That way, both commits pass the tests, and you can see the effect of the fix - and also verify that the test is actually testing what it's meant to.

Okay, i will rework the test like this because I was thinking a similar thing about showing that the fix works.

Since txtar can only store files with valid UTF8 content it is not possible to easily store arbitraty bytes in one of the test archives. This command makes it possible to generate some non-UTF8 testdata on the fly by unquoting a set of go-quoted strings. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>

These new tests prove that CUE will treat input bytes as UTF-8 when embedding a file with `type=binary`. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>

Using, for example, `@embed(type=binary)` CUE would interpret the file's contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be interpretable. Bytes sequences that fail to be properly interpreted will not raise an evaluation error but instead yield erroneus bytes to the evaluator. This change makes it so the file's contents will be passed to the evaluator verbatim by skipping the UTF8 decoder and reading directly from the filereader. The test-cases introduced in the previous commit have been fixed to reflect the now correct behaviour. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>

nichtsundniemand requested a review from cueckoo as a code owner February 10, 2025 20:45

nichtsundniemand mentioned this pull request Feb 10, 2025

internal/encoding: binary files are incorrectly treated as UTF-8 #3741

Open

nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch from 793f1ee to 667070b Compare February 10, 2025 20:51

mvdan self-assigned this Feb 11, 2025

nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch 5 times, most recently from e8361e6 to 3571f1f Compare February 14, 2025 11:47

mvdan approved these changes Feb 21, 2025

View reviewed changes

nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch from 3571f1f to 3c3696f Compare February 22, 2025 17:40

nichtsundniemand added 2 commits February 22, 2025 19:35

cmd/cue/cmd: show incorrect handling of binary files during embedding

0553d85

These new tests prove that CUE will treat input bytes as UTF-8 when embedding a file with `type=binary`. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>

nichtsundniemand force-pushed the fix/binaryencoding_no_utf8 branch from 3c3696f to b722be5 Compare February 22, 2025 18:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internal/encoding: don't interpret file as UTF8 with binary encoding #3740

internal/encoding: don't interpret file as UTF8 with binary encoding #3740

nichtsundniemand commented Feb 10, 2025

mvdan commented Feb 11, 2025

nichtsundniemand commented Feb 14, 2025 •

edited

Loading

mvdan commented Feb 21, 2025

mvdan left a comment

mvdan Feb 21, 2025

nichtsundniemand Feb 22, 2025

mvdan commented Feb 21, 2025

nichtsundniemand commented Feb 21, 2025

internal/encoding: don't interpret file as UTF8 with binary encoding #3740

Are you sure you want to change the base?

internal/encoding: don't interpret file as UTF8 with binary encoding #3740

Conversation

nichtsundniemand commented Feb 10, 2025

mvdan commented Feb 11, 2025

nichtsundniemand commented Feb 14, 2025 • edited Loading

mvdan commented Feb 21, 2025

mvdan left a comment

Choose a reason for hiding this comment

mvdan Feb 21, 2025

Choose a reason for hiding this comment

nichtsundniemand Feb 22, 2025

Choose a reason for hiding this comment

mvdan commented Feb 21, 2025

nichtsundniemand commented Feb 21, 2025

nichtsundniemand commented Feb 14, 2025 •

edited

Loading