-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
internal/encoding: don't interpret file as UTF8 with binary encoding #3740
base: master
Are you sure you want to change the base?
internal/encoding: don't interpret file as UTF8 with binary encoding #3740
Conversation
793f1ee
to
667070b
Compare
SGTM but this would need a test; see my longer response at #3741. |
e8361e6
to
3571f1f
Compare
@mvdan OK, so I took you idea from #3741 (comment) There appears to be some issue with testscript, maybe, because I was trying In my case it's not so important since I don't need to put spaces, just something I noticed. The valid-UTF8 file contains a BOM at the start since the UTF8 decoder also does BOM stripping and I wanted to show that The non-UTF8-file just contains some invalid byte sequence(s). Running that test against |
Ah, that is because testscript is not POSIX shell. It only has single quotes, not double quotes, and they behave differently than POSIX shell too. See https://pkg.go.dev/github.com/rogpeppe/go-internal/testscript. So you can either do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@@ -4,6 +4,13 @@ env CUE_EXPERIMENT=embed=0 | |||
cmp stderr out/noembed | |||
env CUE_EXPERIMENT= | |||
|
|||
# Create test files for type=binary. txtar can only hold valid UTF8 and strips any BOM - we need to test arbitrary bytes | |||
# as input so we use quoted strings for storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please document what each of the two files contains, i.e. the first one is utf8 with a BOM, and the second one isn't valid utf8 at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
As I was mistaken about BOM-stripping by txtar, that part of the comment has been removed. Although it's technically possible to losslessly pack a text-file with leading BOM in a txtar the "valid-UTF8-with-BOM" testfile is still created using strconv-unquote.
I believe it is much easier to miss a non-printing-character, like the BOM, in a git diff or while editing the testdata. Writing the BOM-bytes out makes it more visible.
Do you want to merge this as one commit, or as three? If you want to merge as one commit, that's easy; please squash all three commits into one, making sure that the resulting commit has a good commit message, probably combining the three you already wrote. If you want two or three commits, then the tests must pass at each commit for us to be able to merge this. Right now, if I checkout the second commit, the tests fail, as they expect the fix from the next commit. Two commits rather than one is actually nice, where the first commit shows the wrong behavior in the test, and the fix changes the test to expect the fixed behavior. That way, both commits pass the tests, and you can see the effect of the fix - and also verify that the test is actually testing what it's meant to. |
Okay, i will rework the test like this because I was thinking a similar thing about showing that the fix works. |
Since txtar can only store files with valid UTF8 content it is not possible to easily store arbitraty bytes in one of the test archives. This command makes it possible to generate some non-UTF8 testdata on the fly by unquoting a set of go-quoted strings. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
3571f1f
to
3c3696f
Compare
These new tests prove that CUE will treat input bytes as UTF-8 when embedding a file with `type=binary`. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
Using, for example, `@embed(type=binary)` CUE would interpret the file's contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be interpretable. Bytes sequences that fail to be properly interpreted will not raise an evaluation error but instead yield erroneus bytes to the evaluator. This change makes it so the file's contents will be passed to the evaluator verbatim by skipping the UTF8 decoder and reading directly from the filereader. The test-cases introduced in the previous commit have been fixed to reflect the now correct behaviour. Signed-off-by: nichtsundniemand <rufus.schaefing@gmail.com>
3c3696f
to
b722be5
Compare
Using, for example,
@embed(type=binary)
CUE would interpret the file's contents as UTF8. Since not all byte-sequences are valid UTF8, some will not be interpretable. Bytes sequences that fail to be properly interpreted will not raise an evaluation error but instead yield erroneus bytes to the evaluator.This change makes it so the file's contents will be passed to the evaluator verbatim.