-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Utf-8 csv files for UN/LOCODE file #37
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #37 +/- ##
=======================================
Coverage 26.04% 26.04%
=======================================
Files 23 23
Lines 768 768
=======================================
Hits 200 200
Misses 533 533
Partials 35 35 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you checked https://github.com/datasets/un-locode? I've just discovered it and seems like it has exactly what we need (but this needs to be verified). Dealing with MDB is no fun, I'd prefer something simpler and if we have some source of cleansed data then it can be reused as well.
Or maybe it's just CP1252, at least as suggested by https://github.com/datasets/un-locode/blob/main/scripts/prepare.py and nearby code. |
Yes, when I was solving the problem, I peeked into this repository for script.
I tried cp1252 with files, that i get from https://service.unece.org/trade/locode/loc241csv.zip, and it not worked. I was only able to get files with the correct characters from the mdb file, so I'll try to search Go package to deal with mdb |
Does it have correct UTF-8 in their DBs? Because we can fetch them instead of UN DBs if they're fine. This just needs to be verified. |
Yes, because they just get files from unece.org and process them so they don't have encoding problems. How we can fetch there? Just wget the raw file from github? |
We can add it as git submodule or fetch raw files from GH. Both ways work for me. |
internal/parsers/table/csv/calls.go
Outdated
@@ -49,6 +49,8 @@ const ( | |||
subDivCountry | |||
subDivSubdivision | |||
subDivName | |||
_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have deleted this line in the previous commit. So no need to add them back
unlocode-mdb-to-csv.sh
Outdated
@@ -1,28 +0,0 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably commits in this PR should be rebased into 2 commits. One about new source of data and another one about version changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add notes about it to the change log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As planned, upgrade to the new version will be in a separate pr
6a604ad
to
adf59f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not exactly "git" as it is said in the commit, git
!= GitHub. Commits usually start with an imperative verb (e.g. "Get unlocode files from..."). "Ref" can be changed to "Closes" and merging will automatically close the issue.
Otherwise, I am OK.
adf59f1
to
481bcb2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the DB is the same after this? I expected some updates to pkg/locodedb/data
.
Makefile
Outdated
@@ -3,8 +3,7 @@ | |||
VERSION ?= "$(shell git describe --tags --match "v*" --dirty --always --abbrev=8 2>/dev/null || cat VERSION 2>/dev/null || echo "develop")" | |||
LOCODECLI ?= locode-db | |||
LOCODEDB ?= pkg/locodedb/data | |||
UNLOCODEFILE = loc232csv.zip | |||
UNLOCODEPREFIX = "2023-2" | |||
UNLOCODEREVIOSION = 1b2c902d2f0057b36d800d8f25f40234412858cf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNLOCODEREVISION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not 3648bfa776701c329d27136bef29fb3e21853f20, btw?
Makefile
Outdated
wget -c https://raw.githubusercontent.com/datasets/un-locode/${UNLOCODEREVIOSION}/data/code-list.csv -O in/CodeList.tmp.csv | ||
awk 'NR > 1' in/SubdivisionCodes.tmp.csv > in/SubdivisionCodes.csv | ||
awk 'NR > 1' in/CodeList.tmp.csv > in/CodeList.csv | ||
rm in/*.tmp.csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a proper Makefile
, it's a script (see deleted lines for how it can work in a better way with dependencies).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split into files targets. Is that what you needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better this way. I'd still keep downloaded files, but this works too.
Now please reorganize commits (fixes to previous commits are not expected in general) and we're good to go. |
CSVs are easy to diff, so it's important to take a look at what are the changes.
A lot of changes like
as expected. Also there are fixes for cases like
Which are valid CSVs, but still not something you'd expect. And new ones are being added like
Seems to be OK overall. |
Get unlocode files from https://github.com/datasets/un-locode. Version of files 2023-2. Update pkg/locodedb/data/locodes.csv.bz2. Closes #25. Signed-off-by: Andrey Butusov <andrey@nspcc.io>
cbc6fd0
to
a6b65a7
Compare
Refs: #25.
I tried different encodings on the files and none worked. I also looked at other file sources and found a solution that if you pull files from the mdb extension, they are saved in utf8 and with the correct characters.