Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utf-8 csv files for UN/LOCODE file #37

Merged
merged 1 commit into from
Aug 19, 2024

Conversation

End-rey
Copy link
Contributor

@End-rey End-rey commented Aug 12, 2024

Refs: #25.
I tried different encodings on the files and none worked. I also looked at other file sources and found a solution that if you pull files from the mdb extension, they are saved in utf8 and with the correct characters.

Copy link

codecov bot commented Aug 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 26.04%. Comparing base (619dceb) to head (a6b65a7).
Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #37   +/-   ##
=======================================
  Coverage   26.04%   26.04%           
=======================================
  Files          23       23           
  Lines         768      768           
=======================================
  Hits          200      200           
  Misses        533      533           
  Partials       35       35           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@roman-khimov roman-khimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked https://github.com/datasets/un-locode? I've just discovered it and seems like it has exactly what we need (but this needs to be verified). Dealing with MDB is no fun, I'd prefer something simpler and if we have some source of cleansed data then it can be reused as well.

@roman-khimov
Copy link
Member

Or maybe it's just CP1252, at least as suggested by https://github.com/datasets/un-locode/blob/main/scripts/prepare.py and nearby code.

@End-rey
Copy link
Contributor Author

End-rey commented Aug 12, 2024

Have you checked https://github.com/datasets/un-locode?

Yes, when I was solving the problem, I peeked into this repository for script.

Or maybe it's just CP1252, at least as suggested by https://github.com/datasets/un-locode/blob/main/scripts/prepare.py and nearby code.

I tried cp1252 with files, that i get from https://service.unece.org/trade/locode/loc241csv.zip, and it not worked.
I tried running the python code "prepare.py" and the same encoding problem occurs there. But if you have data with correct characters and run "integrate.py", it seems that pandas will update the data with the merge function and will not break the encoding.

I was only able to get files with the correct characters from the mdb file, so I'll try to search Go package to deal with mdb

@roman-khimov
Copy link
Member

I peeked into this repository for script.

Does it have correct UTF-8 in their DBs? Because we can fetch them instead of UN DBs if they're fine. This just needs to be verified.

@End-rey
Copy link
Contributor Author

End-rey commented Aug 12, 2024

Does it have correct UTF-8 in their DBs?

Yes, because they just get files from unece.org and process them so they don't have encoding problems.

How we can fetch there? Just wget the raw file from github?

@roman-khimov
Copy link
Member

We can add it as git submodule or fetch raw files from GH. Both ways work for me.

@@ -49,6 +49,8 @@ const (
subDivCountry
subDivSubdivision
subDivName
_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have deleted this line in the previous commit. So no need to add them back

@@ -1,28 +0,0 @@
#!/bin/bash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably commits in this PR should be rebased into 2 commits. One about new source of data and another one about version changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, add notes about it to the change log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As planned, upgrade to the new version will be in a separate pr

@End-rey End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from 6a604ad to adf59f1 Compare August 16, 2024 10:54
Copy link
Member

@carpawell carpawell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not exactly "git" as it is said in the commit, git != GitHub. Commits usually start with an imperative verb (e.g. "Get unlocode files from..."). "Ref" can be changed to "Closes" and merging will automatically close the issue.
Otherwise, I am OK.

@End-rey End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from adf59f1 to 481bcb2 Compare August 16, 2024 19:03
Copy link
Member

@roman-khimov roman-khimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the DB is the same after this? I expected some updates to pkg/locodedb/data.

Makefile Outdated
@@ -3,8 +3,7 @@
VERSION ?= "$(shell git describe --tags --match "v*" --dirty --always --abbrev=8 2>/dev/null || cat VERSION 2>/dev/null || echo "develop")"
LOCODECLI ?= locode-db
LOCODEDB ?= pkg/locodedb/data
UNLOCODEFILE = loc232csv.zip
UNLOCODEPREFIX = "2023-2"
UNLOCODEREVIOSION = 1b2c902d2f0057b36d800d8f25f40234412858cf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNLOCODEREVISION

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not 3648bfa776701c329d27136bef29fb3e21853f20, btw?

Makefile Outdated
wget -c https://raw.githubusercontent.com/datasets/un-locode/${UNLOCODEREVIOSION}/data/code-list.csv -O in/CodeList.tmp.csv
awk 'NR > 1' in/SubdivisionCodes.tmp.csv > in/SubdivisionCodes.csv
awk 'NR > 1' in/CodeList.tmp.csv > in/CodeList.csv
rm in/*.tmp.csv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a proper Makefile, it's a script (see deleted lines for how it can work in a better way with dependencies).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into files targets. Is that what you needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better this way. I'd still keep downloaded files, but this works too.

@roman-khimov
Copy link
Member

Now please reorganize commits (fixes to previous commits are not expected in general) and we're good to go.

@roman-khimov
Copy link
Member

CSVs are easy to diff, so it's important to take a look at what are the changes.

 pr-37 |20805 +++++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 10428 insertions(+), 10377 deletions(-)

A lot of changes like

-AEAMF,Musaffah,5,AZ,Abu Z¸aby [Abu Dhabi],24.383333,54.483334
+AEAMF,Musaffah,5,AZ,Abū Z̧aby [Abu Dhabi],24.383333,54.483334
...
-AERWP,Ruwais Port,5,AZ,Abu Z¸aby [Abu Dhabi],24.1,52.716667
-AESHJ,Sharjah,5,SH,Ash Shariqah [Sharjah],25.35,55.383335
+AERWP,Ruwais Port,5,AZ,Abū Z̧aby [Abu Dhabi],24.1,52.716667
+AESHJ,Sharjah,5,SH,Ash Shāriqah [Sharjah],25.35,55.383335
...
-AOMTL,Matala,2,HUI,Huيla,-14.733334,15.033334
+AOMTL,Matala,2,HUI,Huíla,-14.733334,15.033334

as expected.

Also there are fixes for cases like

SISTS,Sveta Trojica v Slovenskih goricah,1,204,"Sveta Trojica v Slovenskih
goricah",46.566666,15.866667

Which are valid CSVs, but still not something you'd expect. And new ones are being added like

+VNVUT,Vung Tau,5,43,Bà Rịa - Vũng Tàu,10.35,107.066666

Seems to be OK overall.

Get unlocode files from https://github.com/datasets/un-locode. Version of files
2023-2. Update pkg/locodedb/data/locodes.csv.bz2.

Closes #25.

Signed-off-by: Andrey Butusov <andrey@nspcc.io>
@End-rey End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from cbc6fd0 to a6b65a7 Compare August 19, 2024 08:20
@roman-khimov roman-khimov merged commit f4e1ab3 into master Aug 19, 2024
11 checks passed
@roman-khimov roman-khimov deleted the 25-unknown-symbols-in-subdivnames branch August 19, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants