Utf-8 csv files for UN/LOCODE file #37

End-rey · 2024-08-12T08:24:26Z

Refs: #25.
I tried different encodings on the files and none worked. I also looked at other file sources and found a solution that if you pull files from the mdb extension, they are saved in utf8 and with the correct characters.

codecov · 2024-08-12T08:25:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 26.04%. Comparing base (619dceb) to head (a6b65a7).
Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master      #37   +/-   ##
=======================================
  Coverage   26.04%   26.04%           
=======================================
  Files          23       23           
  Lines         768      768           
=======================================
  Hits          200      200           
  Misses        533      533           
  Partials       35       35

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

roman-khimov

Have you checked https://github.com/datasets/un-locode? I've just discovered it and seems like it has exactly what we need (but this needs to be verified). Dealing with MDB is no fun, I'd prefer something simpler and if we have some source of cleansed data then it can be reused as well.

Makefile

roman-khimov · 2024-08-12T12:09:29Z

Or maybe it's just CP1252, at least as suggested by https://github.com/datasets/un-locode/blob/main/scripts/prepare.py and nearby code.

End-rey · 2024-08-12T12:36:15Z

Have you checked https://github.com/datasets/un-locode?

Yes, when I was solving the problem, I peeked into this repository for script.

Or maybe it's just CP1252, at least as suggested by https://github.com/datasets/un-locode/blob/main/scripts/prepare.py and nearby code.

I tried cp1252 with files, that i get from https://service.unece.org/trade/locode/loc241csv.zip, and it not worked.
I tried running the python code "prepare.py" and the same encoding problem occurs there. But if you have data with correct characters and run "integrate.py", it seems that pandas will update the data with the merge function and will not break the encoding.

I was only able to get files with the correct characters from the mdb file, so I'll try to search Go package to deal with mdb

roman-khimov · 2024-08-12T13:51:19Z

I peeked into this repository for script.

Does it have correct UTF-8 in their DBs? Because we can fetch them instead of UN DBs if they're fine. This just needs to be verified.

End-rey · 2024-08-12T15:42:07Z

Does it have correct UTF-8 in their DBs?

Yes, because they just get files from unece.org and process them so they don't have encoding problems.

How we can fetch there? Just wget the raw file from github?

roman-khimov · 2024-08-12T15:50:22Z

We can add it as git submodule or fetch raw files from GH. Both ways work for me.

AliceInHunterland · 2024-08-15T08:42:30Z

internal/parsers/table/csv/calls.go

@@ -49,6 +49,8 @@ const (
 	subDivCountry
 	subDivSubdivision
 	subDivName
+	_


You have deleted this line in the previous commit. So no need to add them back

AliceInHunterland · 2024-08-15T08:49:45Z

unlocode-mdb-to-csv.sh

@@ -1,28 +0,0 @@
-#!/bin/bash


Probably commits in this PR should be rebased into 2 commits. One about new source of data and another one about version changes.

Also, add notes about it to the change log.

As planned, upgrade to the new version will be in a separate pr

carpawell

That is not exactly "git" as it is said in the commit, git != GitHub. Commits usually start with an imperative verb (e.g. "Get unlocode files from..."). "Ref" can be changed to "Closes" and merging will automatically close the issue.
Otherwise, I am OK.

Makefile

roman-khimov

So, the DB is the same after this? I expected some updates to pkg/locodedb/data.

roman-khimov · 2024-08-18T10:34:50Z

Makefile

@@ -3,8 +3,7 @@
 VERSION ?= "$(shell git describe --tags --match "v*" --dirty --always --abbrev=8 2>/dev/null || cat VERSION 2>/dev/null || echo "develop")"
 LOCODECLI ?= locode-db
 LOCODEDB ?= pkg/locodedb/data
-UNLOCODEFILE = loc232csv.zip
-UNLOCODEPREFIX = "2023-2"
+UNLOCODEREVIOSION = 1b2c902d2f0057b36d800d8f25f40234412858cf


UNLOCODEREVISION

Why not 3648bfa776701c329d27136bef29fb3e21853f20, btw?

roman-khimov · 2024-08-18T10:37:49Z

Makefile

+	wget -c https://raw.githubusercontent.com/datasets/un-locode/${UNLOCODEREVIOSION}/data/code-list.csv -O in/CodeList.tmp.csv
+	awk 'NR > 1' in/SubdivisionCodes.tmp.csv > in/SubdivisionCodes.csv
+	awk 'NR > 1' in/CodeList.tmp.csv > in/CodeList.csv
+	rm in/*.tmp.csv


This is not a proper Makefile, it's a script (see deleted lines for how it can work in a better way with dependencies).

Split into files targets. Is that what you needed?

Much better this way. I'd still keep downloaded files, but this works too.

roman-khimov · 2024-08-19T07:12:45Z

Now please reorganize commits (fixes to previous commits are not expected in general) and we're good to go.

roman-khimov · 2024-08-19T07:41:38Z

CSVs are easy to diff, so it's important to take a look at what are the changes.

 pr-37 |20805 +++++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 10428 insertions(+), 10377 deletions(-)

A lot of changes like

-AEAMF,Musaffah,5,AZ,Abu Z¸aby [Abu Dhabi],24.383333,54.483334
+AEAMF,Musaffah,5,AZ,Abū Z̧aby [Abu Dhabi],24.383333,54.483334
...
-AERWP,Ruwais Port,5,AZ,Abu Z¸aby [Abu Dhabi],24.1,52.716667
-AESHJ,Sharjah,5,SH,Ash Shariqah [Sharjah],25.35,55.383335
+AERWP,Ruwais Port,5,AZ,Abū Z̧aby [Abu Dhabi],24.1,52.716667
+AESHJ,Sharjah,5,SH,Ash Shāriqah [Sharjah],25.35,55.383335
...
-AOMTL,Matala,2,HUI,Huيla,-14.733334,15.033334
+AOMTL,Matala,2,HUI,Huíla,-14.733334,15.033334

as expected.

Also there are fixes for cases like

SISTS,Sveta Trojica v Slovenskih goricah,1,204,"Sveta Trojica v Slovenskih
goricah",46.566666,15.866667

Which are valid CSVs, but still not something you'd expect. And new ones are being added like

+VNVUT,Vung Tau,5,43,Bà Rịa - Vũng Tàu,10.35,107.066666

Seems to be OK overall.

Get unlocode files from https://github.com/datasets/un-locode. Version of files 2023-2. Update pkg/locodedb/data/locodes.csv.bz2. Closes #25. Signed-off-by: Andrey Butusov <andrey@nspcc.io>

End-rey requested review from carpawell, cthulhu-rider, roman-khimov and AliceInHunterland as code owners August 12, 2024 08:24

roman-khimov reviewed Aug 12, 2024

View reviewed changes

Makefile Outdated Show resolved Hide resolved

Makefile Outdated Show resolved Hide resolved

Makefile Outdated Show resolved Hide resolved

AliceInHunterland reviewed Aug 15, 2024

View reviewed changes

End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from 6a604ad to adf59f1 Compare August 16, 2024 10:54

carpawell reviewed Aug 16, 2024

View reviewed changes

Makefile Outdated Show resolved Hide resolved

Makefile Outdated Show resolved Hide resolved

End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from adf59f1 to 481bcb2 Compare August 16, 2024 19:03

End-rey requested review from roman-khimov, carpawell and AliceInHunterland August 16, 2024 19:17

roman-khimov reviewed Aug 18, 2024

View reviewed changes

locodedb: get unlocode files from github

a6b65a7

Get unlocode files from https://github.com/datasets/un-locode. Version of files 2023-2. Update pkg/locodedb/data/locodes.csv.bz2. Closes #25. Signed-off-by: Andrey Butusov <andrey@nspcc.io>

End-rey force-pushed the 25-unknown-symbols-in-subdivnames branch from cbc6fd0 to a6b65a7 Compare August 19, 2024 08:20

roman-khimov approved these changes Aug 19, 2024

View reviewed changes

roman-khimov merged commit f4e1ab3 into master Aug 19, 2024
11 checks passed

roman-khimov deleted the 25-unknown-symbols-in-subdivnames branch August 19, 2024 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Utf-8 csv files for UN/LOCODE file #37

Utf-8 csv files for UN/LOCODE file #37

End-rey commented Aug 12, 2024

codecov bot commented Aug 12, 2024 •

edited

Loading

roman-khimov left a comment

roman-khimov commented Aug 12, 2024

End-rey commented Aug 12, 2024

roman-khimov commented Aug 12, 2024

End-rey commented Aug 12, 2024

roman-khimov commented Aug 12, 2024

AliceInHunterland Aug 15, 2024

AliceInHunterland Aug 15, 2024

AliceInHunterland Aug 15, 2024

End-rey Aug 16, 2024

carpawell left a comment

roman-khimov left a comment

roman-khimov Aug 18, 2024

roman-khimov Aug 18, 2024

roman-khimov Aug 18, 2024

End-rey Aug 19, 2024

roman-khimov Aug 19, 2024

roman-khimov commented Aug 19, 2024

roman-khimov commented Aug 19, 2024

Utf-8 csv files for UN/LOCODE file #37

Utf-8 csv files for UN/LOCODE file #37

Conversation

End-rey commented Aug 12, 2024

codecov bot commented Aug 12, 2024 • edited Loading

Codecov Report

roman-khimov left a comment

Choose a reason for hiding this comment

roman-khimov commented Aug 12, 2024

End-rey commented Aug 12, 2024

roman-khimov commented Aug 12, 2024

End-rey commented Aug 12, 2024

roman-khimov commented Aug 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

carpawell left a comment

Choose a reason for hiding this comment

roman-khimov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

roman-khimov commented Aug 19, 2024

roman-khimov commented Aug 19, 2024

codecov bot commented Aug 12, 2024 •

edited

Loading