-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathfct_standardisation.qmd
572 lines (335 loc) · 24.3 KB
/
fct_standardisation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
# Food Composition Table & Databases: Standardisation
## Introduction
{#fig-framework}
### Selecting food composition data
When selecting the food composition table or database (FCT) that will be used, it is good to reflect on the following questions:
1. Relevancy for the study/context (e.g., is that FCT/FCBD geographically and culturally close to our survey scope?).
2. FCT availability & missing values (e.g., are relevant foods and nutrients reported?).
3. Data quality and reporting (e.g., what are the method of analysis and metadata available?).
### Objective
This document provide, together with the template document, the steps and description for cleaning and standardising FCTs from diverse sources. More details about the cleaned data that can be found in the repository is documented in this folder (documentation).
For easy navigation and use of this script it is recommended to use Rstudio. In RStudio please click the "Show Document Outline" button to the right of the source button, at the top right of this window. This will allow for easier navigation of the script.
### Environment Prep
First we need to check what (@sec-packages) are installed. If you have run this template before in this RStudio project and are sure these packages are already installed, you can comment out (put a hash at the start of) line 20, and skip it.
```{r}
# Run this to clean the environment
rm(list = ls())
# Loading libraries
library(readxl) # reading and writing excel files
library(stringr) # character string handling
library(dplyr) # cleaning data
library(here) # file management
```
## Obtaining the raw (FCT) file
### Data license
Before using any dataset, we recommend to check licensing conditions & record the data source, you can use the [README template](link-to-readme-template).
### Downloading the data {#sec-download}
If the data is publicly available online, usually you only need to run the code below to obtain the raw files. Remember you only need to do it the first time! Then, the data will be stored in the folder of your choice (see below).
For instance, many raw files can be found provided by the FAO [here](https://www.fao.org/infoods/infoods/tables-and-databases/en/), in various formats.
Once the link to the data is found, check what file type it is, and paste the direct file link to replace the fill-in value below.
### File names conventions
We advise to use the ISO code (2 digits) (see [ISO 3166 2-alpha code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) for further information) of the country or the region of the FCT scope, plus the two last digits of the year of publication to name, both the folder which will contain the data and the scripts related to the FCT. For instance, Western Africa FCT, 2019 will be coded as WA19. This will help with the interoperability, reusability and findability of the data. Also, to streamline the work in the future. That name convention will be used also as the identifier of the FCT.
Note that you need to create the folders to store the FCT.
```{r warning=FALSE}
f <- "https://www.fao.org/fileadmin/user_upload/faoweb/2020/WAFCT_2019.xlsx"
download.file(f,
destfile = here::here("data", # data folder
'WA19', #FCT folder
#FCT file
"WAFCT_2019.xlsx"),
method="wininet", #use "curl" for OS X / Linux, "wininet" for Windows
mode="wb")
```
If using an RStudio project, and you put the .R file and the data file in the same folder as the RStudio project or within a subfolder, files and folders are much easier to navigate as your project/here::here location automatically moves to the main project folder.
### Getting your file into R
If you are using an RStudio project but used a different download method, or already have the file you want to process on your computer, or are using base R we can still use the `here::here()` function, however we will have to find the file first.
The best practice is to put the file in the same folder as this script, or in a folder within the project. If this is done, then use `here::here()` to find your current working directory, and then navigate to the file's folder. More information about the here package can be found [here](https://here.r-lib.org/).
```{r eval=FALSE}
# Run this script to see where is your directory
here::here()
```
In order to navigate R to the file, you have to include each subfolder between the `here::here()` location and the file itself. So for our file structure, we have the 'data' folder, the 'FCT' folder and the FCT file).
Find your file in your project, and then direct `here::here()` to it.
```{r}
# This identifies the file and file path, and saves it as a variable
FCT_file_location <- here::here('data','WA19', "WAFCT_2019.xlsx")
```
## Importing the data (loading the data)
### Importing Files
After having found the file on your system that we want to import it. FCT files come in many different forms - the most common being ".xlsx" files and ".csv" files. Methods to import both of these file types will be covered - please navigate to the relevant subsection.
During import, a identifier for the FCT is created and added to the table. Please replace 'WA19' from the next code chunck with the FCT id., comprised of the countries [ISO 3166 2-alpha code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes), and the year the FCT was produced (e.g. for the Western Africa FCT from 2019, the reference would be 'MWA9').
This should be the same as the folder name explained in (section 1.3)\[link-to-section\].
```{r}
# This is an example of the name
FCT_id <- 'WA19' # Change two first letter for your ISO 2 code & the two digits for the last two digits of the year of publication.
```
If you used the download method above @sec-download then we will see the same location as specified there to specify the file. We can simply use the object `FCT_file_location`, or copy the contents inside the `here::here()` and use it to fill the `here::here()` in the line of code below.
#### Importing .xlsx files
For the spreadsheet (excel-type of files), first, you need to check what information is provided and which of the sheet is providing the FC data.
```{r}
# Checking the sheets
readxl::excel_sheets(FCT_file_location)
data.df <- readxl::read_excel(FCT_file_location, #The file location, as identified in section 2.1
sheet = 5 # Change to the excel sheet where the FCT is stored in the excel file
) %>%
mutate(source_fct = FCT_id) #Creates the source_fct column and fills with a id for this FCT, as filled in in section 2.2.
```
#### Importing .csv files
```{r eval = FALSE}
data.df <- read.csv2(FCT_file_location, #The file location, as identified in section 2.1
sep = ",") %>% # Replace w/ other symbol if needed
mutate(source_fct = FCT_id) #Creates the source_fct column and fills with a id for this FCT, as filled in in section 2.2.
```
Once imported, it is important to check the data.frame created from the csv, by using `head(data.df)` or clicking on its entry in the Environment panel of RStudio (This second option is not advised with very large files, however, as it can be slow).
If the data shown by doing this has all its columns combined, with a symbol in-between, then that symbol (e.g. ';') is the separator for that csv. Replace comma in the `sep = ","` line from the code block above with the new symbol, and run the entire block again.
#### Visually checking the data
```{r}
# Checking the dataframe
head(data.df)
```
#### Checking the loaded data
:::{.callout-note}
## Question
How many rows & columns have the data?
:::
You can use the function `dim()` to answer to check the number of rows and column.
::: {.callout-tip collapse="true"}
## Answer
```{r }
dim(data.df) # rows & columns
```
:::
Other useful functions to evaluate the structure of the data are:
```{r eval=FALSE}
# Structure (variable names, class, etc.)
str(data.df)
# Checking the last rows and columns
tail(data.df)
```
::: {.callout-caution}
## Using `View()`
The function `View()` can be use for opening the dataframe in a tab. However, if the dataset is very very big, it may crash the R session.
:::
After checking that the correct FCT file have loaded the, then proceed. If not, find the correct file and import it instead.
## Cleaning (tidying) and standardising the data
### Formatting FCT into a tabular format
#### Trimming dataframe rows
Running this will trim down the table to only include the row numbers between x and y - replace x and y with your desired values. If you wanted to include multiple row ranges, that is also possible - use comments to differentiate between different row ranges and individual rows. e.g. if you wanted to include rows a:b, row c, row e, and rows g:x, then the code would be `slice(a:b, c, e, g:x)`.
```{r}
data.df %>% slice(1:5) %>% knitr::kable()
```
#### Trimming dataframe columns
If you only wish to include certain columns/nutrients, then you might wish to remove the unnecessary columns to make the dataframe easier to read and manage. This can be done through 2 methods; either by selecting the names of the columns you want to keep, or by selecting the names of the columns you want to remove.
#### Keep specified columns only
This method requires creating a list of column names you want to keep - for #example the line below would select the columns 'Energy_kcal', 'Fatg', 'Protein_g', but nothing else. If you wish to trim the columns this way, replace the items in the first line with the column names you want to keep, then run the code block below.
```{r warning=FALSE}
# Storing the variables you want to keep
columns_to_keep <- c('Scientific name', 'Energy\r\n(kJ)')
# Selecting the variables
data.df %>% select(columns_to_keep) %>%
head(5) %>%
knitr::kable()
```
#### Remove specified columns, keep all others
Sometimes it is easier to list the columns you want to remove, rather than the ones you want to keep. The code block below identifies the columns to be removed ('VitB12_mcg' and 'Calcium_mg' in the example), and then removes them. If you wish to trim the columns this way, replace the items in the first line with the column names you want to remove, then run the code block below.
This works in a similar way to the codeblock in section 3.3.1, however by putting an exclamation mark (!) before the list of columns, it inverts the selection - instead of instructing R to keep only the listed columns (as with the codeblock above), it instructs R to keep all columns but the listed ones.
```{r warning=FALSE}
# Selecting the variables that you don't want to keep
columns_to_remove <- c('Food name in French', 'Sum of proximate components\r\n(g)')
data.df %>% select(!columns_to_remove) %>%
head(5) %>%
knitr::kable()
```
### Creating food groups variable
Some food composition tables reported food
groups that were placed as the first row of each category, however that it is not a data structure that can be used, as we need one column per variable. Hence, the food group names are extracted from the rows, and are allocated as a new attribute of each food (e.g., fish and fishery products to catfish). The food groups are stored in a new column (`food_group`).
This process requires multiple steps, each covered in their own subsections below: Extracting food group names, Creating the variable, and checking changes in the structure.
#### Extracting food group names
```{r warning=FALSE}
#Creates a list of the food groups using their unique row structure in the table to identify them
fgroup <- data.df %>%
filter(is.na(`Food name in English`), !is.na(Code)) %>%
pull(Code) %>%
stringr::str_split_fixed( '/', n = 2) %>%
as_tibble() %>%
pull(V1)
group.id <- unique(str_extract(data.df$Code, "^[:digit:]{2}\\_"))[-1]
```
# Creating the food_group variable in the FCT
```{r}
# Removes any rows without a food description entry (the food group name rows, and a row that have already been used for naming)
data.df <- data.df %>% #Identifies the food group number from the fdc_id, and applies the correct food_group from the fgroup list to the food_group column
mutate(food_group = ifelse(grepl("01_", Code), fgroup[1],
ifelse(grepl("02_", Code), fgroup[2],
ifelse(grepl("03_", Code), fgroup[3],
ifelse(grepl("04_", Code), fgroup[4],
ifelse(grepl("05_", Code), fgroup[5],
ifelse(grepl("06_", Code), fgroup[6],
ifelse(grepl("07_", Code), fgroup[7],
ifelse(grepl("08_", Code), fgroup[8],
ifelse(grepl("09_", Code), fgroup[9],
ifelse(grepl("10_", Code), fgroup[10],
ifelse(grepl("11_", Code), fgroup[11],
ifelse(grepl("12_", Code), fgroup[12],
ifelse(grepl("13_", Code), fgroup[13],
ifelse(grepl("14_", Code), fgroup[14], 'NA')))))))))))))))
```
### Diving combined variables into two (or more) columns
In some cases, to minimise the number of missing values, two similar food components are combined into one column/ variable. For
example, when Beta-carotene and Beta-Carotene Eq. were reported in the same column
and identified using brackets ([]) around the component values (`CARTEB or [CARTEBEQ]`). However, because we are aiming to have one variable per column, we need to divide the combined variables into two (or more) columns, as such the Beta-carote variable was separated into two independent columns (`CARTEBmg`, `CARTBEQmg`). Note that when we are separating these two food components into the new variable, there are several things that we need to check: we need to check whether they are present in the original FCT, otherwise we may be overwriting values.
1) The variable is not present, then create a new column.
2) The variable is present: only overwrite if the value is missing in the original column.
3) Do not remove the brackets from the original variable. As we will add information (metadata) about the quality of the value after removing the brackets.
### Renaming variable names: Food components definition and re-naming
Each FCT has its own variable names, including for similar food components. Some
FCTs included information related to the FAO/ INFOODS food component identifiers
[(Tagnames)](https://www.fao.org/infoods/infoods/standards-guidelines/food-component-identifiers-tagnames/en/) while other did not. In order to merge all data from various FCTs we
need to harmonise the names of all the variables and particularly the
food components of interest. To do so, we evaluated and renamed them
with the most appropriate Tagname. Other variables (e.g., food id, food
name or food description) were renamed to a common variable name, for
instance, `fdc_id`, `food_desc`.
:::{.callout-note}
## Question
Are the food component variable names using Tagnames & units, i.e., \[Variable\]\[unit\] (NAmg)?
:::
If yes, use that information to rename food components
```{r}
# Checking current names
names(data.df)
```
If not, do they provide Tagnames information?
```{r}
# Checking for Tagnames
data.df %>%
head(5) %>%
knitr::kable()
```
In this case we can use the Tagname information on the two first rows to rename our variables.
```{r}
# Automatic renaming
for( i in 8:62){ #Loops through each column between column 8 and 64
first_row <- toString(data.df[1, i]) #Takes the first row for that column and assigns it to a variable
second_row <- toString(data.df[2, i]) #Takes the second row for that column and assigns it to a variable
split_string <- str_split(first_row, "\\(") #Splits the first row around "(", assigning the two resulting strings to a variable
units_int <- gsub("\\*|\\(|\\)", "", split_string[[1]][length(split_string[[1]])]) #Separates the units out from the split_string (everything after the last open bracket)
colnames(data.df)[i] <- paste0(second_row, units_int) #The column name is replaced with row 1 and the units from row 2
}
# Checking new names
names(data.df)
```
If information is not provided the manual renaming of the food components would be necessary and hence, the identification of each food component to its Tagname.
##### The identification of food components
Information on the food components and their description should be
sought in for FCT. We advise to use the [(Tagnames)](https://www.fao.org/infoods/infoods/standards-guidelines/food-component-identifiers-tagnames/en/). Some minor changes in the Tagnames are introduced
to be compliant with R conventions. E.g., removing spaces in variable
names, changing symbols to characters (e.g., µg to mcg), or
standardising the name formatting from using underscores and/or
parenthesis to using only underscores. Also, changing dashes (-) to
underscores (\_). Note that within the Tagnames, the dash is used to
denote that the method for obtaining that (component) value is unknown.
This is important for the quality assessment of the data.
Also, we also assumed that all the variables labelled as "standardised"
were combined or recalculated variables. In @tbl-tagnames, we provided a list of the most relevant food information and components and their standard variable names, including the Tagnames. It also provides information about the variable name reference (i.e., FAO/INFOODS, 2012) together with the variable source (as it was compiled from the FCT or should be calculated), and the type of variable (e.g., numeric, etc).
```{r, eval = TRUE, label = "tbl-tagnames", tbl.cap = "The list of relevant standardised food variable names, including the Tagnames, and other relevant information."}
read.csv(here::here("data", "fct_variables_standards.csv")) |>
stringr::str_squish() |> knitr::kable()
```
Some of the variables that are frequently manually renamed are, for instance, food code/id, food name, etc. For, easy manual renaming, just change the names in quotes ("") to those in your dataset (if needed), remove/ add as needed.
```{r}
data.df <- data.df %>%
dplyr::rename(
fdc_id = "Code", # Food id/code
food_desc = "Food name in English", # Food name/description
food_descFR = "Food name in French",# Food name/description
scientific_name = "Scientific name", # Scientific name
Edible_factor_in_FCT = "Edible portion coefficient 1 (from as purchased to as described)", # Edible portion
Edible_factor_in_FCT2 = "Edible portion coefficient 2 (from as described to as eaten)", # Edible portion
nutrient_data_source = "BiblioID/Source") # Reference for NVs reported
names(data.df)[1:5]
```
::: {.callout-important}
## Check
Are the variable names = column names?
If not, more formatting is needed (back to previous step)
:::
### Standardisation of values
To perform mathematical operations, characters needed to be converted
into numeric operator. This includes three steps:
#### Removing brackets or other special characters.
As, described above in section, special character ("[]", "()", "\*")were used to denote "low quality values" and/or alternative (determination) methods. We kept record of those values in metadata files for those cases where the numeric
value was extracted and the special character removed. In addition, the variable `comments` was created in a way of keeping that information as metadata for other users.
```{r}
#Creating a dataset w/ the values that were of low quality [] trace or normal
var_nut <- data.df %>% select(Edible_factor_in_FCT:VITCmg) %>% colnames() #selecting nutrient variable names where we want to check for quality/trace
#dataset w/ metadata info that will be removed from the dataset for use
metadata <- data.df %>% mutate_at(var_nut, ~case_when(
str_detect(. , '\\[.*?\\]') ~ "low_quality", #Looking for things in square brackets to mark as low quality
str_detect(. , 'tr') ~ "trace", #Looking for things marked as "tr" and labels them as trace
TRUE ~ "normal_value")) #Else it marks the entry as a normal value
```
The following chunk is related to formatting the FCT section, as here we are creating new varibles to separate those Tagnames and/or similar food components into their individual columns with their Tagnames.
```{r}
#Extracting variables calculated with different (lower quality) method
#and reported as using [] and removing them from the original variable
data.df <- data.df %>%
#Creating calculated values from the lower quality method and removing the original values from the original variable
mutate(FATCEg = str_extract(FATg, '(?<=\\[).*?(?=\\])'),
#e.g. this creates the FIBCg value from the FIBTGg value
FIBCg = str_extract(FIBTGg, '(?<=\\[).*?(?=\\])'),
CARTBmcg = ifelse(is.na(CARTBmcg), str_extract(CARTBEQmcg, '(?<=\\[).*?(?=\\])'), CARTBmcg),
TOCPHAmg = ifelse(is.na(TOCPHAmg),str_extract(VITEmg, '(?<=\\[).*?(?=\\])'), TOCPHAmg ),
NIAmg = ifelse(is.na(NIAmg), str_extract(NIAEQmg, '(?<=\\[).*?(?=\\])'), NIAmg),
FOLSUMmcg = str_extract(FOLmcg, '(?<=\\[).*?(?=\\])'),
PHYTCPPD_PHYTCPPImg = str_extract(PHYTCPPmg, '(?<=\\[).*?(?=\\])'))
```
#### Changing characters into numeric
For instance, values that were reported to be trace ("tr") or below the detection limit ("<LOD") were converted to zero (0). These changes are also recorded in the `comments` variable as part of the metadata available.
```{r}
#The following f(x) removes [] and changing tr w/ 0
no_brackets_tr <- function(i){
case_when(
str_detect(i, 'tr|[tr]') ~ "0",
str_detect(i, '\\[.*?\\]') ~ str_extract(i, '(?<=\\[).*?(?=\\])'),
TRUE ~ i)
}
data.df <- data.df %>%
mutate_at(var_nut, no_brackets_tr) #This applies the above function
```
#### Extracting information
Some food component information, for instance alcohol content, could be reported within the food description instead of in a independent variable. Hence that information needs to be extracted and a new variable generated.
#### Converting into numeric
```{r}
# Converting to numeric
wafct <- wafct %>% mutate_at(vars(`Edible_factor_in_FCT`:`PHYTCPPD_PHYTCPPImg`), as.numeric) #Converts certain columns (the data columns) to numeric
```
### Standardising unit of measurement <br>
To standardise and merge the different FCTs, food components need to be
reported in the same units. For example, some nutrients needed to be
convert from mg/100g to g/100g, or from percentage (100%) to a fraction (1). For all the unit conversion we followed the FAO/INFOODS Guidelines for
Converting Units, Denominators and Expression (FAO/INFOODS, 2012b), and
the suggested standard reporting units.
Eg. Converting alcohol from weight in volume (w/v) to weight in mass (w/m) (Eq.1a) or percentage of alcohol (v/v) into weight mass (Eq.1b) (See INFOODS Guidelines - page 12).
**Eq.1a** ALC (g/100mL) (w/v) / density (g/mL) = ALC (g/100 EP)
**Eq.1b** ALC (%) (v/v) * 0.789 (g/mL) / density (g/mL) = ALC (g/100 EP)
Eg. amino acids (AA) reported per g in 100g of PROT to mg in 100g of EP:
**Eq.2.1** AA mg/100g EP = AA mg/g prot \* prot g/100g EP /100
**Eq.2.2** AA mg/100g EP = AA g/ 100g prot \* prot g/100g EP /100 \*
1000/100
**Eq.2.3** AA mg/100g EP = AA g/100g prot \* prot g/100g EP \*10
\\Eq.3: Edible portion = Edible portion (%)/100
##### Data quality and reporting (e.g., method of analysis, good metadata)
General quality checks are: the level of detail in the food description, the methods used for nutrient values compilation, and the documentation and degree of detail.
Other quality checks that can be performed are: calculating sum of proximate and re-calculating the values of: Carbohydrates available by difference, energy, etc. These is covered in the visualisation and QC section.
### Saving the output
We are saving the standardised FCT into the data folder, for use in the future.
```{r}
# Data Output
write.csv(data.df, file = here::here("data", paste0(FCT_id, "_FCT_FAO_Tags.csv")),
row.names = FALSE) #Saves the newly-created data table to the Output folder
#Run this to clean the environment
rm(list = ls())
```
## Further readings
1. Charrondiere, U.R., Stadlmayr, B., Grande, F., Vincent, A., Oseredczuk, M., Sivakumaran, S., Puwastien, P., Judprasong, K., Haytowitz, D., Gnagnarella, P. 2023. FAO/INFOODS Evaluation framework to assess the quality of published food composition tables and databases - User guide. Rome, FAO. https://doi.org/10.4060/cc5371en