fct_standardisation.qmd

# Food Composition Table & Databases: Standardisation

## Introduction

![](images/Workflow_v2.png){#fig-framework}

### Selecting food composition data

When selecting the food composition table or database (FCT) that will be used, it is good to reflect on the following questions:

1.  Relevancy for the study/context (e.g., is that FCT/FCBD geographically and culturally close to our survey scope?).

2.  FCT availability & missing values (e.g., are relevant foods and nutrients reported?).

3.  Data quality and reporting (e.g., what are the method of analysis and metadata available?).


### Objective

This document provide, together with the template document, the steps and description for cleaning and standardising FCTs from diverse sources. More details about the cleaned data that can be found in the repository is documented in this folder (documentation).

For easy navigation and use of this script it is recommended to use Rstudio. In RStudio please click the "Show Document Outline" button to the right of the source button, at the top right of this window. This will allow for easier navigation of the script.

### Environment Prep

First we need to check what (@sec-packages) are installed. If you have run this template before in this RStudio project and are sure these packages are already installed, you can comment out (put a hash at the start of) line 20, and skip it.

```{r}

# Run this to clean the environment
rm(list = ls())

# Loading libraries

library(readxl) # reading and writing excel files
library(stringr) # character string handling
library(dplyr) # cleaning data
library(here) # file management

```


## Obtaining the raw (FCT) file

### Data license 

Before using any dataset, we recommend to check licensing conditions & record the data source, you can use the [README template](link-to-readme-template).

### Downloading the data {#sec-download}

If the data is publicly available online, usually you only need to run the code below to obtain the raw files. Remember you only need to do it the first time! Then, the data will be stored in the folder of your choice (see below).

For instance, many raw files can be found provided by the FAO [here](https://www.fao.org/infoods/infoods/tables-and-databases/en/), in various formats. 

Once the link to the data is found, check what file type it is, and paste the direct file link to replace the fill-in value below.

### File names conventions

We advise to use the ISO code (2 digits) (see [ISO 3166 2-alpha code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes) for further information) of the country or the region of the FCT scope, plus the two last digits of the year of publication to name, both the folder which will contain the data and the scripts related to the FCT. For instance, Western Africa FCT, 2019 will be coded as WA19. This will help with the interoperability, reusability and findability of the data. Also, to streamline the work in the future. That name convention will be used also as the identifier of the FCT. 

Note that you need to create the folders to store the FCT. 

```{r warning=FALSE}

f <- "https://www.fao.org/fileadmin/user_upload/faoweb/2020/WAFCT_2019.xlsx"
 
 download.file(f, 
             destfile = here::here("data", # data folder
                                   'WA19', #FCT folder
                                   #FCT file
                                   "WAFCT_2019.xlsx"),  
             method="wininet", #use "curl" for OS X / Linux, "wininet" for Windows
             mode="wb")
 

```

If using an RStudio project, and you put the .R file and the data file in the same folder as the RStudio project or within a subfolder, files and folders are much easier to navigate as your project/here::here location automatically moves to the main project folder. 

### Getting your file into R

If you are using an RStudio project but used a different download method, or already have the file you want to process on your computer, or are using base R we can still use the `here::here()` function, however we will have to find the file first. 

The best practice is to put the file in the same folder as this script, or in a folder within the project. If this is done, then use `here::here()` to find your current working directory, and then navigate to the file's folder. More information about the here package can be found [here](https://here.r-lib.org/).


```{r eval=FALSE}

# Run this script to see where is your directory
here::here()

```

In order to navigate R to the file, you have to include each subfolder between the `here::here()` location and the file itself. So for our file structure, we have the 'data' folder, the 'FCT' folder and the FCT file). 

Find your file in your project, and then direct `here::here()` to it.

```{r}

# This identifies the file and file path, and saves it as a variable
FCT_file_location <- here::here('data','WA19', "WAFCT_2019.xlsx") 

```


## Importing the data (loading the data)

### Importing Files

After having found the file on your system that we want to import it. FCT files come in many different forms - the most common being ".xlsx" files and ".csv" files. Methods to import both of these file types will be covered - please navigate to the relevant subsection.

During import, a identifier for the FCT is created and added to the table. Please replace 'WA19' from the next code chunck with the FCT id., comprised of the countries [ISO 3166 2-alpha code](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes), and the year the FCT was produced (e.g. for the Western Africa FCT from 2019, the reference would be 'MWA9'). 

This should be the same as the folder name explained in (section 1.3)\[link-to-section\].

```{r}

# This is an example of the name 
FCT_id <- 'WA19' # Change two first letter for your ISO 2 code & the two digits for the last two digits of the year of publication.

```

If you used the download method above @sec-download then we will see the same location as specified there to specify the file. We can simply use the object `FCT_file_location`, or copy the contents inside the `here::here()` and use it to fill the `here::here()` in the line of code below.

#### Importing .xlsx files

For the spreadsheet (excel-type of files), first, you need to check what information is provided and which of the sheet is providing the FC data.


```{r}

# Checking the sheets

readxl::excel_sheets(FCT_file_location)

data.df <- readxl::read_excel(FCT_file_location, #The file location, as                            identified in section 2.1
                              sheet = 5  # Change to the excel sheet where                              the FCT is stored in the excel file
                              ) %>%  
  mutate(source_fct = FCT_id)  #Creates the source_fct column and fills with                 a id for this FCT, as filled in in section 2.2. 

```

#### Importing .csv files

```{r eval = FALSE}

data.df <- read.csv2(FCT_file_location, #The file location, as identified in section 2.1
                     sep = ",") %>%  # Replace w/ other symbol if needed
  mutate(source_fct = FCT_id) #Creates the source_fct column and fills with a id for this FCT, as filled in in section 2.2. 

```

Once imported, it is important to check the data.frame created from the csv, by using `head(data.df)` or clicking on its entry in the Environment panel of RStudio (This second option is not advised with very large files, however, as it can be slow).

If the data shown by doing this has all its columns combined, with a symbol in-between, then that symbol (e.g. ';') is the separator for that csv. Replace comma in the `sep = ","` line from the code block above with the new symbol, and run the entire block again.


#### Visually checking the data

```{r}

# Checking the dataframe
head(data.df) 

```

#### Checking the loaded data

:::{.callout-note} 
## Question

How many rows & columns have the data?

:::

You can use the function `dim()` to answer to check the number of rows and column.


::: {.callout-tip collapse="true"}

## Answer

```{r }

dim(data.df) # rows & columns

```

:::

Other useful functions to evaluate the structure of the data are:

```{r eval=FALSE}

# Structure (variable names, class, etc.)
str(data.df)

# Checking the last rows and columns
tail(data.df) 

```

::: {.callout-caution}
## Using `View()`

The function `View()` can be use for opening the dataframe in a tab. However, if the dataset is very very big, it may crash the R session.

:::

After checking that the correct FCT file  have loaded the, then proceed. If not, find the correct file and import it instead.

## Cleaning (tidying) and standardising the data

### Formatting FCT into a tabular format 

#### Trimming dataframe rows

Running this will trim down the table to only include the row numbers between x and y - replace x and y with your desired values. If you wanted to include multiple row ranges, that is also possible - use comments to differentiate between different row ranges and individual rows. e.g. if you wanted to include rows a:b, row c, row e, and rows g:x, then the code would be `slice(a:b, c, e, g:x)`.

```{r}

data.df %>% slice(1:5) %>% knitr::kable()

```

#### Trimming dataframe columns

If you only wish to include certain columns/nutrients, then you might wish to remove the unnecessary columns to make the dataframe easier to read and manage. This can be done through 2 methods; either by selecting the names of the columns you want to keep, or by selecting the names of the columns you want to remove.

#### Keep specified columns only

This method requires creating a list of column names you want to keep - for #example the line below would select the columns 'Energy_kcal', 'Fatg', 'Protein_g', but nothing else. If you wish to trim the columns this way, replace the items in the first line with the column names you want to keep, then run the code block below.

```{r warning=FALSE}

# Storing the variables you want to keep
columns_to_keep <- c('Scientific name', 'Energy\r\n(kJ)')

# Selecting the variables
data.df %>% select(columns_to_keep) %>% 
  head(5) %>% 
  knitr::kable()

```

#### Remove specified columns, keep all others

Sometimes it is easier to list the columns you want to remove, rather than the ones you want to keep. The code block below identifies the columns to be removed ('VitB12_mcg' and 'Calcium_mg' in the example), and then removes them. If you wish to trim the columns this way, replace the items in the first line with the column names you want to remove, then run the code block below.

This works in a similar way to the codeblock in section 3.3.1, however by putting an exclamation mark (!) before the list of columns, it inverts the selection - instead of instructing R to keep only the listed columns (as with the codeblock above), it instructs R to keep all columns but the listed ones.

```{r warning=FALSE}

# Selecting the variables that you don't want to keep
columns_to_remove <- c('Food name in French', 'Sum of proximate components\r\n(g)') 

data.df %>% select(!columns_to_remove) %>% 
  head(5) %>% 
  knitr::kable()

```

### Creating food groups variable 

Some food composition tables reported food
groups that were placed as the first row of each category, however that it is not a data structure that can be used, as we need one column per variable. Hence, the food group names are extracted from the rows, and are allocated as a new attribute of each food (e.g., fish and fishery products to catfish). The food groups are stored in a new column (`food_group`).

This process requires multiple steps, each covered in their own subsections below: Extracting food group names, Creating the variable, and checking changes in the structure.

#### Extracting food group names

```{r warning=FALSE}

#Creates a list of the food groups using their unique row structure in the table to identify them

fgroup <- data.df %>% 
  filter(is.na(`Food name in English`), !is.na(Code)) %>%
  pull(Code) %>%
  stringr::str_split_fixed( '/', n = 2) %>% 
  as_tibble() %>%
  pull(V1) 

group.id <-  unique(str_extract(data.df$Code, "^[:digit:]{2}\\_"))[-1]

```

# Creating the food_group variable in the FCT

```{r}

# Removes any rows without a food description entry (the food group name rows, and a row that have already been used for naming)

data.df <- data.df %>% #Identifies the food group number from the fdc_id, and applies the correct food_group from the fgroup list to the food_group column 
  mutate(food_group = ifelse(grepl("01_", Code), fgroup[1],
                      ifelse(grepl("02_", Code), fgroup[2],
                      ifelse(grepl("03_", Code), fgroup[3],
                      ifelse(grepl("04_", Code), fgroup[4],
                      ifelse(grepl("05_", Code), fgroup[5],
                      ifelse(grepl("06_", Code), fgroup[6],
                      ifelse(grepl("07_", Code), fgroup[7],
                      ifelse(grepl("08_", Code), fgroup[8],
                      ifelse(grepl("09_", Code), fgroup[9], 
                      ifelse(grepl("10_", Code), fgroup[10], 
                      ifelse(grepl("11_", Code), fgroup[11], 
                      ifelse(grepl("12_", Code), fgroup[12],
                      ifelse(grepl("13_", Code), fgroup[13],
                      ifelse(grepl("14_", Code), fgroup[14],                          'NA')))))))))))))))

```


### Diving combined variables into two (or more) columns

In some cases, to minimise the number of missing values, two similar food components are combined into one column/ variable. For
example, when Beta-carotene and Beta-Carotene Eq. were reported in the same column
and identified using brackets ([]) around the component values (`CARTEB or [CARTEBEQ]`). However, because we are aiming to have one variable per column, we need to divide the combined variables into two (or more) columns, as such the Beta-carote variable was separated into two independent columns (`CARTEBmg`, `CARTBEQmg`). Note that when we are separating these two food components into the new variable, there are several things that we need to check: we need to check whether they are present in the original FCT, otherwise we may be overwriting values. 

1) The variable is not present, then create a new column.
2) The variable is present: only overwrite if the value is missing in the original column.
3) Do not remove the brackets from the original variable. As we will add information (metadata) about the quality of the value after removing the brackets. 

### Renaming variable names: Food components definition and re-naming

Each FCT has its own variable names, including for similar food components. Some
FCTs included information related to the FAO/ INFOODS food component identifiers
[(Tagnames)](https://www.fao.org/infoods/infoods/standards-guidelines/food-component-identifiers-tagnames/en/) while other did not. In order to merge all data from various FCTs we
need to harmonise the names of all the variables and particularly the
food components of interest. To do so, we evaluated and renamed them
with the most appropriate Tagname. Other variables (e.g., food id, food
name or food description) were renamed to a common variable name, for
instance, `fdc_id`, `food_desc`.


:::{.callout-note} 
## Question 

Are the food component variable names using Tagnames & units, i.e., \[Variable\]\[unit\] (NAmg)?

:::

If yes, use that information to rename food components

```{r}

# Checking current names
names(data.df)

```

If not, do they provide Tagnames information?

```{r}

# Checking for Tagnames
data.df %>% 
  head(5) %>% 
  knitr::kable()


```

In this case we can use the Tagname information on the two first rows to rename our variables.


```{r}

# Automatic renaming 

for( i in 8:62){ #Loops through each column between column 8 and 64
  first_row <- toString(data.df[1, i]) #Takes the first row for that column and assigns it to a variable
  second_row <- toString(data.df[2, i]) #Takes the second row for that column and assigns it to a variable
  split_string <- str_split(first_row, "\\(") #Splits the first row around "(", assigning the two resulting strings to a variable
  units_int <- gsub("\\*|\\(|\\)", "", split_string[[1]][length(split_string[[1]])]) #Separates the units out from the split_string (everything after the last open bracket)
  colnames(data.df)[i] <- paste0(second_row, units_int) #The column name is replaced with row 1 and the units from row 2
}

# Checking new names

names(data.df)

```

If information is not provided the manual renaming of the food components would be necessary and hence, the identification of each food component to its Tagname. 

##### The identification of food components

Information on the food components and their description should be
sought in for FCT.  We advise to use the [(Tagnames)](https://www.fao.org/infoods/infoods/standards-guidelines/food-component-identifiers-tagnames/en/). Some minor changes in the Tagnames are introduced
to be compliant with R conventions. E.g., removing spaces in variable
names, changing symbols to characters (e.g., µg to mcg), or
standardising the name formatting from using underscores and/or
parenthesis to using only underscores. Also, changing dashes (-) to
underscores (\_). Note that within the Tagnames, the dash is used to
denote that the method for obtaining that (component) value is unknown.
This is important for the quality assessment of the data.

Also, we also assumed that all the variables labelled as "standardised"
were combined or recalculated variables. In @tbl-tagnames, we provided a list of the most relevant food information and components and their standard variable names, including the Tagnames. It also provides information about the variable name reference (i.e., FAO/INFOODS, 2012) together with the variable source (as it was compiled from the FCT or should be calculated), and the type of variable (e.g., numeric, etc).


```{r, eval = TRUE, label = "tbl-tagnames", tbl.cap = "The list of relevant standardised food variable names, including the Tagnames, and other relevant information."}

read.csv(here::here("data", "fct_variables_standards.csv")) |>
  stringr::str_squish() |> knitr::kable()

```

Some of the variables that are frequently manually renamed are, for instance, food code/id, food name, etc. For, easy manual renaming, just change the names in quotes ("") to those in your dataset (if needed), remove/ add as needed.


```{r}

data.df <- data.df %>%
dplyr::rename( 
  fdc_id = "Code", # Food id/code 
food_desc = "Food name in English", # Food name/description
food_descFR = "Food name in French",# Food name/description
scientific_name = "Scientific name", # Scientific name
Edible_factor_in_FCT = "Edible portion coefficient 1 (from as purchased to as described)", # Edible portion
Edible_factor_in_FCT2 = "Edible portion coefficient 2 (from as described to as eaten)", # Edible portion
nutrient_data_source = "BiblioID/Source") # Reference for NVs reported

names(data.df)[1:5]

```


::: {.callout-important}
## Check

Are the variable names = column names?

If not, more formatting is needed (back to previous step)

:::


### Standardisation of values

To perform mathematical operations, characters needed to be converted
into numeric operator. This includes three steps: 

#### Removing brackets or other special characters.

As, described above in section, special character ("[]", "()", "\*")were used to denote "low quality values" and/or alternative (determination) methods. We kept record of those values in metadata files for those cases where the numeric
value was extracted and the special character removed. In addition, the variable `comments` was created in a way of keeping that information as metadata for other users.  


```{r}
#Creating a dataset w/ the values that were of low quality [] trace or normal

var_nut <- data.df %>% select(Edible_factor_in_FCT:VITCmg) %>% colnames() #selecting nutrient variable names where we want to check for quality/trace

#dataset w/ metadata info that will be removed from the dataset for use
metadata <- data.df %>% mutate_at(var_nut,  ~case_when(
  str_detect(. , '\\[.*?\\]') ~ "low_quality", #Looking for things in square brackets to mark as low quality
  str_detect(. , 'tr') ~ "trace", #Looking for things marked as "tr" and labels them as trace
  TRUE ~ "normal_value")) #Else it marks the entry as a normal value

```

The following chunk is related to formatting the FCT section, as here we are creating new varibles to separate those Tagnames and/or similar food components into their individual columns with their Tagnames. 

```{r}

#Extracting variables calculated with different (lower quality) method 
#and reported as using [] and removing them from the original variable

data.df <- data.df %>% 
#Creating calculated values from the lower quality method and removing the original values from the original variable
  mutate(FATCEg = str_extract(FATg, '(?<=\\[).*?(?=\\])'),  
   #e.g. this creates the FIBCg value from the FIBTGg value
         FIBCg =  str_extract(FIBTGg, '(?<=\\[).*?(?=\\])'),
         CARTBmcg = ifelse(is.na(CARTBmcg), str_extract(CARTBEQmcg, '(?<=\\[).*?(?=\\])'), CARTBmcg), 
         TOCPHAmg = ifelse(is.na(TOCPHAmg),str_extract(VITEmg, '(?<=\\[).*?(?=\\])'), TOCPHAmg ),
         NIAmg = ifelse(is.na(NIAmg), str_extract(NIAEQmg, '(?<=\\[).*?(?=\\])'), NIAmg), 
         FOLSUMmcg = str_extract(FOLmcg, '(?<=\\[).*?(?=\\])'), 
         PHYTCPPD_PHYTCPPImg = str_extract(PHYTCPPmg, '(?<=\\[).*?(?=\\])'))


```


#### Changing characters into numeric

For instance, values that were reported to be trace ("tr") or below the detection limit ("<LOD") were converted to zero (0). These changes are also recorded in the `comments` variable as part of the metadata available.

```{r}

#The following f(x) removes [] and changing tr w/ 0

no_brackets_tr <- function(i){
  case_when(
    str_detect(i, 'tr|[tr]') ~ "0",
    str_detect(i, '\\[.*?\\]')  ~ str_extract(i, '(?<=\\[).*?(?=\\])'),
    TRUE ~ i)
}

data.df <- data.df %>% 
  mutate_at(var_nut, no_brackets_tr) #This applies the above function


```


#### Extracting information

Some food component information, for instance alcohol content, could be reported within the food description instead of in a independent variable. Hence that information needs to be extracted and a new variable generated. 

#### Converting into numeric

```{r}

# Converting to numeric

wafct <- wafct %>% mutate_at(vars(`Edible_factor_in_FCT`:`PHYTCPPD_PHYTCPPImg`), as.numeric) #Converts certain columns (the data columns) to numeric

```


### Standardising unit of measurement <br>

To standardise and merge the different FCTs, food components need to be
reported in the same units. For example, some nutrients needed to be
convert from mg/100g to g/100g, or from percentage (100%) to a fraction (1). For all the unit conversion we followed the FAO/INFOODS Guidelines for
Converting Units, Denominators and Expression (FAO/INFOODS, 2012b), and
the suggested standard reporting units.

Eg. Converting alcohol from weight in volume (w/v) to weight in mass (w/m) (Eq.1a) or percentage of alcohol (v/v) into weight mass (Eq.1b) (See INFOODS Guidelines - page 12). 

**Eq.1a** ALC (g/100mL) (w/v) / density (g/mL) = ALC (g/100 EP)

**Eq.1b** ALC (%) (v/v) * 0.789 (g/mL) / density (g/mL) = ALC (g/100 EP) 

Eg. amino acids (AA) reported per g in 100g of PROT to mg in 100g of EP:

**Eq.2.1** AA mg/100g EP = AA mg/g prot \* prot g/100g EP /100

**Eq.2.2** AA mg/100g EP = AA g/ 100g prot \* prot g/100g EP /100 \*
1000/100

**Eq.2.3** AA mg/100g EP = AA g/100g prot \* prot g/100g EP \*10

\\Eq.3: Edible portion = Edible portion (%)/100


##### Data quality and reporting (e.g., method of analysis, good metadata)


General quality checks are: the level of detail in the food description, the methods used for nutrient values compilation, and the documentation and degree of detail.

Other quality checks that can be performed are: calculating sum of proximate and re-calculating the values of: Carbohydrates available by difference, energy, etc. These is covered in the visualisation and QC section. 

### Saving the output

We are saving the standardised FCT into the data folder, for use in the future.

```{r}

# Data Output 

write.csv(data.df, file = here::here("data", paste0(FCT_id, "_FCT_FAO_Tags.csv")), 
          row.names = FALSE) #Saves the newly-created data table to the Output folder

#Run this to clean the environment
rm(list = ls())

```

## Further readings

1. Charrondiere, U.R., Stadlmayr, B., Grande, F., Vincent, A., Oseredczuk, M., Sivakumaran, S., Puwastien, P., Judprasong, K., Haytowitz, D., Gnagnarella, P. 2023. FAO/INFOODS Evaluation framework to assess the quality of published food composition tables and databases - User guide. Rome, FAO. https://doi.org/10.4060/cc5371en