This repository is specific to one dataset: a nested set of files exported from the instrument in AIA (.aia) format. The batch export contains 301 .aia folders, each holding five signals (four UV-Vis signals and one fluorescence signal). The data were generated with a scanning gradient in high-performance liquid chromatography (HPLC).
Step 1: Create a data frame with all the .aia folder names and paths
Step 2: Locate the five signal files inside each folder
Step 3: Read the data from each signal file
Step 4: Extract the spectra for each signal into separate data frames
I started by unnesting the files into workable lists. This approach can always be improved and adapted to the desired analysis pipeline. With 301 folders of five signals each, a total of 1505 spectra are processed.
flowchart TD
A[Step 1 <br/> <br/> A list of 301 .aia files. Each file belongs to a wine sample.] --> |unpack each file| B(Step 2 <br/> <br/> signal_1:uv-vis 280nm <br/> signal_2:uv-vis 320nm <br/> signal_3:uv-vis 360nm <br/> signal_4:uv-vis 420nm <br/> signal_5:fluorescence)
B--> C(Step 3 <br/> <br/> Read the data using the ncdf4 library)
C--> |extract the spectra| D(Step 4 <br/> <br/> spectra = peak retention time vs peak area.)
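The first two steps of the flowchart can be tried out on a toy folder layout before pointing the script at the real export. The sketch below is only illustrative: the folder names (`AVN01_r1.aia` and so on) and the temporary directory are hypothetical, and it uses base R `substr()` where the script uses `str_sub()`.

```r
# Toy layout: three hypothetical .aia folders in a temporary directory
root <- file.path(tempdir(), "cdf_demo")
folders <- c("AVN01_r1.aia", "AVN01_r2.aia", "CDB01_r1.aia")
for (f in folders) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
}

# Step 1: one row per folder, with its full path
demo <- data.frame(filename = list.files(root))
demo$filepath <- file.path(root, demo$filename)

# Labels: drop the last 4 characters (".aia") for the repeat,
# and the last 7 characters ("_r1.aia") for the sample
n <- nchar(demo$filename)
demo$repeats <- substr(demo$filename, 1, n - 4)
demo$samples <- substr(demo$filename, 1, n - 7)

# Step 2: five signal paths per folder (folders x signals matrix)
signals <- sprintf("SIGNAL%02d.cdf", 1:5)
signal_paths <- outer(demo$filepath, signals, file.path)
dim(signal_paths)
```

With the real data the same layout gives 301 rows and five signal columns, which is where the 1505 spectra come from.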
library("tidyverse") # to wrangle data frames
library("ncdf4")     # to read the .cdf signal files
library("glue")      # to format the progress messages

# Step 1: list every .aia folder in the CDF directory
hplc_wines <-
  data.frame(
    filename = list.files(
      "C:/Users/mafata/Desktop/WORK/Collaborative Work/HPLC scanning/CDF files"
    )
  )

# Build the full path to each .aia folder
hplc_wines <- hplc_wines %>%
  mutate(
    filepath = paste0(
      "C:/Users/mafata/Desktop/WORK/Collaborative Work/HPLC scanning/CDF files/",
      filename
    )
  )

# Step 2: add sample and repeat labels and the path to each signal file.
# mutate() is vectorised, so no loop over the rows is needed.
hplc_wines <- hplc_wines %>%
  mutate(
    samples = str_sub(filename, end = -8),
    repeats = str_sub(filename, end = -5),
    uv_280 = paste0(filepath, "/SIGNAL01.cdf"),
    uv_320 = paste0(filepath, "/SIGNAL02.cdf"),
    uv_360 = paste0(filepath, "/SIGNAL03.cdf"),
    uv_420 = paste0(filepath, "/SIGNAL04.cdf"),
    fluo = paste0(filepath, "/SIGNAL05.cdf") # FLD , Ex=280, Em=320
  )
datasets <- c("uv_280", "uv_320", "uv_360", "uv_420", "fluo")

for (dataset in datasets) {
  dataset_list <- list()
  for (i in seq_along(hplc_wines[[dataset]])) {
    # Step 3: open the signal file of sample i with ncdf4
    dataset_file <-
      ncdf4::nc_open(
        hplc_wines[[dataset]][i],
        write = FALSE,
        readunlim = FALSE,
        verbose = FALSE,
        auto_GMT = TRUE,
        suppress_dimvals = FALSE,
        return_on_error = FALSE
      )
    dataset_list <- append(dataset_list, list(dataset_file))
  }
  names(dataset_list) <- hplc_wines$filename

  # Keep one named list per signal
  if (dataset == "uv_280") {
    uv280_list <- dataset_list
  } else if (dataset == "uv_320") {
    uv320_list <- dataset_list
  } else if (dataset == "uv_360") {
    uv360_list <- dataset_list
  } else if (dataset == "uv_420") {
    uv420_list <- dataset_list
  } else {
    fluo_list <- dataset_list
  }
}
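# Step 4: extract the spectrum (peak retention time vs peak area) of each sample
uv_280_spectra <- list()
for (i in seq_along(uv280_list)) {
  # Add verbosity to the script
  print(glue(". . . generating sample number {i} uv 280 nm spectra"))

  # ASSUMPTION: the right-hand sides of the two assignments below were lost
  # from the source; reading the "peak_retention_time" and "peak_area"
  # variables with ncdf4::ncvar_get() is a reconstruction.
  peak_retention_time <-
    data.frame(ncdf4::ncvar_get(uv280_list[[i]], "peak_retention_time"))
  colnames(peak_retention_time) <- "peak_retention_time"
  peak_retention_time <- format(round(peak_retention_time, 0), nsmall = 0)

  peak_area <-
    data.frame(ncdf4::ncvar_get(uv280_list[[i]], "peak_area"))
  colnames(peak_area) <- hplc_wines$repeats[i]

  uv_280_spectra <- append(uv_280_spectra,
                           list(name = c(peak_retention_time, peak_area)),
                           after = length(uv_280_spectra))
}
names(uv_280_spectra) <- hplc_wines$filename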
The next hurdle is plotting the chromatograms to check that the data look right. It is always good practice to inspect the spectra for anything unusual.
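A quick visual check could look like the sketch below. It is not the plotting code used for this project: the data frame `demo_spectrum` and its values are made up for illustration, and only the column names follow the script.

```r
library(ggplot2)

# Hypothetical spectrum of one sample: peak retention time against peak area
demo_spectrum <- data.frame(
  peak_retention_time = c(2, 5, 9, 14, 21),
  peak_area = c(120, 860, 340, 95, 410)
)

# Line plus points, so sparse data are easy to spot
chromatogram <- ggplot(demo_spectrum,
                       aes(x = peak_retention_time, y = peak_area)) +
  geom_line() +
  geom_point() +
  labs(x = "Peak retention time", y = "Peak area")

chromatogram
```

Note that `format()` in the extraction step returns character values, so the retention times would need `as.numeric()` before they can be used on a continuous axis.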
From the previous section we can see that something is off: these are not the expected spectral aesthetics. I exported this dataset in AIA format. Previously, before I started using a scripting language, I exported each spectrum as a series of values and painstakingly copied and pasted them into an Excel sheet. Those values were sampled more frequently than these. The difference may come from the settings I chose when exporting the data in the different formats, so I need to find out more about how Agilent instrument data exports are structured and formatted, since I suspect it may require peak integration.
The data were pasted as pairs of columns, one (RT, Abs) pair per sample, so we extract each pair and merge them on retention time as a naive way of aligning the spectra.
hplc_wines <-
  readxl::read_excel(path = "/Users/~/HPLC scouting.xlsx",
                     sheet = "280 nm SB")

uv_280 <- list()
# Each sample occupies two columns (RT, Abs), so loop over column pairs
for (i in seq_len(ncol(hplc_wines) %/% 2)) {
  first <- (2 * i) - 1
  second <- 2 * i
  sample_i <- hplc_wines[c(first, second)]
  colnames(sample_i)[1] <- "rt"
  sample_i <- list(sample_i)
  uv_280 <- append(uv_280, sample_i)
}
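# ASSUMPTION: the original expression was truncated in the source; taking the
# header of every second column (the Abs column of each pair) is a reconstruction.
name_list <-
  colnames(hplc_wines)[seq(2, ncol(hplc_wines), by = 2)]
names(uv_280) <- name_list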
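# create a merged data frame
# ASSUMPTION: the wrapper around each list element and the `cols` argument
# were lost from the source; as.data.frame() and `cols = -rt` are reconstructions.
merged_uv280_spectra <- full_join(x = as.data.frame(uv_280[[1]]),
                                  y = as.data.frame(uv_280[[2]]),
                                  by = "rt")
for (i in 3:length(uv_280)) {
  merged_uv280_spectra <-
    full_join(
      x = merged_uv280_spectra,
      y = as.data.frame(uv_280[[i]]),
      by = "rt"
    )
}

# Reshape to long format: one row per (rt, sample) pair
merged_uv280_spectra <- merged_uv280_spectra %>%
  pivot_longer(
    cols = -rt,
    names_to = "samples",
    values_to = "peak_area",
    values_drop_na = TRUE
  )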
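The pairwise extraction and merge can be checked on a toy table first. The sketch below is an assumption-laden illustration, not the project code: the `wide` data frame and its values are invented, and it uses base R `merge(all = TRUE)` with `Reduce()` in place of repeated `full_join()` calls.

```r
# Hypothetical sheet: two samples, each stored as an (rt, absorbance) pair
wide <- data.frame(
  rt_a = c(1, 2, 3), A = c(10, 20, 30),
  rt_b = c(2, 3, 4), B = c(5, 6, 7)
)

# Split the columns pairwise and give every pair the same key column name
n_pairs <- ncol(wide) %/% 2
pairs <- lapply(seq_len(n_pairs), function(i) {
  pair <- wide[c(2 * i - 1, 2 * i)]
  colnames(pair)[1] <- "rt"
  pair
})

# Full outer join of all pairs on retention time
merged <- Reduce(function(x, y) merge(x, y, by = "rt", all = TRUE), pairs)
merged
```

Rows are matched only where the retention times are exactly equal, which is why this counts as a naive alignment: two samples whose retention times differ slightly end up on separate rows with `NA` in the other column.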
If we compare the previous spectra with the one above, we see the difference in how frequently the values are sampled.
Having inspected the new spectra and feeling confident enough about the spectral properties, we can analyse and compare sample groups using multiple factor analysis (MFA).
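mfa_plot <- FactoMineR::MFA(
  base = mfa_input, # placeholder name: the data argument was lost from the source
  group = c(26, 25, 25, 25, 25, 25),
  type = c(rep("s", 6)),
  ncp = 4,
  name.group = c("AVN", "CDB", "DTK", "FRV", "KZN", "PDB"),
  graph = TRUE
)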
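The `group` vector tells MFA how many consecutive columns belong to each group, so it is worth checking the bookkeeping before running the analysis. The sketch below uses only the group sizes and names from the call above; the column ranges it prints assume the groups are laid out left to right in that order.

```r
group_sizes <- c(26, 25, 25, 25, 25, 25)
group_names <- c("AVN", "CDB", "DTK", "FRV", "KZN", "PDB")

# One name per group
stopifnot(length(group_sizes) == length(group_names))

# First and last column index of each group in the input table
last_col <- cumsum(group_sizes)
first_col <- last_col - group_sizes + 1
group_layout <- data.frame(group = group_names, first_col, last_col)
group_layout

# Total number of columns the input table must have
total_columns <- sum(group_sizes)
total_columns
```

If the number of columns in the input table differs from `total_columns`, the group definition and the data no longer match.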
Although we have managed to obtain a more typical spectrum (better-resolved baselines and more Gaussian peak symmetry), there seem to be some misalignments in the peaks, and the baseline is slightly offset among spectra.
Here we will see what we can do with that.
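As one possible starting point, and not necessarily the approach taken here, the sketch below shows the two problems on simulated data: a constant baseline offset is removed by subtracting the minimum of the spectrum, and the shift relative to a reference is estimated from the lag that maximises the cross-correlation (`stats::ccf`). The Gaussian peaks, the 0.1 retention time step, the 0.4 shift and the 0.2 offset are all invented for the example.

```r
# Simulated retention time axis and a Gaussian peak generator
rt <- seq(0, 10, by = 0.1)
peak <- function(center) exp(-(rt - center)^2 / (2 * 0.3^2))

reference <- peak(5)
shifted <- peak(5.4) + 0.2 # hypothetical: peak 0.4 later, baseline offset 0.2

# Baseline offset: subtract the minimum of the spectrum
shifted_bc <- shifted - min(shifted)

# Shift: lag (in data points) with the highest cross-correlation
xc <- ccf(shifted_bc, reference, lag.max = 20, plot = FALSE)
best_lag <- xc$lag[which.max(xc$acf)]

# Convert the lag from data points to retention time units
estimated_shift <- best_lag * 0.1
estimated_shift
```

A single lag only corrects a uniform shift. Real chromatograms often drift by different amounts along the run, and a sloping baseline needs more than a constant subtraction.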
