Writing good code that adheres to the best practices listed here makes it more digestible to anyone who examines it at any point -- collaborators, study reviewers, or even yourself, six months in the future. Writing best-practice code also makes finding bugs, fixing them, comparing replicated code, and making large breaking changes much easier.
Many data science projects, especially those with multiple contributors, would benefit from following a standardized workflow that is consistent across all projects. One way to do this is by using a directory structure that looks like this:
```
0-run-project.sh
0-config.R
1 - Data-Management/
    0-prep-data.sh
    1-prep-cdph-fluseas.R
    2a-prep-absentee.R
    2b-prep-absentee-weighted.R
    3a-prep-absentee-adj.R
    3b-prep-absentee-adj-weighted.R
2 - Analysis/
    0-run-analysis.sh
    1 - Absentee-Mean/
        1-absentee-mean-primary.R
        2-absentee-mean-negative-control.R
        3-absentee-mean-CDC.R
        4-absentee-mean-peakwk.R
        5-absentee-mean-cdph2.R
        6-absentee-mean-cdph3.R
    2 - Absentee-Positivity-Check/
    3 - Absentee-P1/
    4 - Absentee-P2/
3 - Figures/
    0-run-figures.sh
    ...
4 - Tables/
    0-run-tables.sh
    ...
5 - Results/
    1 - Absentee-Mean/
        1-absentee-mean-primary.RDS
        2-absentee-mean-negative-control.RDS
        3-absentee-mean-CDC.RDS
        4-absentee-mean-peakwk.RDS
        5-absentee-mean-cdph2.RDS
        6-absentee-mean-cdph3.RDS
    ...
.gitignore
.Rproj
```
For brevity, not every directory is "expanded", but we can glean some important takeaways from what we do see.
- Order Files and Directories - This makes the jumble of alphabetized filenames much more coherent and places related code and files next to one another. It also helps us understand how data flows from start to finish and allows us to easily map a script to its output (e.g., `2 - Analysis/1 - Absentee-Mean/1-absentee-mean-primary.R` => `5 - Results/1 - Absentee-Mean/1-absentee-mean-primary.RDS`). If you take nothing else away from this guide, this is the single most helpful suggestion for making your workflow more coherent.
  - Note: Directories have capitalized letters and spaces, but individual files do not.
- Use `.gitignore` and `.Rproj` files - There is a standardized `.gitignore` for `R` which you can download and add to your project. This ensures you're not committing log files or other things that would otherwise best be left ignored to GitHub. This is a great discussion of project-oriented workflows, extolling the virtues of self-contained, portable projects, for your reference.
  - Note: An "R Project" can be created within RStudio by going to `File >> New Project`. Depending on where you are with your research, choose the most appropriate option. This will save preferences, working directories, and even the results of running code/data (though I'd generally recommend starting from scratch each time you open your project). Then, whenever you work on that specific research project, open the project you created to enable the full utility of `.Rproj` files.
- Bash scripts - These are useful components of a reproducible workflow. At many of the directory levels (e.g., in `2 - Analysis/`), there is a bash script that runs each of the analysis scripts. This is exceptionally useful when data "upstream" changes -- you simply rerun the bash script. For big data workflows, "backgrounding" a bash script allows you to start a "job" (i.e., run the script) and leave it to run overnight. At the top level, a bash script (`0-run-project.sh`) that simply calls the directory-level bash scripts (i.e., `0-prep-data.sh`, `0-run-analysis.sh`, `0-run-figures.sh`, etc.) is a powerful tool for rerunning every script in your project. See the included example bash scripts for more details.
  - Running Bash Scripts in the Background: Running a long bash script is not trivial. Normally you would run a bash script by opening a terminal and typing something like `./run-project.sh`. But what if you leave your computer, log out of your server, or close the terminal? Normally, the bash script will exit and fail to complete. To run it in the background, type `./run-project.sh & disown`. You can see the job running (and CPU utilization) with the command `top` and check your memory with `free -h`.
  - Deleting Previously Computed Results: One helpful lesson we've learned is that your bash scripts should remove previous results (computed and saved by scripts run at an earlier time) so that you never mix results from one run with those of a previous run. Mixing can happen when an R script errors out before saving its result, and it can be difficult to catch because your previously saved result still exists (leading you to believe everything ran correctly).
  - Ensuring Things Ran Correctly: Once things are run, check the `.Rout` files generated by the R scripts your bash scripts ran for errors. A utility file called `runFileSaveLogs` is included in this repository and is used by the example bash scripts to... run files and save the generated logs. It is an awesome utility and one I definitely recommend using. For help and documentation, use the command `./runFileSaveLogs -h`.
- Use a Config File - This is the single most important file in your project. It is responsible for a variety of common tasks: declaring global variables, loading functions, declaring paths, and more. Every other file in the project begins with `source("0-config.R")`, and its role is to reduce redundancy and create an abstraction layer that allows you to make changes in one place (`0-config.R`) rather than in five different files. To this end, paths referenced in multiple scripts (e.g., a `merged_data_path`) can be declared in `0-config.R` and simply referred to by variable name in scripts. If you ever want to change things, rename them, or even switch from a downsample to the full data, all you then need to do is modify the path in one place and the change will automatically propagate throughout your project. See the example config file for more details, as well as the sketch below.
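For illustration, here is a minimal sketch of what a `0-config.R` might contain; the variable names, paths, and the helper-function file are hypothetical stand-ins, not files from this project:

```r
# 0-config.R -- a minimal, hypothetical sketch of a project config

library(tidyverse)
library(data.table)

# Global variables shared across scripts (values are illustrative)
flu_seasons = c("2014-15", "2015-16", "2016-17")

# Paths -- to switch from a downsample to the full data, edit one line here
raw_data_path    = "data/raw/absentee-raw.csv"
merged_data_path = "data/processed/absentee-merged.RDS"
results_dir      = "5 - Results/"

# Shared functions used throughout the project (hypothetical file)
source("lib/shared-functions.R")
```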
- File Headers - Every file in a project should have a header that allows it to be interpreted on its own. It should include the name of the project and a short description of what this file (among the many in your project) does specifically. You may optionally wish to include the script's inputs and outputs as well, though the next section makes this significantly less necessary.
```r
################################################################################
# @Organization - Example Organization
# @Project - Example Project
# @Description - This file is responsible for [...]
################################################################################
```
- File Structure - Just as your data "flows" through your project, data should flow naturally through a script. Very generally, you want to 1) source your config => 2) load all your data => 3) do all your analysis/computation => 4) save your results. Each of these sections should be "chunked together" using comments, as in the skeleton below. See this file for a good example of how to cleanly organize a file in a way that follows this "flow" and functionally separates pieces of code that are doing different things.
  - Note: If your computer isn't able to handle this workflow due to RAM or other hardware requirements, reordering your code to accommodate it won't ultimately help -- your code will become fragile, not to mention messy and less readable. In that case, you need to look into high-performance computing (HPC) resources.
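A minimal skeleton of this flow, assuming a config as sketched earlier (the data object and column names are hypothetical):

```r
library(dplyr)
library(readr)

# 1) Source the config (paths, packages, global variables)
source("0-config.R")

# 2) Load data -----------------------------------------------------------------
absentee_merged = read_rds(merged_data_path)

# 3) Analysis ------------------------------------------------------------------
mean_absences_by_school = absentee_merged %>%
  group_by(school) %>%
  summarize(mean_absences = mean(absences_ill, na.rm = TRUE))

# 4) Save results --------------------------------------------------------------
write_rds(mean_absences_by_school, "5 - Results/mean-absences.RDS")
```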
- Single-Line Comments - Commenting your code is an important part of reproducibility and helps document your code for the future. When things change or break, you'll be thankful for comments. There's no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. See this file for an example of how to comment your code and notice that comments are always in the form of:
```r
# This is a comment -- first letter is capitalized and spaced away from the pound sign
```
- Multi-Line Comments - Occasionally, multi-line comments are necessary. Don't add line breaks manually to a single-line comment just to make it "fit" on the screen. Instead, enable RStudio > Tools > Global Options > Code > "Soft-wrap R source files" to have long lines wrap around. Format your multi-line comments like the file header from above, as in the sketch below.
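For instance, a multi-line comment in the header style might look like this (the content is purely illustrative):

```r
################################################################################
# We exclude the 2017-18 school year here because the program changed mid-year;
# including it would mix pre- and post-program observations within one season.
################################################################################
```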
- Function Documentation - Functions need documentation. For any reproducible workflow, they are essential because R is dynamically typed. This means you can pass a `string` into an argument that is meant to be a `data.table`, or a `list` into an argument meant for a `tibble`. It is the responsibility of a function's author to document what each argument is meant to do and its basic type. Here is an example of documenting a function (inspired by JavaDocs and R's Plumber API docs):
```r
calculate_KSSS = function(centroids, statistical_input, time_column = "schoolyr", location_column = "school_dist", value_column = "absences_ill", population_column = "student_days", k_nearest_neighbors = 5, nsim = 9999, heat_map = TRUE, heat_map_title = NULL, heat_map_caption = NULL) {
  # @Description: Calculates the population-based Kulldorff Spatial Scan Statistic
  # @Arg: centroids: an SPDF from which the centroids of each school catchment area can be drawn, along with a uniquely identifying location_column (such as an ID or School-District combo)
  # @Arg: statistical_input: a tibble of at least 4 columns containing a location, time, value, and population size, ordered by (location_column, time_column)
  # @Arg: time_column: an integer or string column on which values can be clustered temporally
  # @Arg: location_column: an integer or string column on which values can be clustered spatially (must be a unique key)
  # @Arg: value_column: an integer or string column containing the count data for the given space-time
  # @Arg: population_column: an integer column containing the size of the population for the given space-time
  # @Arg: k_nearest_neighbors: an integer number of nearest neighbors considered when searching for clusters
  # @Arg: nsim: an integer number of Monte Carlo simulations used to assess cluster significance
  # @Arg: heat_map: a boolean variable dictating whether to plot a heat_map based on how likely a school catchment is to be part of a cluster
  # @Arg: heat_map_title: a string used as the title for a heat_map if one is drawn
  # @Arg: heat_map_caption: a string used as the caption for a heat_map if one is drawn
  # @Output: plots a heat map of local clustering and prints the total number of significant clusters if heat_map = TRUE
  # @Return: a list containing the results of running KSSS and a tibble of all clusters

  ...
  Some code here
  ...
}
```
Even if you have no idea what a `KSSS` is, you have some way of understanding what the function does, its various inputs, and how you might go about using the function to do what you want. Also notice that the function is defined on one line at the top (which will soft-wrap around) and that all optional arguments (i.e., ones with pre-specified defaults) follow the arguments that require user input.
- Note: As someone trying to call a function, you can access a function's documentation (and internal code) by `CMD`-Left-Clicking the function's name in RStudio.
- Note: Depending on how important your function is, the complexity of your function code, and the complexity of the different types of data in your project, you can also add "type-checking" to your function with the `assertthat::assert_that()` function. You can, for example, `assert_that(is.data.frame(statistical_input))`, which ensures that collaborators or reviewers of your project attempting to use your function call it in the way it is intended -- with (at the minimum) the correct types of arguments. You can extend this to ensure that certain assumptions regarding the inputs are fulfilled as well (e.g., that `time_column`, `location_column`, `value_column`, and `population_column` all exist within the `statistical_input` tibble); see the sketch below.
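A minimal sketch of such checks, pulled out into a hypothetical helper function (the column defaults mirror `calculate_KSSS()` above):

```r
library(assertthat)

# Hypothetical helper: validate statistical_input before running the KSSS
check_statistical_input = function(statistical_input,
                                   time_column = "schoolyr",
                                   location_column = "school_dist",
                                   value_column = "absences_ill",
                                   population_column = "student_days") {
  # Fail fast with an informative error if the caller passes the wrong type
  assert_that(is.data.frame(statistical_input))

  # Verify that the assumed columns actually exist in the tibble
  assert_that(all(c(time_column, location_column,
                    value_column, population_column) %in%
                    colnames(statistical_input)))

  invisible(TRUE)
}
```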
- Variable Names - Try to make your variable names both more expressive and more explicit. Being a bit more verbose is useful and easy in the age of autocompletion! For example, instead of naming a variable `vaxcov_1718`, try naming it `vaccination_coverage_2017_18`. Similarly, `flu_res` could be named `absentee_flu_residuals`, making your code more readable and explicit.
  - For more help, check out Be Expressive: How to Give Your Variables Better Names.
- Snake_Case - Base R allows `.` in variable and function names (such as `read.csv()`), but this goes against best practices for variable naming in many other coding languages. For consistency's sake across the two main data science languages, `snake_case` has been adopted, and modern packages and functions typically use it (e.g., `readr::read_csv()`). As a very general rule of thumb, if a package you're using doesn't use `snake_case`, there may be an updated version or a more modern package that does, bringing with it the variety of performance improvements and bug fixes inherent in more mature and modern software.
  - Note: You may also see `camelCase` throughout the R code you come across. This is okay but not ideal -- try to stay consistent across all your code with `snake_case`.
  - Note: Again, there's nothing inherently wrong with using `.` in variable names; it just goes against the style best practices that are cropping up in data science, so it's worth getting rid of these habits now.
- Assignment - Please use the `=` operator instead of `<-`. Please! Similarly, in a function call, definitely use "named arguments" and separate the arguments onto their own lines to make your code more readable. Here's an example of what a function call for `calculate_KSSS()` (documented above) might look like without named arguments or any separation:

```r
input_1_KSSS_ill = calculate_KSSS(all_study_school_shapes, input_1, 5, "Local Clustering of Illness-Specific\n Absence Rates in all years during Flu Season")
```
And here it is again using the best practices we've outlined:
```r
input_1_KSSS_ill = calculate_KSSS(
  centroids = all_study_school_shapes,
  statistical_input = input_1,
  k_nearest_neighbors = 5,
  heat_map_title = "Local Clustering of Illness-Specific\n Absence Rates in all years during Flu Season"
)
```
I'll let you be the judge of which is more coherent. (Worse still, without names the positional arguments bind incorrectly: given the signature above, the `5` and the title string land on `time_column` and `location_column`, the third and fourth parameters, rather than on `k_nearest_neighbors` and `heat_map_title`.)
- The `here` package is one great R package that helps multiple collaborators deal with the mess that is working directories within an R project structure. Let's say we have an R project at the path `/home/oski/Some-R-Project`. My collaborator might clone the repository and work with it at some other path, such as `/home/bear/R-Code/Some-R-Project`. Dealing with working directories and paths explicitly can be a very large pain, and as you might imagine, setting up a config file with paths requires those paths to work flexibly for all contributors to a project. This is where the `here` package comes in, and this is a great vignette describing it; a small sketch follows.
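A minimal sketch of the idea (the data file path is hypothetical):

```r
library(here)

# here() builds paths relative to the project root (where the .Rproj file
# lives), so the same code runs for /home/oski/Some-R-Project and
# /home/bear/R-Code/Some-R-Project alike
raw_data_path = here("data", "raw", "absentee-raw.csv")
absentee_raw  = readr::read_csv(raw_data_path)
```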
- Code Autoformatting - RStudio includes a fantastic built-in utility (keyboard shortcut: `CMD-Shift-A`) for autoformatting highlighted chunks of code to fit many of the best practices listed here. It generally makes code more readable and fixes a lot of the small things you may not feel like fixing yourself. Try it out as a "first pass" on some code of yours that doesn't follow many of these best practices!
- Assignment Aligner - A cool R package allows you to very powerfully format large chunks of assignment code to be much cleaner and much more readable. Follow the linked instructions and create a keyboard shortcut of your choosing (recommendation: `CMD-Shift-Z`). Here is an example of how assignment aligning can dramatically improve code readability:
```r
# Before
OUSD_not_found_aliases = list(
  "Brookfield Village Elementary" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Brookfield"),
  "Carl Munck Elementary" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Munck"),
  "Community United Elementary School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Community United"),
  "East Oakland PRIDE Elementary" = str_subset(string = OUSD_school_shapes$schnam, pattern = "East Oakland Pride"),
  "EnCompass Academy" = str_subset(string = OUSD_school_shapes$schnam, pattern = "EnCompass"),
  "Global Family School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Global"),
  "International Community School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "International Community"),
  "Madison Park Lower Campus" = "Madison Park Academy TK-5",
  "Manzanita Community School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Manzanita Community"),
  "Martin Luther King Jr Elementary" = str_subset(string = OUSD_school_shapes$schnam, pattern = "King"),
  "PLACE @ Prescott" = "Preparatory Literary Academy of Cultural Excellence",
  "RISE Community School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Rise Community")
)

# After
OUSD_not_found_aliases = list(
  "Brookfield Village Elementary"      = str_subset(string = OUSD_school_shapes$schnam, pattern = "Brookfield"),
  "Carl Munck Elementary"              = str_subset(string = OUSD_school_shapes$schnam, pattern = "Munck"),
  "Community United Elementary School" = str_subset(string = OUSD_school_shapes$schnam, pattern = "Community United"),
  "East Oakland PRIDE Elementary"      = str_subset(string = OUSD_school_shapes$schnam, pattern = "East Oakland Pride"),
  "EnCompass Academy"                  = str_subset(string = OUSD_school_shapes$schnam, pattern = "EnCompass"),
  "Global Family School"               = str_subset(string = OUSD_school_shapes$schnam, pattern = "Global"),
  "International Community School"     = str_subset(string = OUSD_school_shapes$schnam, pattern = "International Community"),
  "Madison Park Lower Campus"          = "Madison Park Academy TK-5",
  "Manzanita Community School"         = str_subset(string = OUSD_school_shapes$schnam, pattern = "Manzanita Community"),
  "Martin Luther King Jr Elementary"   = str_subset(string = OUSD_school_shapes$schnam, pattern = "King"),
  "PLACE @ Prescott"                   = "Preparatory Literary Academy of Cultural Excellence",
  "RISE Community School"              = str_subset(string = OUSD_school_shapes$schnam, pattern = "Rise Community")
)
```
- StyleR - Another cool R package from the Tidyverse that can be powerful when used as a first pass on entire projects that need refactoring. The most useful function in the package is `style_dir()`, which styles all files within a given directory. See the function's documentation and the vignette linked above for more details, plus the sketch after these notes.
  - Note: The default Tidyverse styler is subtly different from some of the things we've advocated for in this document. Most notably, we differ with regard to the assignment operator (`<-` vs. `=`) and the number of spaces before/after "tokens" (e.g., Assignment Aligner adds spaces before `=` signs to align them properly). For this reason, we'd recommend the following: `style_dir(path = ..., scope = "line_breaks", strict = FALSE)`. You can also customize StyleR even more if you're really hardcore.
  - Note: As mentioned in the package vignette linked above, StyleR modifies things in place, meaning it overwrites your existing code and replaces it with the updated, properly styled code. This makes it a good fit for projects with version control, but if you don't have backups or a good way to revert to the initial code, I wouldn't recommend going this route.
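For instance, a cautious first pass over one directory might look like this (run it on version-controlled code so you can revert; the directory path is illustrative):

```r
library(styler)

# Restyle line breaks only, leaving assignment operators and alignment alone
style_dir(path = "2 - Analysis/", scope = "line_breaks", strict = FALSE)
```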
- `.RDS` vs `.RData` Files - One of the most common ways to load and save data in Base R is with the `load()` and `save()` functions, which serialize multiple objects into a file. The biggest problems with this practice are the inability to control the names of objects as they're loaded in, the confusion this creates when reading older code, and the inability to load individual elements of a saved file. For these reasons, we recommend the RDS format for saving R objects.
  - Note: `saveRDS()` and `readRDS()` are the Base R functions for this, but we recommend `write_rds()` and `read_rds()` from the `readr` package for the sake of consistency with the rest of the Tidyverse.
  - Note: When you use `read_rds()`, you must assign the object being loaded to a variable (i.e., `some_descriptive_variable_name = read_rds(...)`), then use the variable as you would. This makes your code less fragile: it no longer relies on an `.RData` file loading in a particular name for the code to run properly.
  - Note: If you have many related R objects you would otherwise have saved all together using `save()`, the functional equivalent with RDS is to create a (named) list containing each of these objects and save that, as sketched below.
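A small sketch of the named-list pattern (the objects and file name are hypothetical):

```r
library(readr)

model_fit       = lm(mpg ~ wt, data = mtcars)  # stand-ins for real results
model_residuals = resid(model_fit)

# Instead of save(model_fit, model_residuals, file = "results.RData"),
# bundle the related objects into a named list and save one RDS
results = list(model_fit = model_fit, model_residuals = model_residuals)
write_rds(results, "example-results.RDS")

# On load, the name is explicit -- nothing magically appears in the workspace
example_results = read_rds("example-results.RDS")
example_results$model_fit
```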
- CSVs - Once again, the `readr` package, as part of the Tidyverse, is great, with a much faster `read_csv()` than Base R's `read.csv()`. For massive CSVs (> 5 GB), you'll find `data.table::fread()` to be the fastest CSV reader in any data science language out there. For writing CSVs, `readr::write_csv()` and `data.table::fwrite()` outclass Base R's `write.csv()` by a significant margin as well; a quick sketch follows.
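To illustrate (the file paths are hypothetical):

```r
library(readr)
library(data.table)

absentee_small = read_csv("data/absentee-small.csv")  # fast, returns a tibble
absentee_huge  = fread("data/absentee-huge.csv")      # fastest, returns a data.table

write_csv(absentee_small, "data/absentee-small-copy.csv")
fwrite(absentee_huge, "data/absentee-huge-copy.csv")
```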
- Feather - If you're using both R and Python, you may wish to check out the Feather package for exchanging data between the two languages extremely quickly; a quick sketch follows.
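A minimal sketch on the R side (the file name is illustrative; on the Python side, `pandas.read_feather()` reads the same file):

```r
library(feather)

# Write a data frame that a Python collaborator can read with pandas
write_feather(mtcars, "mtcars.feather")

# And read back a feather file produced in either language
mtcars_again = read_feather("mtcars.feather")
```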
Throughout this document there have been references to the Tidyverse, but this section explicitly shows you how to transform your Base R tendencies into Tidyverse (or Data.Table, the Tidyverse's performance-optimized competitor). The Tidyverse is quickly becoming the gold standard in R data analysis, and modern data science packages and code should use Tidyverse style and packages unless there's a significant reason not to (e.g., big data pipelines that would benefit from Data.Table's performance optimizations).
The package author has published a great textbook on R for Data Science, which leans heavily on many Tidyverse packages and may be worth checking out.
The following list is not exhaustive, but is a compact overview to begin to translate Base R into something better:
| Base R | Better Style, Performance, and Utility |
| --- | --- |
| _ | _ |
| `read.csv()` | `readr::read_csv()` or `data.table::fread()` |
| `write.csv()` | `readr::write_csv()` or `data.table::fwrite()` |
| `readRDS()` | `readr::read_rds()` |
| `saveRDS()` | `readr::write_rds()` |
| _ | _ |
| `data.frame()` | `tibble::tibble()` or `data.table::data.table()` |
| `rbind()` | `dplyr::bind_rows()` |
| `cbind()` | `dplyr::bind_cols()` |
| `df$some_column` | `df %>% dplyr::pull(some_column)` |
| `df$some_column = ...` | `df %>% dplyr::mutate(some_column = ...)` |
| `df[get_rows_condition,]` | `df %>% dplyr::filter(get_rows_condition)` |
| `df[,c(col1, col2)]` | `df %>% dplyr::select(col1, col2)` |
| `merge(df1, df2, by = ..., all.x = ..., all.y = ...)` | `df1 %>% dplyr::left_join(df2, by = ...)` or `dplyr::full_join()` or `dplyr::inner_join()` or `dplyr::right_join()` |
| _ | _ |
| `str()` | `dplyr::glimpse()` |
| `grep(pattern, x)` | `stringr::str_which(string, pattern)` |
| `gsub(pattern, replacement, x)` | `stringr::str_replace(string, pattern, replacement)` |
| `ifelse(test_expression, yes, no)` | `dplyr::if_else(condition, true, false)` |
| Nested: `ifelse(test_expression1, yes1, ifelse(test_expression2, yes2, ifelse(test_expression3, yes3, no)))` | `dplyr::case_when(test_expression1 ~ yes1, test_expression2 ~ yes2, test_expression3 ~ yes3, TRUE ~ no)` |
| `proc.time()` | `tictoc::tic()` and `tictoc::toc()` |
| `stopifnot()` | `assertthat::assert_that()` or `assertthat::see_if()` or `assertthat::validate_that()` |
For a more extensive set of syntactical translations to Tidyverse, you can check out this document.
Working with the Tidyverse within functions can be somewhat of a pain due to non-standard evaluation (NSE) semantics. If you're an avid function writer, we'd recommend checking out the following resources, as well as the short sketch after this list:
- Tidy Eval in 5 Minutes (video)
- Tidy Evaluation (e-book)
- Data Frame Columns as Arguments to Dplyr Functions (blog)
- Standard Evaluation for *_join (stackoverflow)
- Programming with dplyr (package vignette)
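As a taste of the pattern these resources teach, here is a minimal sketch using the embrace operator (`{{ }}`) from `rlang`; the function and column names are hypothetical:

```r
library(dplyr)

# {{ }} forwards bare column names into dplyr verbs, sidestepping NSE headaches
summarize_by = function(df, group_col, value_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarize(mean_value = mean({{ value_col }}, na.rm = TRUE))
}

# Callers pass columns unquoted, just as with dplyr itself
mtcars %>% summarize_by(cyl, mpg)
```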
It may also be the case that you're working with very large datasets. Generally, I would define this as 10+ million rows. As outlined in this document, the 3 main players in the data analysis space are Base R, the Tidyverse (more specifically, `dplyr`), and `data.table`. For a majority of things, Base R is inferior to both `dplyr` and `data.table`, with concise but less clear syntax and less speed. `dplyr` is architected for medium and smaller data; while it's very fast for everyday usage, it trades off maximum performance for ease of use and syntax compared to `data.table`. An overview of the `dplyr` vs. `data.table` debate can be found in this stackoverflow post, and all 3 answers are worth a read.
You can also achieve a performance boost by running `dplyr` commands on `data.table`s, which I find to be the best of both worlds, given that a `data.table` is a special type of `data.frame` and is fairly easy to convert with the `as.data.table()` function. The speedup is due to `dplyr`'s use of the `data.table` backend, and in the future this coupling should become even more natural; a sketch follows.
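A minimal sketch of this pattern via the `dtplyr` package (which provides the `data.table` backend for `dplyr`; the dataset is a stand-in):

```r
library(data.table)
library(dtplyr)
library(dplyr)

large_dt = as.data.table(mtcars)  # stand-in for a genuinely large dataset

# lazy_dt() translates the dplyr pipeline into data.table operations;
# as_tibble() forces evaluation and collects the result
large_dt %>%
  lazy_dt() %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg)) %>%
  as_tibble()
```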
- For `ggplot` calls and `dplyr` pipelines, do not crowd single lines. Here are some nontrivial examples of "beautiful" pipelines, where beauty is defined by coherence:

```r
# Example 1
school_names = list(
  OUSD_school_names  = absentee_all %>% filter(dist.n == 1) %>% pull(school) %>% unique %>% sort,
  WCCSD_school_names = absentee_all %>% filter(dist.n == 0) %>% pull(school) %>% unique %>% sort
)

# Example 2
absentee_all = fread(file = raw_data_path) %>%
  mutate(program = case_when(schoolyr %in% pre_program_schoolyrs ~ 0,
                             schoolyr %in% program_schoolyrs ~ 1)) %>%
  mutate(period = case_when(schoolyr %in% pre_program_schoolyrs ~ 0,
                            schoolyr %in% LAIV_schoolyrs ~ 1,
                            schoolyr %in% IIV_schoolyrs ~ 2)) %>%
  filter(schoolyr != "2017-18")
```

And of a complex `ggplot` call:

```r
# Example 3
ggplot(data = data, mapping = aes_string(x = "year", y = "rd", group = group)) +
  geom_point(mapping = aes_string(col = group, shape = group),
             position = position_dodge(width = 0.2), size = 2.5) +
  geom_errorbar(mapping = aes_string(ymin = "lb", ymax = "ub", col = group),
                position = position_dodge(width = 0.2), width = 0.2) +
  geom_point(position = position_dodge(width = 0.2), size = 2.5) +
  geom_errorbar(mapping = aes(ymin = lb, ymax = ub),
                position = position_dodge(width = 0.2), width = 0.1) +
  scale_y_continuous(limits = limits, breaks = breaks, labels = breaks) +
  scale_color_manual(std_legend_title, values = cols, labels = legend_label) +
  scale_shape_manual(std_legend_title, values = shapes, labels = legend_label) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Program year") +
  ylab(yaxis_lab) +
  theme_complete_bw() +
  theme(strip.text.x = element_text(size = 14),
        axis.text.x = element_text(size = 12)) +
  ggtitle(title)
```

Imagine (or perhaps mournfully recall) the mess that can occur when you don't strictly style a complicated `ggplot` call. Trying to fix bugs and ensure your code is working can be a nightmare. Now imagine trying to do it with the same code 6 months after you've written it. Invest the time now and reap the rewards as the code practically explains itself, line by line.