Organizing code: best practices

Objectives

Well-organized code is important for the following tasks:

  1. Distributing analysis code for reproducibility (part of the open science movement) - e.g., submitting it with a manuscript.
  2. Writing packages and vignettes.

Organizing code for packaging typically follows strict, language-specific recommendations, so I will focus on the former (primarily for R and Python).

The points below are largely suggestions. The most important point is to document your conventions in a good README file.

Considerations

Provenance/traceability/reproducibility

  • should connect raw data → final results
  • explicitly documented vs. “self-documenting”
  • self-contained/portable

Readability

  • compact (understand immediately the overall structure of code) vs. explicit (easier to follow line-by-line)

Modularity (facilitates reuse; lack of redundancy)

  • encapsulation
  • parsimony (simplicity/lack of redundancy)

Efficient memory use (do you need to store all data in memory?)

  • in-memory matrix vs. relational database

Efficient disk use

  • not critical but don’t make multiple copies of data (see database normalization)
  • save only after substantial processing (otherwise allow functions to construct data tables that you need from the original files)
  • handle data manipulation with functions if not too computationally expensive

Speed (may or may not be relevant)

Stylistic preferences

The following README describes decisions made for a series of packages, but there are useful points to note regarding style conventions and preferred file formats:

https://gitlab.com/aprl/APRLspec/blob/master/README.md

In practice

While you will find various "project templates" on the web, they share common elements - particularly among projects within our group.

  1. data (raw, processed)
  2. reusable code (modules, functions; can later become package)
  3. reusable scripts to generate a specific type of output
  4. "one-off" code (interactive/exploratory scripts, notebooks)
  5. results

Some decisions can be argued on actual merit (improved interpretability or fewer errors), while others are stylistic preferences.

  • Use of subdirectories or file naming conventions (e.g., prefix with underscore) to group similar files?
    • depends on scale
    • affects how you look for things and how you might move files around, or move around among files (see next point)
  • Assumed working directory of scripts if they are not located in the top directory of the project? One good principle is not to change the working directory within a script unless absolutely necessary - better to reference all files with respect to a particular directory.
    • directory of the script - default for many editors
    • top directory of the project ("project root") - immune to relocation of scripts to sub-sub-directories; may make sense for scripts that are meant to be accessed via command line (in which the working directory would be the directory from which the script is called)
  • Should results of analysis go into a subdirectory of the data folder, or into a separate output folder?
    • It's possible that the output of one analysis is the input of another, so one might as well place it back in "data/processed/" in case it is used later (see next section). However, figures should likely go into a separate directory for outputs.
  • Cleaning raw files.
    • Raw files pulled from databases might have a standard script for cleaning and preparing the data.
    • Many files generated by experimentalists (i.e., containing manual entries) will contain particular inconsistencies that are not served by a general script, but rather by one specific to each file received. In this case, I create a specific cleaning script that I place in the same directory as the raw file (a minimal sketch follows this list).
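
For illustration, such a file-specific cleaning script might look like the following minimal sketch (the file names and the particular fixes are hypothetical):

## clean_labnotes_2019-03.R - corrections specific to this one raw file
## (run from the directory containing the raw file)
raw <- read.csv("labnotes_2019-03_raw.csv", stringsAsFactors = FALSE)

## manual entries used several spellings for the same condition
raw$treatment <- sub("^(ctrl|control ?)$", "control", raw$treatment)

## some dates were entered as dd.mm.yyyy instead of yyyy-mm-dd
bad <- grepl("^\\d{2}\\.\\d{2}\\.\\d{4}$", raw$date)
raw$date[bad] <- format(as.Date(raw$date[bad], "%d.%m.%Y"), "%Y-%m-%d")

write.csv(raw, "../processed/labnotes_2019-03.csv", row.names = FALSE)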

Example directory structures:

Alternative 1 (see /Volumes/GoogleDrive/Team Drives/Carbonyl/):

.
+-- data/
|   +-- raw/
|   +-- processed/
+-- library/
+-- scripts/
|   +-- build_dostuff1.R
|   +-- build_dostuff2.R
|   +-- analyze_dostuff1.R
|   +-- analyze_dostuff2.R
+-- outputs/

Alternative 2 (see https://gitlab.com/aprl/aprl-carbontypes) - this is more unconventional but is simple in that the working directory for each script is both the project root and script location:

.
+-- data/
|   +-- raw/
|   +-- processed/
+-- library/
+-- build_dostuff1.R
+-- build_dostuff2.R
+-- analyze_dostuff1.R
+-- analyze_dostuff2.R
+-- outputs/

You might choose to call "library/" something else (presumably it can become a package later if similar analyses are to be repeated) and "scripts/" can be "notebooks/" depending on your workflow.

Note that "outputs/" can also be a container for subdirectories, e.g., "run_001/", "run_002/", etc.; separate folders can be designated for figures and summary tables. Here these "runddd" subdirectories would typically contain input files (typically JSON format) and output files in the same directory; generated with non-interactive scripts that take the name of the JSON input file as the command-line argument.

Using a master file to associate labels with files (example) may be a useful convention, particularly if some files are located outside the project directory (e.g., a spectral database shared by many projects). Though the linked example is an R file, this could just as easily (and arguably better) be handled as a JSON file. If using the subdirectory structure (Alternative 1), it would be sensible to place this file in the "data/" directory rather than in "scripts/".
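
For instance, a JSON master file might look like the following (the labels and paths are hypothetical) and can be read in R with, e.g., jsonlite::fromJSON("data/filepaths.json"):

{
  "calibration": "data/raw/calibration_2018.csv",
  "samples": "data/processed/samples.csv",
  "spectra": "/shared/spectral-database/ftir/library.csv"
}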

Working directory

Determining the working directory of a script can be a non-trivial task; it can depend partly on the IDE. For Jupyter notebooks, the working directory is by default the notebook directory. Some discussion on this for R can be found here.

To access files in other directories, it is important to refer to files from the working directory rather than changing the working directory with os.chdir, setwd, or cd. For portability, use shell/OS utilities - e.g., os.path.join (python), file.path (R), or fullfile (MATLAB) - rather than string operations to construct file paths. E.g., file.path("a", "b", "c") rather than paste(sep="/", "a", "b", "c") or "a/b/c". Technically, the forward slash is accepted as a path separator on Mac, Linux, and Windows, so as a practical matter the last example will usually work.

The project root should be used as the reference from which other files are accessed. When executing a script (e.g., Rscript script/script1.R or python script/script1.py), the project root might already be the working directory. When working with a script interactively, it is likely that the script directory is the working directory (unless using an .Rproj file in RStudio).

For instance, with R you can use the following convention:

## set working directory ("." if called from Rscript at project root; ".." if invoked interactively from console)
projroot <- if(sys.nframe() == 0L) "." else ".."

## input example
data <- read.csv(file.path(projroot, "data", "processed", "inpfile.csv"))

## output example
write.csv(data, file.path(projroot, "data", "processed", "outfile.csv"), row.names = FALSE)

Some useful tools:

  • check whether a file is being run as a script
    • R: if(sys.nframe() == 0L)
    • python: if __name__ == "__main__" (more commonly used to execute code only when called as a script and not imported as a module)
  • find path of current script
    • R: sys.frame(1)$ofile
    • python: __file__
  • libraries to navigate paths:
    • R: here (constructs paths relative to the project root), fs
    • python: pathlib, os.path
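
For the Rscript case, the script path can also be recovered from the command-line arguments; a minimal sketch (relying on the "--file=" argument that Rscript passes):

## directory of the current script when launched via Rscript
script_dir <- function() {
  args <- commandArgs(trailingOnly = FALSE)
  f <- sub("^--file=", "", grep("^--file=", args, value = TRUE))
  if (length(f) > 0) dirname(f) else "."   # fall back to working directory
}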

Functions and variables

Functions should generally accept a (small) number of arguments and return a value.

  • If returning a sequence, the output can be directly assigned to multiple outputs using tuple unpacking (python), deal (MATLAB), or a custom function in R (example; a sketch of such a function follows this list).
  • Input/output (I/O) and plotting operations should be contained within their own functions, separated from related computations - e.g., don't combine computation with export in the same function. A "wrapper" function can conveniently chain several related calculations with plotting etc., but should be composed of modular units that are reusable and easily understood; a second sketch below illustrates this separation.
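
A custom multi-assignment function in R might look like this minimal sketch (one possible implementation; the linked example is not reproduced here):

## assign each element of a list to a named variable in the calling frame
massign <- function(names, values, envir = parent.frame()) {
  stopifnot(length(names) == length(values))
  for (i in seq_along(names)) assign(names[i], values[[i]], envir = envir)
}

## usage
range_stats <- function(x) list(min(x), max(x))
massign(c("lo", "hi"), range_stats(c(3, 1, 4, 1, 5)))   # lo == 1, hi == 5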
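
To illustrate the separation of computation, plotting, and export, a sketch with hypothetical function names:

## computation only - no I/O
summarize_runs <- function(data) aggregate(value ~ run, data, mean)

## plotting only
plot_runs <- function(stats) barplot(stats$value, names.arg = stats$run)

## export only
export_runs <- function(stats, outfile) write.csv(stats, outfile, row.names = FALSE)

## thin wrapper composing the modular, reusable units
report_runs <- function(data, outfile) {
  stats <- summarize_runs(data)
  plot_runs(stats)
  export_runs(stats, outfile)
  invisible(stats)
}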

Variable names should be informative, and modifications of their values should be made sparingly.