Our R Code Style Guide
Our baseline is the tidyverse style guide. This chapter covers where we differ from it or add emphasis to parts of it. For anything not mentioned here, defer to tidyverse conventions. If you’re new to R or to our lab, read the tidyverse style guide first, then come back here for the lab-specific deviations.
File Organization
File Naming
We name R scripts with zero-padded numeric prefixes and a slug. At a glance, I should know (1) the order to run the code files in and (2) what each file does, without opening it. I prefer underscores, but hyphens are also fine.
Good:
01_get_data.R
02_clean_data.R
03_analyze_models.R
10_fig_main_results.R
11_fig_sensitivity.R
99_utils.R
Bad:
get-data.R
getData.R
step1.R
fig1.R
clean data.R
Zero-padding matters. Prefixes 01_ through 09_ keep your file explorer sorted in run order once you have more than 9 files.
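To see why the padding matters, here is a small sketch in base R. The steps and slugs are made up for illustration, and `method = "radix"` is used only so the sort order does not depend on your locale.

```r
## Hypothetical steps and slugs, for illustration only
step_vec <- c(1L, 2L, 10L)
slug_vec <- c("get_data", "clean_data", "fig_main_results")

## Without zero-padding, step 10 sorts ahead of steps 1 and 2
unpadded_vec <- sprintf("%d_%s.R", step_vec, slug_vec)
sorted_unpadded_vec <- sort(unpadded_vec, method = "radix")

## With zero-padding, alphabetical order matches run order
padded_vec <- sprintf("%02d_%s.R", step_vec, slug_vec)
sorted_padded_vec <- sort(padded_vec, method = "radix")

print(sorted_unpadded_vec)
print(sorted_padded_vec)
```

Most file explorers sort names as text in a similar way, which is why the padded names line up with the order you run them in.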
File Headers
For code files that are not self-evident, use a document header like this at the top:
## 03_analyze_models.R ----
##
## Fit Bayesian hierarchical models for state-level mortality trends.
## Outputs: ./data/model_fits.RDS
## Imports ----
library(brms)
That is, repeat the file name up top, give a brief description of what the file does, and state the expected output.
We use the ## Section Name ---- convention (i.e., a double hash and four trailing dashes) because it creates a navigable outline in RStudio [1], so you can quickly jump between parts of your code.
For files that are obvious or self-documenting (e.g., a file that produces the figure 1 plot), you can skip the header.
Standard File Sections
Not every file needs every section, but when present, use this order:
| Order | Section | What goes here |
|---|---|---|
| 1 | ## Imports | library() calls |
| 2 | ## Constants | UPPERCASE configuration values |
| 3 | ## Helper functions | Small utility functions used only in this file |
| 4 | ## Infrastructure | Paths, connections, setup code |
| 5 | ## Data | Load the data you need (usually produced by a previous file) |
| 6 | ## Processing | The main work of the file |
| 7 | ## Save/Export | saveRDS(), write_csv(), ggsave() calls |
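Put together, a file that uses every section might look like the sketch below. The file name, constant, helper, and the built-in `mtcars` data are placeholders, and `tempdir()` stands in for the `here::here()` paths a real project would use so that the sketch runs anywhere.

```r
## 02_clean_data.R ----
##
## Hypothetical skeleton showing the standard section order.
## Outputs: a cleaned .RDS file (written to tempdir() in this sketch)

## Imports ----
library(datasets)

## Constants ----
MIN_MPG <- 20

## Helper functions ----
flag_efficient <- function(mpg_vec, cutoff = MIN_MPG) {
    mpg_vec >= cutoff
}

## Infrastructure ----
out_path <- file.path(tempdir(), "clean_cars.RDS")

## Data ----
raw_df <- datasets::mtcars

## Processing ----
clean_df <- raw_df[flag_efficient(raw_df$mpg), c("mpg", "cyl", "wt")]

## Save/Export ----
saveRDS(clean_df, out_path)
```

Sections that a file does not need are simply left out; the remaining ones keep their relative order.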
Formatting
Indentation: 4 Spaces
This is our biggest deviation from the tidyverse style guide, which recommends 2 spaces. We use 4. Always use 4.
In RStudio: Tools → Global Options → Code → Editing → set “Tab width” to 4. Also check “Insert spaces for tab” and set “Number of spaces for tab” to 4. Under Display, set “Margin column” to 80.
If you’re using VS Code or Positron, set editor.tabSize to 4 for R files.
Four spaces make nested code more readable, especially in long {dplyr} pipelines and {ggplot2} chains. Yes, it uses more horizontal space. That’s a tradeoff we accept. [2]
## Good: 4-space indentation
mortality_df <- raw_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::mutate(
        rate = deaths / population * 100000,
        log_rate = log(rate)
    ) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(
        mean_rate = mean(rate, na.rm = TRUE),
        .groups = "drop"
    )
## Bad: 2-space indentation
mortality_df <- raw_df |>
  dplyr::filter(year >= 2010) |>
  dplyr::mutate(
    rate = deaths / population * 100000,
    log_rate = log(rate)
  ) |>
  dplyr::group_by(state, year) |>
  dplyr::summarize(
    mean_rate = mean(rate, na.rm = TRUE),
    .groups = "drop"
  )
Braces: 1TBS
In this lab, we use the one true brace style. Always.
Opening brace goes on the same line. else and else if go on the same line as the closing brace. This matches tidyverse.
if (n_cores > 1) {
    furrr::future_map(x, process_state)
} else {
    purrr::map(x, process_state)
}
Line Length
Keep lines under 80 characters. Break long pipelines after the pipe operator, and break long function calls after a comma.
## Long pipeline: break after each pipe
result_df <- input_df |>
    dplyr::filter(age_group != "Unknown") |>
    dplyr::left_join(population_df, by = c("state", "year")) |>
    dplyr::mutate(rate_per_100k = deaths / population * 100000)
## Long function call: break after commas
model_fit <- brms::brm(
    formula = deaths ~ age_group + (1 | state),
    data = model_df,
    ## poisson() is exported by {stats}, not {brms}
    family = stats::poisson(),
    cores = N_CORES,
    seed = R_SEED
)
Spacing
Spaces around <-, ==, +, -, *, /. No space before ( in function calls, but a space before ( in control flow. This matches tidyverse — just a reminder.
## Good
x <- mean(y, na.rm = TRUE)
if (x > 0) {
    log(x)
}
## Bad
x<-mean (y,na.rm=TRUE)
if(x > 0){
    log( x )
}
TRUE / FALSE
Use TRUE and FALSE rather than T and F. Again, I’m old and this helps me quickly scan your code.
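Beyond readability, there is a technical reason that the text above does not spell out: `TRUE` and `FALSE` are reserved words, while `T` and `F` are ordinary variables that can be reassigned. A minimal sketch in base R:

```r
## T is just a variable bound to TRUE by default, so it can be overwritten
T <- 0
t_is_true <- isTRUE(T)

## TRUE is a reserved word: trying to assign to it is an error
assign_error <- tryCatch(
    eval(parse(text = "TRUE <- 0")),
    error = function(e) conditionMessage(e)
)

print(t_is_true)
print(assign_error)

## Clean up so T falls back to its usual value
rm(T)
```

Code that relies on `T` or `F` can therefore change behavior if some earlier line reuses those names.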
Naming
Variables and Functions
snake_case for both, matching tidyverse. For functions, use a verb-first naming convention that describes the action. Here are examples of common function prefixes I use:
| Prefix | When to use | Example |
|---|---|---|
| calculate_ | Returns a computed value | calculate_age_adjusted_rate() |
| return_ | Retrieves or constructs a specific object | return_state_fips() |
| flag_ | Returns a logical vector or indicator | flag_outliers() |
| recode_ | Transforms categories or values | recode_race_ethnicity() |
| categorize_ | Converts a string column into a factor | categorize_race_ethnicity() |
| get_ | Fetches data from an external source | get_census_data() |
| plot_ | Creates a ggplot object | plot_trend_lines() |
Constants
UPPERCASE_SNAKE_CASE for values set once and used throughout a file. These typically go in the ## Constants ---- section near the top.
## Constants ----
N_CORES <- 8
R_SEED <- 8675309
DATA_PATH <- here::here("data")
YEAR_START <- 2003
YEAR_END <- 2019
MIN_CELL_SIZE <- 10
Variable Suffixes
We recommend typed suffixes on variable names so you know what you’re looking at when reading code. This isn’t a strict rule, but it prevents a lot of confusion — especially when a project has dozens of data frames floating around.
| Suffix | Type | Example |
|---|---|---|
| _df | Data frame / tibble | mortality_df, raw_df |
| _cat | Categorical / factor (usually for column names inside a tibble) | race_cat, age_cat |
| _str | Character string (usually for column names inside a tibble) | query_str, title_str |
| _x | Loop iterator | state_x, year_x |
| _num | Numeric scalar or vector (usually for column names inside a tibble) | n_obs_num, threshold_num |
| _list | List object | model_list, results_list |
| _vec | Atomic vector | fips_vec, years_vec |
| _p | ggplot object | main_p, trend_p |
## Good
results_list <- list()
for (state_x in unique(mortality_df$state)) {
    subset_df <- mortality_df |>
        dplyr::filter(state == state_x)
    model_x <- fit_model(subset_df)
    results_list[[state_x]] <- model_x
}
## Bad
results <- list()
for (s in unique(data$state)) {
    temp <- data |>
        dplyr::filter(state == s)
    m <- fit_model(temp)
    results[[s]] <- m
}
The suffixed version is longer, but three months from now you’ll know exactly what mortality_df is. You won’t know what data or temp are. [3]
Syntax
Assignment
Always <-. Never = for assignment. Never ->.
## Good
x <- 10
## Bad
x = 10
10 -> x
Pipes
Both |> (the base R pipe, available in R 4.1.0 and later) and %>% (the {magrittr} pipe) are acceptable. We prefer |> for new code because it has no dependencies and is slightly faster. Existing code using %>% doesn’t need to be converted.
Break after the pipe operator. Indent continuation lines 4 spaces.
result_df <- raw_df |>
    dplyr::filter(!is.na(outcome)) |>
    dplyr::mutate(
        rate = count / population * 100000
    ) |>
    dplyr::arrange(year, state)
Strings
Double quotes are the default for strings. Use single quotes when you need double quotes as part of the string. sprintf() is the default for string formatting; glue::glue() is also fine.
## Good
message(sprintf("Processing %s: %d records", state_x, nrow(subset_df)))
## Or with glue
message(glue::glue("Processing {state_x}: {nrow(subset_df)} records"))
## Bad
message(paste0('Processing ', state_x, ': ', nrow(subset_df), ' records'))
return()
The tidyverse style guide reserves return() for early returns and otherwise relies on implicit returns. We’re more flexible: both implicit and explicit return() are acceptable. The important thing is consistency within a file.
Use explicit return() for early returns, where the function bails out of the normal flow. Use implicit returns for the final value at the end of a function, if you prefer that style.
calculate_rate <- function(deaths, population) {
    ## Early return: explicit
    if (population == 0) {
        return(NA_real_)
    }
    ## Final value: implicit is fine
    deaths / population * 100000
}
Function Arguments
Two alignment styles are acceptable. Pick one per file and stick with it.
## Style 1: Align with opening parenthesis
model_fit <- brms::brm(formula = deaths ~ year + (1 | state),
                       data = model_df,
                       family = stats::poisson(),
                       cores = N_CORES)
## Style 2: 4-space continuation indent
model_fit <- brms::brm(
    formula = deaths ~ year + (1 | state),
    data = model_df,
    family = stats::poisson(),
    cores = N_CORES
)
I slightly prefer Style 2 because it’s more readable when function names are long and it keeps diffs cleaner. That said, Style 1 is fine.
Namespace Prefixing
Explicit namespace prefixing is the lab standard. Always use package::function() syntax — even after calling library(). Note, however, that I usually do this after I’m done with the coding. You can use the {prefixer} package to insert these for you.
## Good
library(dplyr)
library(tidyr)
result_df <- raw_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::mutate(rate = deaths / pop * 100000) |>
    tidyr::pivot_longer(
        cols = dplyr::starts_with("age_"),
        names_to = "age_group",
        values_to = "count"
    )
## Bad
library(dplyr)
library(tidyr)
result_df <- raw_df |>
    filter(year >= 2010) |>
    mutate(rate = deaths / pop * 100000) |>
    pivot_longer(
        cols = starts_with("age_"),
        names_to = "age_group",
        values_to = "count"
    )
Three reasons. First, reproducibility: when someone reads your code, they know exactly which package every function comes from without scanning library() calls at the top. Second, it prevents masking conflicts: dplyr::filter() and stats::filter() do very different things, and silent masking has caused real bugs. Third, it makes dependencies explicit when you read code mid-file, which is most of the time.
Lastly, it is low-cost. Again, I add the prefixes with the {prefixer} package after I am done coding a file.
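The masking point is easy to demonstrate. In a session where {dplyr} is not attached, a bare `filter()` resolves to `stats::filter()`, which applies a linear filter to a numeric series instead of subsetting rows. A sketch using only base R (the vector is made up):

```r
## Hypothetical yearly counts
count_vec <- c(1, 2, 3, 4, 5)

## stats::filter() computes a centered 3-point moving average here;
## it has nothing to do with subsetting rows of a data frame
smoothed_ts <- stats::filter(count_vec, rep(1 / 3, 3))
smoothed_vec <- as.numeric(smoothed_ts)

print(smoothed_vec)
```

With the prefix written out, there is no doubt about which of the two functions runs, no matter which packages happen to be attached.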
Tidyverse Patterns
Data Manipulation
In general, we use {dplyr} for data manipulation and tibble::tibble() over data.frame(). Base R subsetting is fine for quick one-off operations, but pipelines should use {dplyr}. This isn’t because I feel strongly that {tidyverse} is best, but because if we all use the same set of tools, it makes code review a lot more reliable (and faster). That said, we’re pragmatic about this and you should use the most appropriate tool for the job.
Iteration
purrr::map_*() over lapply() and sapply(). The type-stable variants (purrr::map_dbl(), purrr::map_chr(), purrr::map_dfr()) prevent silent type coercion.
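To make the coercion problem concrete, the sketch below uses only base R so that it runs without extra packages: `vapply()` stands in for `purrr::map_dbl()`, which enforces its declared output type in a similar way. The input list is made up.

```r
## Hypothetical input where one element is accidentally a string
value_list <- list(1.5, "2.5", 3.5)

## sapply() silently coerces everything to character
sapply_vec <- sapply(value_list, function(value_x) value_x)
print(class(sapply_vec))

## A type-checked call stops with an error instead of guessing
vapply_result <- tryCatch(
    vapply(value_list, function(value_x) value_x, FUN.VALUE = numeric(1)),
    error = function(e) conditionMessage(e)
)
print(vapply_result)
```

The silent version hands a character vector to whatever comes next; the type-checked version fails at the line where the bad value first appears.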
## Good
state_results_df <- purrr::map_dfr(
    state_vec,
    ~ fit_model(.x, data = mortality_df)
)
Conditionals
dplyr::case_when() for multi-branch conditions. dplyr::if_else() over base ifelse() — it’s type-stable and faster.
## Good
mortality_df <- mortality_df |>
    dplyr::mutate(
        age_cat = dplyr::case_when(
            age < 18 ~ "Under 18",
            age < 65 ~ "18-64",
            age >= 65 ~ "65+",
            .default = "Unknown"
        )
    )
## Bad
mortality_df$age_cat <- ifelse(
    mortality_df$age < 18, "Under 18",
    ifelse(mortality_df$age < 65, "18-64",
        ifelse(mortality_df$age >= 65, "65+", "Unknown"))
)
File Paths
here::here() always. Never hardcode absolute paths. Never use relative paths like ../../data/.
## Good
raw_df <- readr::read_csv(here::here("data", "raw", "mortality_2020.csv"))
## Bad
raw_df <- readr::read_csv("/Users/mkiang/projects/mortality/data/raw/mortality_2020.csv")
raw_df <- readr::read_csv("../../data/raw/mortality_2020.csv")
Filesystem Operations
fs::dir_create() and fs::file_exists() over their base R equivalents. The {fs} package is more consistent and cross-platform.
## Good
fs::dir_create(here::here("output", "figures"))
## Also fine but less consistent
dir.create(
    here::here("output", "figures"),
    recursive = TRUE,
    showWarnings = FALSE
)
That said, we often encounter issues with {fs} that the base equivalents don’t have. In these edge cases, use the base equivalents. (For example, fs::dir_ls() really struggles with folders that hold hundreds of thousands of files.)
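For those edge cases, a sketch of the base R fallbacks is below. The folder and file names are placeholders, and `tempdir()` is used in place of `here::here()` only so that the sketch is self-contained.

```r
## Placeholder output folder inside a temporary directory
fig_dir <- file.path(tempdir(), "output", "figures")

## Base equivalent of fs::dir_create()
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)

## Write two empty placeholder files so there is something to list
invisible(file.create(file.path(fig_dir, c("fig_01.pdf", "fig_02.pdf"))))

## Base equivalent of fs::file_exists()
has_fig_01 <- file.exists(file.path(fig_dir, "fig_01.pdf"))

## Base equivalent of fs::dir_ls()
fig_vec <- list.files(fig_dir, pattern = "\\.pdf$", full.names = TRUE)

print(has_fig_01)
print(basename(fig_vec))
```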
ggplot2
Layer Formatting
Each {ggplot2} layer gets its own line. The + goes at the end of the preceding line, not at the start of the next. Indent each layer 4 spaces from the ggplot() call.
## Good
main_p <- ggplot2::ggplot(
    mortality_df,
    ggplot2::aes(x = year, y = rate, color = state)
) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::geom_point(size = 1.5) +
    ggplot2::scale_x_continuous(breaks = seq(2000, 2020, 5)) +
    ggplot2::scale_color_manual(values = STATE_COLORS) +
    ggplot2::labs(
        x = "Year",
        y = "Mortality rate (per 100,000)",
        color = "State"
    ) +
    ggplot2::theme_minimal(base_size = 14)
## Bad
main_p <- ggplot(mortality_df, aes(x = year, y = rate, color = state)) + geom_line(linewidth = 0.8) + geom_point(size = 1.5) + scale_x_continuous(breaks = seq(2000, 2020, 5)) + labs(x = "Year", y = "Mortality rate (per 100,000)")
Namespace Prefixing in ggplot2
The lab standard applies to {ggplot2} too. Use ggplot2::ggplot(), ggplot2::aes(), ggplot2::geom_*(), and so on. Again, we do this after we’ve already done the coding, and we use {prefixer} to do it so it should take you very little time.
Aesthetics
When aes() has more than 2 mappings, put each argument on its own line.
## Compact: 2 or fewer mappings
ggplot2::aes(x = year, y = rate)
## Expanded: 3 or more mappings
ggplot2::aes(
    x = year,
    y = rate,
    color = race_cat,
    linetype = sex_cat
)
Saving Figures
Always specify width, height, and dpi (for raster) or device (for vector). Save both a PDF for the journal and a high-DPI JPG for presentations and preprints.
## PDF for journal submission
ggplot2::ggsave(
    here::here("output", "figures", "fig_01_main_results.pdf"),
    plot = main_p,
    device = grDevices::cairo_pdf,
    width = 8,
    height = 6
)
## High-DPI jpg for presentations
ggplot2::ggsave(
    here::here("output", "figures", "fig_01_main_results.jpg"),
    plot = main_p,
    width = 8,
    height = 6,
    dpi = 300
)
Piping into ggplot
I have a strong preference for not piping data (especially after performing some manipulation on the data frame) into {ggplot2}. It’s better to save the manipulated data as a separate object and pass that object to ggplot2::ggplot() directly. For publications, every figure should also have its underlying numbers saved to a file (i.e., a csv), so having a separate object is helpful.
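A sketch of the preferred pattern, under some assumptions: `mortality_df` below is a tiny made-up data frame, the object and file names are placeholders, and `tempdir()` stands in for a `here::here()` path so that the example runs on its own. It needs {tibble}, {dplyr}, {readr}, and {ggplot2}.

```r
## Made-up data for illustration
mortality_df <- tibble::tibble(
    state = rep(c("CA", "NY"), each = 4),
    year = rep(c(2009, 2010, 2010, 2011), times = 2),
    rate = c(810, 800, 790, 780, 760, 750, 745, 740)
)

## Do the manipulation once and keep the result as its own object
plot_df <- mortality_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(mean_rate = mean(rate), .groups = "drop")

## Save the numbers behind the figure
csv_path <- file.path(tempdir(), "fig_01_main_results.csv")
readr::write_csv(plot_df, csv_path)

## Plot from the saved object rather than from a pipeline
main_p <- ggplot2::ggplot(
    plot_df,
    ggplot2::aes(x = year, y = mean_rate, color = state)
) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::labs(
        x = "Year",
        y = "Mean mortality rate (per 100,000)"
    ) +
    ggplot2::theme_minimal()
```

Because `plot_df` exists on its own, the csv and the figure are guaranteed to come from the same numbers.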
## Bad: manipulated data piped straight into the plot
mortality_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(mean_rate = mean(rate), .groups = "drop") |>
    ggplot2::ggplot(ggplot2::aes(x = year, y = mean_rate, color = state)) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::labs(
        x = "Year",
        y = "Mean mortality rate (per 100,000)"
    ) +
    ggplot2::theme_minimal()
Faceting
Put facet_wrap() and facet_grid() arguments on separate lines when they’re complex. Use labeller for human-readable facet labels.
main_p +
    ggplot2::facet_wrap(
        ~ age_cat,
        ncol = 3,
        scales = "free_y",
        labeller = ggplot2::labeller(
            age_cat = c(
                "under_18" = "Under 18",
                "18_64" = "18–64",
                "65_plus" = "65+"
            )
        )
    )
Documentation
Roxygen2
For important functions, add {roxygen2}-style documentation with #'.
You don’t need to document every internal helper, but crucial functions that will need to be checked by somebody else should have {roxygen2}-style documentation.
#' Calculate age-adjusted mortality rate
#'
#' @param deaths_df Data frame with columns `age_group`, `deaths`, `population`
#' @param standard_df Data frame with columns `age_group`, `weight`
#' @returns A single numeric value: the age-adjusted rate per 100,000
calculate_age_adjusted_rate <- function(deaths_df, standard_df) {
    merged_df <- dplyr::left_join(deaths_df, standard_df, by = "age_group")
    sum(merged_df$deaths / merged_df$population * merged_df$weight) * 100000
}
Resources
- Tidyverse style guide: Our baseline. Read this first.
- Wilson et al. “Good enough practices in scientific computing”: Broader principles that inform how we organize projects.
- {lintr}: Static analysis for R code. Catches common style violations automatically.
- {styler}: Automatic code formatting. Useful for bulk reformatting, but double-check its output against our conventions (it defaults to 2-space indent).
- {prefixer}: Helper for adding package prefixes to functions.
[1] In RStudio, go to the document outline pane (top right of the editor, or Ctrl/Cmd+Shift+O) to see all your sections. This is why the ---- convention matters: RStudio recognizes it.
[2] I know people have strong feelings about indentation. Two spaces is fine in other contexts. In this lab, it’s four. I’m old and it helps me with reading your code. This isn’t a hill worth dying on, so just set your editor and forget about it.
[3] Your most frequent collaborator is future-you, and future-you has no idea what past-you was thinking.
Comments
Comments explain why, not what. If your code needs a comment to explain what it does, the code might be too clever.
Use ## for section-level comments and # for inline comments. Inline comments go on their own line above the code they describe, not at the end of a line.
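A short sketch of these rules in practice. The function, cutoff, and suppression rationale are hypothetical and only illustrate where comments go and what they say.

```r
## Helper functions ----

flag_small_cells <- function(count_vec, min_cell_size = 10) {
    # Counts below the cutoff are suppressed before sharing results,
    # so flag them here instead of dropping them silently
    count_vec < min_cell_size
}

## Processing ----

# Bad (end-of-line comment that restates the code):
# small_vec <- flag_small_cells(c(3, 25, 9))  # flag small cells

# Good: the comment sits above the code and explains why.
# NA counts stay NA so missing data are not mistaken for small cells
small_vec <- flag_small_cells(c(3, 25, NA))

print(small_vec)
```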