Our R Code Style Guide

Our baseline is the tidyverse style guide. This chapter covers where we differ from it or adds emphasis to parts of it. For anything not mentioned here, defer to tidyverse conventions. If you’re new to R or to our lab, read the tidyverse style guide first, then come back here for the lab-specific deviations.

File Organization

File Naming

We name R scripts with zero-padded numeric prefixes and a slug. At a glance, I should know (1) the order to run the code files in and (2) what each code file does without opening it. I prefer underscores, but hyphens are also fine.

Good:

01_get_data.R
02_clean_data.R
03_analyze_models.R
10_fig_main_results.R
11_fig_sensitivity.R
99_utils.R

Bad:

get-data.R
getData.R
step1.R
fig1.R
clean data.R

Zero-padding matters. In a plain alphabetical listing, 10_ sorts before 2_, so using 01_ through 09_ keeps your files in run order once you have more than 9 of them.
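To see the sorting problem concretely, here is a minimal sketch in base R. The step numbers and slugs are made up for illustration, and method = "radix" is used only so the sort order does not depend on your locale.

```r
## Hypothetical step numbers and slugs, for illustration only
step_vec <- c(1L, 2L, 10L)
slug_vec <- c("get_data", "clean_data", "fig_main_results")

## Build file names without and with zero-padding
unpadded_vec <- sprintf("%d_%s.R", step_vec, slug_vec)
padded_vec <- sprintf("%02d_%s.R", step_vec, slug_vec)

## Radix sorting always uses C-locale (byte) order
sorted_unpadded_vec <- sort(unpadded_vec, method = "radix")
sorted_padded_vec <- sort(padded_vec, method = "radix")
```

Here sorted_unpadded_vec puts 10_fig_main_results.R first, while sorted_padded_vec stays in run order.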

File Headers

For code files that are not self-evident, use a document header like this at the top:

## 03_analyze_models.R ----
##
## Fit Bayesian hierarchical models for state-level mortality trends.
## Outputs: ./data/model_fits.RDS

## Imports ----
library(brms)

That is, repeat the file name up top, briefly describe what the file does, and state the expected output.

We use the ## Section Name ---- convention (i.e., the double hash and four trailing dashes) because it creates a navigable outline in RStudio[1], so you can quickly jump between parts of your code.

For files that are obvious or self-documenting (e.g., a file that produces the figure 1 plot), you can skip the header.

Standard File Sections

Not every file needs every section, but when present, use this order:

| Order | Section | What goes here |
|---|---|---|
| 1 | ## Imports | library() calls |
| 2 | ## Constants | UPPERCASE configuration values |
| 3 | ## Helper functions | Small utility functions used only in this file |
| 4 | ## Infrastructure | Paths, connections, setup code |
| 5 | ## Data | Call in the data you need (usually produced from a previous file) |
| 6 | ## Processing | The main work of the file |
| 7 | ## Save/Export | saveRDS(), write_csv(), ggsave() calls |
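
As a rough illustration of that order, here is a sketch of a complete (if tiny) file. Everything in it is hypothetical: the file name, the flag_small_cells() helper, and the use of the built-in mtcars data. It sticks to base R and writes to tempdir() only so it runs anywhere; a real project file would use {dplyr} and here::here() as described elsewhere in this chapter.

```r
## 02_summarize_cars.R ----
##
## Hypothetical example: mean mpg and cell sizes by cylinder count.
## Outputs: cyl_summary.RDS (in a temporary folder, for illustration)

## Imports ----
## {stats} ships with R; it stands in for your real imports
library(stats)

## Constants ----
MIN_CELL_SIZE <- 10

## Helper functions ----
flag_small_cells <- function(n_vec, min_size = MIN_CELL_SIZE) {
    n_vec < min_size
}

## Infrastructure ----
out_path <- file.path(tempdir(), "cyl_summary.RDS")

## Data ----
cars_df <- datasets::mtcars

## Processing ----
mean_df <- stats::aggregate(mpg ~ cyl, data = cars_df, FUN = mean)
count_df <- stats::aggregate(mpg ~ cyl, data = cars_df, FUN = length)
summary_df <- data.frame(
    cyl = mean_df$cyl,
    mean_mpg = mean_df$mpg,
    n_obs = count_df$mpg
)
summary_df$small_cell <- flag_small_cells(summary_df$n_obs)

## Save/Export ----
saveRDS(summary_df, out_path)
```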

Formatting

Indentation: 4 Spaces

This is our biggest deviation from the tidyverse style guide, which recommends 2 spaces. We use 4. Always use 4.

Important: Configure your editor

In RStudio: Tools → Global Options → Code → Editing → set “Tab width” to 4. Also check “Insert spaces for tab” and set “Number of spaces for tab” to 4. Under Display, set “Margin column” to 80.

If you’re using VS Code or Positron, set editor.tabSize to 4 for R files.

Four-space indentation makes nested code more readable, especially in long {dplyr} pipelines and {ggplot2} chains. Yes, it uses more horizontal space. That’s a tradeoff we accept.[2]

## Good: 4 spaces
mortality_df <- raw_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::mutate(
        rate = deaths / population * 100000,
        log_rate = log(rate)
    ) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(
        mean_rate = mean(rate, na.rm = TRUE),
        .groups = "drop"
    )

## Bad: 2 spaces
mortality_df <- raw_df |>
  dplyr::filter(year >= 2010) |>
  dplyr::mutate(
    rate = deaths / population * 100000,
    log_rate = log(rate)
  ) |>
  dplyr::group_by(state, year) |>
  dplyr::summarize(
    mean_rate = mean(rate, na.rm = TRUE),
    .groups = "drop"
  )

Braces: 1TBS

In this lab, we use the one true brace style. Always.

Opening brace goes on the same line. else and else if go on the same line as the closing brace. This matches tidyverse.

if (n_cores > 1) {
    furrr::future_map(x, process_state)
} else {
    purrr::map(x, process_state)
}

Line Length

Keep lines under 80 characters. Break long pipelines after the pipe operator, and break long function calls after a comma.

## Long pipeline — break after each pipe
result_df <- input_df |>
    dplyr::filter(age_group != "Unknown") |>
    dplyr::left_join(population_df, by = c("state", "year")) |>
    dplyr::mutate(rate_per_100k = deaths / population * 100000)

## Long function call — break after commas
model_fit <- brms::brm(
    formula = deaths ~ age_group + (1 | state),
    data = model_df,
    family = stats::poisson(),
    cores = N_CORES,
    seed = R_SEED
)

Spacing

Spaces around <-, ==, +, -, *, /, and the = in function arguments, plus a space after every comma. No space before ( in function calls, but a space before ( in control flow and before an opening {. This matches tidyverse; it’s just a reminder.

## Good
x <- mean(y, na.rm = TRUE)
if (x > 0) {
    log(x)
}

## Bad
x<-mean (y,na.rm=TRUE)
if(x > 0){
    log( x )
}

TRUE / FALSE

Use TRUE and FALSE rather than T and F. Again, I’m old and this helps me quickly scan your code.
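
There is also a practical reason beyond readability: T and F are ordinary variables, while TRUE and FALSE are reserved words. A minimal base R sketch (the variable names are just for illustration):

```r
## T is an ordinary variable, so nothing stops you from reassigning it
T <- 0
t_is_true <- isTRUE(T)

## TRUE is a reserved word, so the same assignment fails with an error
true_msg <- tryCatch(
    eval(parse(text = "TRUE <- 0")),
    error = function(e) conditionMessage(e)
)

## Remove our T so the built-in one is visible again
rm(T)
```

After the reassignment, t_is_true is FALSE, and true_msg holds the error message from the failed assignment.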

Naming

Variables and Functions

snake_case for both, matching tidyverse. For functions, use a verb-first naming convention that describes the action. Here are examples of common function prefixes I use:

| Prefix | When to use | Example |
|---|---|---|
| calculate_ | Returns a computed value | calculate_age_adjusted_rate() |
| return_ | Retrieves or constructs a specific object | return_state_fips() |
| flag_ | Returns a logical vector or indicator | flag_outliers() |
| recode_ | Transforms categories or values | recode_race_ethnicity() |
| categorize_ | Converts a string column into a factor | categorize_race_ethnicity() |
| get_ | Fetches data from an external source | get_census_data() |
| plot_ | Creates a ggplot object | plot_trend_lines() |

Constants

UPPERCASE_SNAKE_CASE for values set once and used throughout a file. These typically go in the ## Constants ---- section near the top.

## Constants ----
N_CORES <- 8
R_SEED <- 8675309
DATA_PATH <- here::here("data")
YEAR_START <- 2003
YEAR_END <- 2019
MIN_CELL_SIZE <- 10

Variable Suffixes

We recommend typed suffixes on variable names so you know what you’re looking at when reading code. This isn’t a strict rule, but it prevents a lot of confusion — especially when a project has dozens of data frames floating around.

| Suffix | Type | Example |
|---|---|---|
| _df | Data frame / tibble | mortality_df, raw_df |
| _cat | Categorical / factor (usually for column names inside a tibble) | race_cat, age_cat |
| _str | Character string (usually for column names inside a tibble) | query_str, title_str |
| _x | Loop iterator | state_x, year_x |
| _num | Numeric scalar or vector (usually for column names inside a tibble) | n_obs_num, threshold_num |
| _list | List object | model_list, results_list |
| _vec | Atomic vector | fips_vec, years_vec |
| _p | ggplot object | main_p, trend_p |

## Good
results_list <- list()
for (state_x in unique(mortality_df$state)) {
    subset_df <- mortality_df |>
        dplyr::filter(state == state_x)

    model_x <- fit_model(subset_df)
    results_list[[state_x]] <- model_x
}

## Bad
for (s in unique(data$state)) {
    temp <- data |>
        dplyr::filter(state == s)

    m <- fit_model(temp)
    results[[s]] <- m
}

The suffixed version is longer, but three months from now you’ll know exactly what mortality_df is. You won’t know what data or temp are.[3]

Syntax

Assignment

Always <-. Never = for assignment. Never ->.

## Good
x <- 10

## Bad
x = 10
10 -> x

Pipes

Both |> (base R pipe) and %>% ({magrittr} pipe) are acceptable. We prefer |> for new code: it is built into R (version 4.1.0 and later), so it adds no dependencies, and it is slightly faster. Existing code using %>% doesn’t need to be converted.

Break after the pipe operator. Indent continuation lines 4 spaces.

result_df <- raw_df |>
    dplyr::filter(!is.na(outcome)) |>
    dplyr::mutate(
        rate = count / population * 100000
    ) |>
    dplyr::arrange(year, state)

Strings

Double quotes are the default for strings. Use single quotes only when the string itself contains double quotes. sprintf() is the default for string formatting; glue::glue() is also fine.

## Good
message(sprintf("Processing %s: %d records", state_x, nrow(subset_df)))

## Or with glue
message(glue::glue("Processing {state_x}: {nrow(subset_df)} records"))

## Bad
message(paste0('Processing ', state_x, ': ', nrow(subset_df), ' records'))

return()

Note: Flexibility note

The tidyverse style guide says to use return() only for early returns and to rely on implicit returns otherwise. We’re more flexible: both implicit and explicit return() are acceptable. The important thing is consistency within a file.

Use explicit return() for early returns, where the function bails out of the normal flow. Use implicit returns for the final value at the end of a function, if you prefer that style.

calculate_rate <- function(deaths, population) {
    ## Early return — explicit
    if (population == 0) {
        return(NA_real_)
    }

    ## Final value — implicit is fine
    deaths / population * 100000
}

Function Arguments

Two alignment styles are acceptable. Pick one per file and stick with it.

## Style 1: Align with opening parenthesis
model_fit <- brms::brm(formula = deaths ~ year + (1 | state),
                       data = model_df,
                       family = stats::poisson(),
                       cores = N_CORES)

## Style 2: 4-space continuation indent
model_fit <- brms::brm(
    formula = deaths ~ year + (1 | state),
    data = model_df,
    family = stats::poisson(),
    cores = N_CORES
)

I slightly prefer Style 2 because it’s more readable when function names are long and it keeps diffs cleaner. That said, Style 1 is fine.

Namespace Prefixing

Explicit namespace prefixing is the lab standard. Always use package::function() syntax — even after calling library(). Note, however, that I usually do this after I’m done with the coding. You can use the {prefixer} package to insert these for you.

## Good
library(dplyr)
library(tidyr)

result_df <- raw_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::mutate(rate = deaths / pop * 100000) |>
    tidyr::pivot_longer(
        cols = dplyr::starts_with("age_"),
        names_to = "age_group",
        values_to = "count"
    )

## Bad
library(dplyr)
library(tidyr)

result_df <- raw_df |>
    filter(year >= 2010) |>
    mutate(rate = deaths / pop * 100000) |>
    pivot_longer(
        cols = starts_with("age_"),
        names_to = "age_group",
        values_to = "count"
    )

Important: Why we do this

Three reasons. First, reproducibility — when someone reads your code, they know exactly which package every function comes from without scanning library() calls at the top. Second, it prevents masking conflicts. dplyr::filter() and stats::filter() do very different things, and silent masking has caused real bugs. Third, it makes dependencies explicit when you read code mid-file, which is most of the time.

Lastly, it is low-cost. Again, I do this after I am done coding a file using the {prefixer} package.
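
To make the masking point concrete, here is a small sketch of how different the two filter() functions are. The toy vectors and data frame are made up, and the {dplyr} half is guarded so the sketch still runs where {dplyr} is not installed.

```r
## stats::filter() applies a linear filter to a series; here, a
## centered 3-point moving average
x_vec <- c(1, 2, 3, 4, 5)
smoothed_vec <- as.numeric(stats::filter(x_vec, rep(1 / 3, 3)))

## dplyr::filter() keeps the rows of a data frame matching a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
    toy_df <- data.frame(year = c(2009, 2010, 2011))
    recent_df <- dplyr::filter(toy_df, year >= 2010)
}
```

With the prefix, it is obvious which one you meant. Without it, a bare filter() silently depends on whether {dplyr} happens to be attached at that point in the script.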

Tidyverse Patterns

Data Manipulation

In general, we use {dplyr} for data manipulation and tibble::tibble() over data.frame(). Base R subsetting is fine for quick one-off operations, but pipelines should use {dplyr}. This isn’t because I feel strongly that {tidyverse} is best, but because if we all use the same set of tools, it makes code review a lot more reliable (and faster). That said, we’re pragmatic about this and you should use the most appropriate tool for the job.

Iteration

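purrr::map_*() over lapply() and sapply(). The type-stable variants (purrr::map_dbl(), purrr::map_chr(), purrr::map_dfr()) prevent silent type coercion. Note that purrr 1.0.0 marks map_dfr() as superseded in favor of purrr::map() followed by purrr::list_rbind(); it still works, so existing code doesn’t need to be converted.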

## Good
state_results_df <- purrr::map_dfr(
    state_vec,
    ~ fit_model(.x, data = mortality_df)
)
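
As a quick sketch of what "silent type coercion" means here: sapply() changes its return type depending on the input, while the typed purrr variants do not. The inputs below are toy values, and the {purrr} call is guarded so the sketch runs even where {purrr} is not installed.

```r
## sapply() returns an integer vector for non-empty input...
counts_out <- sapply(list(1:3, 1:5), length)
counts_class <- class(counts_out)

## ...but an empty list for empty input
empty_out <- sapply(list(), length)
empty_class <- class(empty_out)

## purrr::map_int() always returns an integer vector, even when empty
if (requireNamespace("purrr", quietly = TRUE)) {
    empty_int <- purrr::map_int(list(), length)
}
```

Here counts_class is "integer" but empty_class is "list", which is the kind of surprise that only shows up when a loop runs over zero states.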

Conditionals

dplyr::case_when() for multi-branch conditions. dplyr::if_else() over base ifelse() — it’s type-stable and faster.

## Good
mortality_df <- mortality_df |>
    dplyr::mutate(
        age_cat = dplyr::case_when(
            age < 18 ~ "Under 18",
            age < 65 ~ "18-64",
            age >= 65 ~ "65+",
            .default = "Unknown"
        )
    )

## Bad
mortality_df$age_cat <- ifelse(
    mortality_df$age < 18, "Under 18",
    ifelse(mortality_df$age < 65, "18-64",
           ifelse(mortality_df$age >= 65, "65+", "Unknown"))
)
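
One way to see the type-stability difference between ifelse() and dplyr::if_else() is with dates. This is only a sketch with made-up dates; the {dplyr} call is guarded so the block runs even where {dplyr} is not installed.

```r
date_vec <- as.Date(c("2020-01-01", "2021-06-30"))
cutoff_date <- as.Date("2021-01-01")

## Base ifelse() strips the Date class and returns bare numbers
base_out <- ifelse(date_vec < cutoff_date, date_vec, cutoff_date)

## dplyr::if_else() keeps the Date class of its inputs
if (requireNamespace("dplyr", quietly = TRUE)) {
    dplyr_out <- dplyr::if_else(
        date_vec < cutoff_date,
        date_vec,
        cutoff_date
    )
}
```

base_out comes back as plain numbers (days since 1970-01-01) with no warning, whereas dplyr_out is still a Date vector.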

File Paths

here::here() always. Never hardcode absolute paths. Never use relative paths like ../../data/.

## Good
raw_df <- readr::read_csv(here::here("data", "raw", "mortality_2020.csv"))

## Bad
raw_df <- readr::read_csv("/Users/mkiang/projects/mortality/data/raw/mortality_2020.csv")
raw_df <- readr::read_csv("../../data/raw/mortality_2020.csv")

Filesystem Operations

fs::dir_create() and fs::file_exists() over their base R equivalents. The {fs} package is more consistent and cross-platform.

## Good
fs::dir_create(here::here("output", "figures"))

## Also fine but less consistent
dir.create(
    here::here("output", "figures"),
    recursive = TRUE,
    showWarnings = FALSE
)

That said, we often encounter issues with {fs} that the base equivalents don’t have. In these edge cases, use the base equivalents. (For example, fs::dir_ls() really hates reading in folders with lots [i.e., hundreds of thousands] of files.)
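
For those edge cases, a base R fallback might look like the sketch below. The folder and file names are invented for illustration, and the folder lives under tempdir() so the example cleans up with your R session.

```r
## A throwaway folder with a few empty files (illustration only)
demo_dir <- file.path(tempdir(), "many_files_demo")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)
invisible(
    file.create(file.path(demo_dir, sprintf("file_%03d.csv", 1:5)))
)

## Base R stand-in for listing a folder with fs::dir_ls()
csv_vec <- list.files(
    demo_dir,
    pattern = "\\.csv$",
    full.names = TRUE
)
```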

ggplot2

Layer Formatting

Each {ggplot2} layer gets its own line. The + goes at the end of the preceding line, not at the start of the next. Indent each layer 4 spaces from the ggplot() call.

## Good
main_p <- ggplot2::ggplot(
        mortality_df,
        ggplot2::aes(x = year, y = rate, color = state)
    ) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::geom_point(size = 1.5) +
    ggplot2::scale_x_continuous(breaks = seq(2000, 2020, 5)) +
    ggplot2::scale_color_manual(values = STATE_COLORS) +
    ggplot2::labs(
        x = "Year",
        y = "Mortality rate (per 100,000)",
        color = "State"
    ) +
    ggplot2::theme_minimal(base_size = 14)

## Bad
main_p <- ggplot(mortality_df, aes(x = year, y = rate, color = state)) + geom_line(linewidth = 0.8) + geom_point(size = 1.5) + scale_x_continuous(breaks = seq(2000, 2020, 5)) + labs(x = "Year", y = "Mortality rate (per 100,000)")

Namespace Prefixing in ggplot2

The lab standard applies to {ggplot2} too. Use ggplot2::ggplot(), ggplot2::aes(), ggplot2::geom_*(), and so on. Again, we do this after we’ve already done the coding, and we use {prefixer} to do it so it should take you very little time.

Aesthetics

When aes() has more than 2 mappings, put each argument on its own line.

## Compact — 2 or fewer mappings
ggplot2::aes(x = year, y = rate)

## Expanded — 3 or more mappings
ggplot2::aes(
    x = year,
    y = rate,
    color = race_cat,
    linetype = sex_cat
)

Saving Figures

Always specify width, height, and dpi (for raster) or device (for vector). Save both a PDF for the journal and a high-DPI JPG for presentations and preprints.

## PDF for journal submission
ggplot2::ggsave(
    here::here("output", "figures", "fig_01_main_results.pdf"),
    plot = main_p,
    device = grDevices::cairo_pdf,
    width = 8,
    height = 6
)

## High-DPI jpg for presentations
ggplot2::ggsave(
    here::here("output", "figures", "fig_01_main_results.jpg"),
    plot = main_p,
    width = 8,
    height = 6,
    dpi = 300
)

Piping into ggplot

I have a strong preference for not piping data (especially after performing some manipulation on the data frame) into {ggplot2}. It’s better to save it as a separate object and call that directly. For publications, all figures should also have a numerical (i.e., csv) file saved as well, so having a separate object is helpful.

## Avoid: manipulating and plotting in one pipeline
mortality_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(mean_rate = mean(rate), .groups = "drop") |>
    ggplot2::ggplot(ggplot2::aes(x = year, y = mean_rate, color = state)) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::labs(
        x = "Year",
        y = "Mean mortality rate (per 100,000)"
    ) +
    ggplot2::theme_minimal()
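
The pattern I prefer instead splits that pipeline in two: build and save the data behind the figure, then plot the saved object. This is a sketch, not a template. The toy mortality_df values are made up, trend_df and trend_p are illustrative names, and the CSV goes to tempdir() only so the example runs anywhere (a real project would use here::here()).

```r
## Toy data standing in for mortality_df (values are made up)
mortality_df <- data.frame(
    state = rep(c("CA", "NY"), each = 4),
    year = rep(c(2009, 2010, 2010, 2011), times = 2),
    rate = c(10, 12, 14, 13, 8, 9, 11, 10)
)

## Do the manipulation first and keep the result as its own object
trend_df <- mortality_df |>
    dplyr::filter(year >= 2010) |>
    dplyr::group_by(state, year) |>
    dplyr::summarize(mean_rate = mean(rate), .groups = "drop")

## Save the numbers behind the figure
csv_path <- file.path(tempdir(), "fig_trend.csv")
utils::write.csv(trend_df, csv_path, row.names = FALSE)

## Then plot the saved object directly
trend_p <- ggplot2::ggplot(
        trend_df,
        ggplot2::aes(x = year, y = mean_rate, color = state)
    ) +
    ggplot2::geom_line(linewidth = 0.8) +
    ggplot2::labs(
        x = "Year",
        y = "Mean mortality rate (per 100,000)"
    ) +
    ggplot2::theme_minimal()
```

Because trend_df exists on its own, the CSV and the figure are guaranteed to come from the same numbers.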

Faceting

Put facet_wrap() and facet_grid() arguments on separate lines when they’re complex. Use labeller for human-readable facet labels.

main_p +
    ggplot2::facet_wrap(
        ~ age_cat,
        ncol = 3,
        scales = "free_y",
        labeller = ggplot2::labeller(
            age_cat = c(
                "under_18" = "Under 18",
                "18_64" = "18–64",
                "65_plus" = "65+"
            )
        )
    )

Documentation

Comments

Comments explain why, not what. If your code needs a comment to explain what it does, the code might be too clever.

Use ## for section-level comments and # for inline comments. Inline comments go on their own line above the code they describe, not at the end of a line.

## Good
## Exclude states with suppressed counts to avoid bias from
## differential suppression thresholds across states
clean_df <- raw_df |>
    dplyr::filter(count >= MIN_CELL_SIZE)

## Bad
## Filter the data
clean_df <- raw_df |>
    dplyr::filter(count >= MIN_CELL_SIZE)  # remove small counts

Roxygen2

For important functions, add {roxygen2}-style documentation with #'.

You don’t need to document every internal helper, but crucial functions that will need to be checked by somebody else should have {roxygen2}-style documentation.

#' Calculate age-adjusted mortality rate
#'
#' @param deaths_df Data frame with columns `age_group`, `deaths`, `population`
#' @param standard_df Data frame with columns `age_group`, `weight`
#' @returns A single numeric value: the age-adjusted rate per 100,000
calculate_age_adjusted_rate <- function(deaths_df, standard_df) {
    merged_df <- dplyr::left_join(deaths_df, standard_df, by = "age_group")

    sum(merged_df$deaths / merged_df$population * merged_df$weight) * 100000
}

Resources


  1. In RStudio, go to the document outline pane (top right of the editor, or Ctrl/Cmd+Shift+O) to see all your sections. This is why the ---- convention matters: RStudio recognizes it.

  2. I know people have strong feelings about indentation. Two spaces is fine in other contexts. In this lab, it’s four. I’m old and it helps me with reading your code. This isn’t a hill worth dying on; just set your editor and forget about it.

  3. Your most frequent collaborator is future-you, and future-you has no idea what past-you was thinking.