9  Doing Science

Now that you’ve joined the lab, it’s reasonable to ask how we actually do science. This chapter lays out our core practices, though there are many other resources you should also become familiar with.

9.1 Open Science

Our science should be transparent, reproducible, and accessible. This is non-negotiable, but we are pragmatic about this1 and bound to the obligations of our data use agreements and relevant laws.

This isn’t just good practice — it’s how we build trust in our findings and contribute meaningfully to the field. For a good overview of why open and reproducible practices matter, see Heise et al.’s “Ten simple rules for implementing open and reproducible research practices after attending a training course” (2023, PLOS Computational Biology).

Open data and code: By default, all code and data associated with our papers should be publicly available.

  • Publishing code on our lab GitHub organization with clear documentation
  • Sharing data (with appropriate de-identification and ethical considerations) on GitHub, Zenodo, or the Open Science Framework
  • Writing clear README files and documentation so others can reproduce and build on our work
  • If you have questions about data sharing due to privacy or licensing concerns, let’s discuss early.

Reproducibility: Reproducibility is non-negotiable.

  • Your code should run end-to-end without manual intervention.
  • Document all dependencies, versions, and computational environment.
  • Use version control for everything internally (code, manuscripts, even analysis notebooks).
  • Share the final version of your code online (without a git history).2
  • Test your pipeline on a clean environment before publishing.

Why this matters: Our work informs policy and shapes how people understand public health. That responsibility means we need to do it right, and we need to be able to show our work. Transparent science also makes us better scientists — it forces us to be clear about our assumptions and methods.

9.1.1 Things I don’t feel strongly about

For a project to be truly reproducible, we should be setting seeds during random draws (e.g., in simulations). I do not feel strongly about this at all. If you are running so few simulations that changing the seed changes your results in any meaningful way, we should not be publishing it. I’m much more concerned about your results being robust than I am about the thousandth decimal place being exactly the same. To further complicate the issue, setting the seed in R does not actually guarantee reproducibility across R versions or computational environments. Instead, I think you should just run enough simulations that you don’t need to worry about this.
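A toy illustration of the point above (the “study,” replicate counts, and function name are all invented for this sketch):

```r
# Toy "study": estimate the mean of a standard normal from 100 draws,
# repeated n_sims times. With few replicates, the seed visibly moves the
# answer; with many replicates, any seed lands in essentially the same place.
run_sim <- function(n_sims, seed) {
  set.seed(seed)
  mean(replicate(n_sims, mean(rnorm(100))))
}

run_sim(n_sims = 50, seed = 1)      # unstable: changing the seed shifts the estimate
run_sim(n_sims = 50, seed = 2)
run_sim(n_sims = 50000, seed = 1)   # stable to several decimal places across seeds
run_sim(n_sims = 50000, seed = 2)
```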

I don’t feel strongly about pre-registration of empirical descriptive studies. The motivation of pre-registration is to prevent p-hacking our way to success and it’s a worthwhile goal, but again there are better ways of assessing the robustness of your results. I would rather you show honest and extensive sensitivity analyses, multiverse approaches, quantitative bias analyses, or other methods of assessing how robust your results are.

That said, for studies where we collect primary data, run a proper experiment, or rely on a strong causal identification strategy, pre-registration is great.

9.2 Data Analysis

We use R as our primary programming language. It has excellent tools for statistical analysis, visualization, and reproducible research. Here’s how we work:

Code style and standards: Use the lab style guide and project structure document for guidance. See my slides and notes about The Scientific Computing Workflow for more detail. For situations where these guides are not quite right for your research project, our fallback is Wilson et al.’s “Good enough practices in scientific computing” (2017, PLOS Computational Biology).

These rules will get you 70% of the way there for code:

  • Use snake_case for variables and functions.
  • Use meaningful variable names.
  • Aim for readable code over clever code.
  • Use the pipe (|> or %>%) to write readable data pipelines.
  • Comment liberally, especially for non-obvious logic.
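In practice, those rules look something like this (variable names and data are invented for illustration):

```r
# Meaningful snake_case names; readable beats clever.
visit_counts <- c(3, 0, 5, 2, 0, 1)   # visits per patient (toy data)

# Proportion of patients with at least one visit. A "clever" one-liner
# could do this too, but this version reads aloud.
prop_any_visit <- mean(visit_counts > 0)

# The pipe keeps transformations in reading order:
top_counts <- visit_counts |>
  sort(decreasing = TRUE) |>
  head(3)
```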

The tidyverse ecosystem: Our data analysis workflows center on tidyverse packages.

  • dplyr for data manipulation
  • ggplot2 for visualization
  • tidyr for reshaping data
  • These tools are designed to work together and encourage clear, readable code.

That said, use whatever package is most appropriate for your task. Keep in mind, though, that having everybody work within a common set of tools allows for more thorough and rapid code review.
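A minimal sketch of how these pieces compose, using the built-in mtcars data so it runs anywhere (the summary itself is just for illustration):

```r
library(dplyr)
library(ggplot2)

# Summarize miles per gallon by cylinder count, then plot the result.
cyl_summary <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n_cars = n(), .groups = "drop")

ggplot(cyl_summary, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")
```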

Package management with renv: Every project uses renv to manage package versions and ensure reproducibility.

  • Run renv::init() when starting a new project.
  • Commit renv.lock to version control.
  • Team members use renv::restore() to get the exact same environment.
  • Update packages intentionally with renv::update(), not accidentally.
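The lifecycle above, as console commands run from the project root (the example package is arbitrary):

```r
# One-time, at project start:
renv::init()       # creates renv.lock plus a project-local library

# During development:
install.packages("broom")   # example package; installs into the project library
renv::snapshot()            # record the new version in renv.lock, then commit it

# On a collaborator's machine:
renv::restore()    # recreate the exact environment from renv.lock

# Deliberate upgrades only:
renv::update()     # then re-run your checks and snapshot again
```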

Code review: Before committing major analysis code, request a review from a lab member. This catches bugs, clarifies fuzzy logic, and spreads knowledge across the lab. Use GitHub pull requests for this — they create a documented record of what was reviewed and why. Even better is parallel coding where two people independently work on the same data set to answer the same question and see if they can arrive at the same answer.

Documentation: Write clear comments in your code.

  • Explain why you’re doing something, not just what you’re doing.
  • Document function arguments and return values.
  • Explain any non-standard statistical approaches or transformations.
  • If you’re making a dataset, document how you created it.

Quarto for reports and manuscripts: Use Quarto for all tables and numbers in your manuscript. This minimizes errors and allows us to quickly update numbers with new data.
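As a sketch of what this looks like in a .qmd file (the variable names and numbers here are placeholders, not real results):

````markdown
```{r}
#| echo: false
# Placeholder values; in a real project these come from the analysis itself.
n_counties <- 254
mean_rate <- 12.3
```

We analyzed `r n_counties` counties; the mean rate was `r mean_rate` per
100,000 residents. Re-rendering after a data update refreshes every number
automatically.
````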

9.3 Version Control

Every project — no matter how small — should be a Git repository. Version control is how we track changes, collaborate safely, and maintain a history of our work.

Getting started with Git/GitHub:

  • All lab projects live in the lab GitHub organization.
  • If you’re new to Git, complete a tutorial (GitHub has good ones). Ask the lab if you want recommendations.
  • We use GitHub for both code and manuscript repositories.

Branching strategy: If you are working alone, a branching strategy is not necessary. But if several people are working on the code, it is useful to have a few basic rules.

  • Keep main clean — never push to main directly.
  • Create a new branch for each analysis (feature/new-model, fix/bug-in-pipeline).
  • Make your changes on the branch, commit frequently.
  • Open a pull request (PR) when you’re ready for review.
  • After approval, merge to main.
  • Delete the branch after merging to keep things tidy.
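The whole cycle, as a self-contained sketch you can run in a throwaway directory (branch, file, and commit names are illustrative; in real projects the merge happens through a reviewed PR on GitHub rather than locally):

```shell
# Set up a scratch repository (git >= 2.28 for `init -b`).
tmp_dir=$(mktemp -d)
cd "$tmp_dir"
git init -q -b main
git config user.email "lab@example.com"
git config user.name "Lab Member"

echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add initial analysis script"

git checkout -q -b feature/new-model   # one branch per analysis
echo 'y <- 2' >> analysis.R
git add analysis.R
git commit -q -m "Add Bayesian spatial model for county-level analysis"

git checkout -q main
git merge -q --no-ff feature/new-model -m "Merge feature/new-model"
git branch -d feature/new-model        # tidy up after merging
```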

Commit messages: Write clear, informative commit messages. Use the present tense and be specific.

  • Good: “Add Bayesian spatial model for county-level analysis”
  • Bad: “stuff”, “fixed things”
  • If your commit fixes an issue, reference it: “Fix data import bug (closes #42).”
  • Keep the first line short (under 50 characters), then add more detail if needed.

Pull request workflow:

  • Open a PR with a description of what you’re doing and why.
  • Link to any related issues.
  • Request review from at least one other lab member.
  • Address feedback and update the PR.
  • Squash and merge to keep the history clean (or rebase if you prefer).

What to commit:

  • All code (obviously)
  • Data if it’s small and not sensitive
  • Quarto files that generate tables or numbers
  • Figures
  • Configuration files and environment files (renv.lock, .gitignore)
  • README files and documentation

What not to commit: Large data files (use .gitignore), sensitive information, and the renv cache.
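A minimal .gitignore sketch consistent with the lists above (the data paths are illustrative; renv also generates its own ignore rules inside the renv/ folder):

```
# Large or sensitive data stays out of the repository
data/raw/
*.rds

# renv: commit renv.lock, but never the project library or cache
renv/library/
renv/staging/

# Local credentials
.Renviron
```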

9.4 Computing Resources

Most of our work is computational, so you need to know where and how to run your analyses efficiently.

Sherlock cluster: For computationally intensive work that does not use high-risk data.

  • Sherlock is Stanford’s shared computing cluster with hundreds of cores.
  • Use it for long-running simulations, hyperparameter tuning, or large-scale analyses.
  • You have an allocation through the lab (thanks to lab funding).
  • Learn to write SLURM job submission scripts — it’s worth the effort.
  • Start with small test runs locally, then scale up on Sherlock.
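A minimal SLURM submission script sketch (the job name, resource requests, and file paths are all illustrative; check Sherlock’s documentation for current module names and partitions):

```shell
#!/bin/bash
#SBATCH --job-name=sim_study        # illustrative job name
#SBATCH --time=04:00:00             # wall-clock limit: 4 hours
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=logs/%x_%j.out     # one log file per job

# Load R and run the simulation script (module name may differ on Sherlock)
module load R
Rscript run_simulation.R
```

Submit with `sbatch` and monitor with `squeue -u $USER`.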

Carina: For computationally intensive work that uses high-risk data. Otherwise, it works the same as Sherlock.

Local computing: Your laptop is the right tool for day-to-day tasks.

  • Initial exploration and code development
  • Quick analyses and visualization
  • Manuscript writing and manuscript edits
  • Anything that runs in a few minutes
  • Non-sensitive data that can be stored on your laptop (or simulated data)

Cloud resources: We occasionally use cloud computing (AWS, Google Cloud) for specific projects. If you need cloud resources, talk to Matt — there are costs involved and we may need approvals.

General principle: Start locally (sometimes on fake data), test, validate, then scale up. This saves time and prevents costly mistakes on expensive computing resources.

9.5 Publications

Publishing papers can be expensive. This is not your concern. If you’re the first author, I will pay the publication fee on all lab papers.3 For advanced PhD students and postdocs, I will also pay the open-access fee.

Instead, your concern should be on how to turn your work into a publication. This is how our work makes an impact. It is important for your career, and it is a skill you build through practice.

Target journals:

  • We aim for reputable, open-access journals in epidemiology, public health, general interest, medical, and computational fields. This means we avoid journals that are predatory or quasi-predatory.
  • My general rule is that your paper should get rejected at least five times. Publishing is a stochastic process and there is no way to out-strategize it. Aiming high and shooting often are the best strategies.
  • Consider the scope of your work — is this foundational methods work or an application to a specific problem?
  • Let’s discuss target journals early in the writing process. You should have a full “submission pathway” of 5-7 journals before you submit the first time.

Co-author review:

  • Share drafts with all coauthors early and often.
  • Allow at least 2 weeks for coauthor review (longer for major drafts).
  • Incorporate feedback generously — coauthors often catch things you miss.
  • Resolve disagreements through discussion; we don’t override coauthor concerns lightly.

Data and code availability:

  • Before submission, ensure your data and code are ready to share.
  • Include a statement in your manuscript about where code and data are available.
  • Example: “All code and data are available at https://github.com/KiangLab/[project-name].”
  • If data cannot be shared (privacy concerns), explain why and describe what’s available.

9.6 Authorship

Authorship should reflect genuine intellectual contribution. We use clear, fair criteria to determine who gets authorship and in what order. Authorship is a discussion we will have early and we will have often.

ICMJE criteria: We follow the International Committee of Medical Journal Editors (ICMJE) guidelines.

  • Substantial contributions to conception or design OR data acquisition, analysis, or interpretation
  • AND drafting the article or revising it critically for important intellectual content
  • AND final approval and agreement to be accountable for the work

All three criteria must be met. If someone meets only one or two, they get acknowledged instead. Everybody who meets all three criteria will be an author unless they explicitly decline.

CRediT taxonomy: Beyond ICMJE, we use CRediT to specify what each author actually did.

  • Conceptualization, Funding acquisition, Investigation, etc.
  • Include CRediT statements in your manuscript’s author contributions section.
  • This clarifies roles and is increasingly requested by journals.

Authorship order: Author order reflects contribution to the work.

  • First author typically did most of the work, and wrote the first draft.
  • Last author is often the lab PI (though not always — discuss with me).
  • Second author is usually the person who did the analysis (if this is not the first author).
  • Middle authors are ordered by contribution magnitude.

Authorship order is not static. As the project evolves or authors’ bandwidth changes, we may move authors around to keep the project moving. This is a normal part of the process. To be first author, the expectation is that you did the majority of the work from start to acceptance. If somebody else has to finish up the revise and resubmit for you, we will discuss ways of appropriately recognizing their contribution such as co-first authorship or swapping positions.

When to discuss authorship: We will discuss authorship early and often.

  • At project inception, discuss who will likely be involved and their roles.
  • If new people join mid-project, revisit authorship expectations.
  • Before submission, confirm authorship with everyone (names, emails, affiliations).
  • If a revise and resubmit requires substantial work that the first author cannot do, we may bring on an additional author.
  • Don’t spring authorship surprises on people; they should expect it.

Authorship disputes: If there’s disagreement about authorship, address it head-on.

  • Raise it early (not at final acceptance).
  • Refer back to ICMJE criteria — did the person meet all three?
  • Discuss with the people involved and with me.
  • We aim for fairness, not politics.

9.7 Conferences

Conferences are where we present our work, learn from others, and build our research network. Here’s how we approach them:

Presenting your work:

  • We aim to present major research findings at conferences in our field or subfield.
  • Conferences are good for feedback before publication.
  • Oral presentations and posters are both valuable: different audiences, different strengths.
  • Give yourself time to prepare a good presentation.

Choosing which conferences to attend:

  • Prioritize conferences in epidemiology, public health, or related fields.
  • Consider the scope and audience. A niche conference might be better for very specialized work.
  • Ask the lab — we can advise on which meetings have the best impact for your work.

Oral presentations vs. posters:

  • Oral talks reach more people and get more visibility.
  • Posters allow deeper one-on-one discussions.
  • Competitive presentations are great for early-career development (apply for them!).
  • Both are valuable; we’ll discuss which fits your work best.

Funding for travel:

  • The lab has funding to support travel to present our research.
  • Usually covers registration, flights, and hotel.
  • Let me know early if you’re planning to submit an abstract.
  • Some conferences have early-bird discounts — we’ll coordinate timing.

Before you go:

  • Prepare your slides or poster well in advance.
  • Do a practice talk for the lab.
  • Respect abstract word limits; they are tight.
  • Prepare for questions and criticism.

  1. For example, if a data set was expensive or took us a long time to collect, we may only share the sections of the data necessary to reproduce the paper rather than the entire data set, or we may deposit the data set online but set an embargo date on it to give students and postdocs an opportunity to wrap up their projects before it becomes publicly available.↩︎

  2. This is an intentional precaution to prevent us from accidentally uploading API keys or proprietary information that violates our DUA. Once those elements enter the commit history, they are hard for humans to find and easy for bots to find. We use GitHub (and commit histories) during the collaborative process and we use it to share our manuscript code but those are two different uses.↩︎

  3. It is a “lab” paper if it is funded by the lab or I am the senior/corresponding author on the paper.↩︎