9 Doing Science
Now that you’ve joined the lab, it’s reasonable to ask how we actually do science. There are many other resources you should also be familiar with:
- I have slides and notes about The Scientific Computing Workflow that may be helpful.
- The code style guide outlines standard code format for the lab.
- We use a template repository as the base for new projects.
- There’s a corresponding appendix about project structure that goes into more detail about the template repository.
9.1 Open Science
Our science should be transparent, reproducible, and accessible. This is non-negotiable, but we are pragmatic about this[^1] and bound to the obligations of our data use agreements and relevant laws.
This isn’t just good practice — it’s how we build trust in our findings and contribute meaningfully to the field. For a good overview of why open and reproducible practices matter, see Heise et al.’s “Ten simple rules for implementing open and reproducible research practices after attending a training course” (2023, PLOS Computational Biology).
Open data and code: By default, all code and data associated with our papers should be publicly available.
- Publishing code on our lab GitHub organization with clear documentation
- Sharing data (with appropriate de-identification and ethical considerations) on GitHub, Zenodo, or the Open Science Framework
- Writing clear README files and documentation so others can reproduce and build on our work
- If you have questions about data sharing due to privacy or licensing concerns, let’s discuss early.
Reproducibility: Reproducibility is non-negotiable.
- Your code should run end-to-end without manual intervention.
- Document all dependencies, versions, and computational environment.
- Use version control for everything internally (code, manuscripts, even analysis notebooks).
- Share the final version of your code online (without a git history).[^2]
- Test your pipeline on a clean environment before publishing.
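As a concrete sketch of the clean-environment check, assuming a Quarto project managed with renv (the repository name below is a placeholder):

```shell
# Fresh clone in a throwaway directory, then rebuild everything from scratch.
git clone https://github.com/KiangLab/project-name /tmp/repro-check
cd /tmp/repro-check

# Recreate the recorded package environment, then run the pipeline end-to-end.
Rscript -e 'renv::restore()'
quarto render
```

If `quarto render` completes without errors on a machine that has never seen the project, that is a reasonable signal the pipeline is reproducible.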
Why this matters: Our work informs policy and shapes how people understand public health. That responsibility means we need to do it right, and we need to be able to show our work. Transparent science also makes us better scientists — it forces us to be clear about our assumptions and methods.
9.1.1 Things I don’t feel strongly about
For a project to be truly reproducible, we should be setting seeds during random draws (e.g., in simulations). I do not feel strongly about this at all. If you are running so few simulations that changing the seed changes your results in any meaningful way, we should not be publishing it. I’m much more concerned about your results being robust than I am about the thousandth decimal place being exactly the same. To further complicate the issue, saving the seed in R actually does not guarantee reproducibility across versions of R or different environments. Instead, I think you should just run enough simulations that you don’t need to worry about this.
I don’t feel strongly about pre-registration of empirical descriptive studies. The motivation of pre-registration is to prevent p-hacking our way to success and it’s a worthwhile goal, but again there are better ways of assessing the robustness of your results. I would rather you show honest and extensive sensitivity analyses, multiverse approaches, quantitative bias analyses, or other methods of assessing how robust your results are.
That said, for studies where we collect primary data, we are running a proper experiment, or we are trying to use a strong causal identification, pre-registration is great.
9.2 Data Analysis
We use R as our primary programming language. It has excellent tools for statistical analysis, visualization, and reproducible research. Here’s how we work:
Code style and standards: Use the lab style guide and project structure document for guidance. See my slides and notes about The Scientific Computing Workflow for more detail. For situations where these guides are not quite right for your research project, our fallback is Wilson et al.’s “Good enough practices in scientific computing” (2017, PLOS Computational Biology).
These rules will get you 70% of the way there for code:
- Use snake_case for variables and functions.
- Use meaningful variable names.
- Aim for readable code over clever code.
- Use the pipe (`|>` or `%>%`) to write readable data pipelines.
- Comment liberally, especially for non-obvious logic.
The tidyverse ecosystem: Our data analysis workflows center on tidyverse packages.
- `dplyr` for data manipulation
- `ggplot2` for visualization
- `tidyr` for reshaping data
- These tools are designed to work together and encourage clear, readable code.
That said, use whatever package is most appropriate for your task. Having everybody use the same common set of tools allows for more thorough and rapid code review.
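A small sketch illustrating the style rules and a tidyverse pipeline (the file, columns, and variable names are made up for illustration):

```r
library(dplyr)
library(ggplot2)

# Meaningful snake_case names; the pipe keeps each step readable.
county_deaths <- read.csv("data/county_deaths.csv")  # hypothetical file

deaths_by_year <- county_deaths |>
  filter(!is.na(death_count)) |>             # drop rows with missing counts
  group_by(year) |>
  summarize(total_deaths = sum(death_count))

# Plot the yearly totals.
ggplot(deaths_by_year, aes(x = year, y = total_deaths)) +
  geom_line()
```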
Package management with renv: Every project uses renv to manage package versions and ensure reproducibility.
- Run `renv::init()` when starting a new project.
- Commit `renv.lock` to version control.
- Team members use `renv::restore()` to get the exact same environment.
- Update packages intentionally with `renv::update()`, not accidentally.
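The renv lifecycle above, plus `renv::snapshot()` (which records the currently installed package versions into `renv.lock`), looks roughly like this in practice:

```r
renv::init()               # once, when creating the project
install.packages("dplyr")  # install packages as usual
renv::snapshot()           # write versions to renv.lock, then commit that file
renv::restore()            # collaborators: recreate the recorded environment
renv::update()             # deliberate, reviewed package upgrades
```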
Code review: Before committing major analysis code, request a review from a lab member. This catches bugs, clarifies fuzzy logic, and spreads knowledge across the lab. Use GitHub pull requests for this — they create a documented record of what was reviewed and why. Even better is parallel coding, where two people independently work on the same data set to answer the same question and see if they can arrive at the same answer.
Documentation: Write clear comments in your code.
- Explain why you’re doing something, not just what you’re doing.
- Document function arguments and return values.
- Explain any non-standard statistical approaches or transformations.
- If you’re making a dataset, document how you created it.
Quarto for reports and manuscripts: Use Quarto for all tables and numbers in your manuscript. This minimizes errors and allows us to quickly update numbers with new data.
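For example, computing a number in a chunk and referencing it inline means the manuscript text updates automatically when the data change (the object and file names here are hypothetical):

````markdown
```{r}
#| echo: false
county_data <- readr::read_csv("data/county_data.csv")
n_counties <- nrow(county_data)
```

We analyzed data from `r n_counties` counties.
````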
9.3 Version Control
Every project — no matter how small — should be a Git repository. Version control is how we track changes, collaborate safely, and maintain a history of our work.
Getting started with Git/GitHub:
- All lab projects live in the lab GitHub organization.
- If you’re new to Git, complete a tutorial (GitHub has good ones). Ask the lab if you want recommendations.
- We use GitHub for both code and manuscript repositories.
Branching strategy: If you are working alone, a branching strategy is not necessary. But if several people are working on the code, it is useful to have a few basic rules.
- Keep `main` clean — never push to `main` directly.
- Create a new branch for each analysis (`feature/new-model`, `fix/bug-in-pipeline`).
- Make your changes on the branch, commit frequently.
- Open a pull request (PR) when you’re ready for review.
- After approval, merge to `main`.
- Delete the branch after merging to keep things tidy.
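Sketched as shell commands (the branch and file names are hypothetical; run inside a clone of the project repository):

```shell
git switch main && git pull           # start from an up-to-date main
git switch -c feature/new-model       # one branch per analysis
# ...edit code, then commit frequently:
git add R/model.R
git commit -m "Add Bayesian spatial model for county-level analysis"
git push -u origin feature/new-model  # publish the branch, then open a PR
# after the PR is approved and merged on GitHub:
git switch main && git pull
git branch -d feature/new-model       # delete the merged branch locally
```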
Commit messages: Write clear, informative commit messages. Use the present tense and be specific.
- Good: “Add Bayesian spatial model for county-level analysis”
- Bad: “stuff”, “fixed things”
- If your commit fixes an issue, reference it: “Fix data import bug (closes #42).”
- Keep the first line short (under 50 characters), then add more detail if needed.
Pull request workflow:
- Open a PR with a description of what you’re doing and why.
- Link to any related issues.
- Request review from at least one other lab member.
- Address feedback and update the PR.
- Squash and merge to keep the history clean (or rebase if you prefer).
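If you use GitHub's `gh` command-line tool, the PR steps can also be driven from the terminal (the title, body, and issue number are examples):

```shell
gh pr create \
  --title "Add Bayesian spatial model for county-level analysis" \
  --body "Fits the county-level model described in issue #42. Closes #42."
# ...after review and approval:
gh pr merge --squash --delete-branch
```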
What to commit:
- All code (obviously)
- Data if it’s small and not sensitive
- Quarto files that generate tables or numbers
- Figures
- Configuration files and environment files (`renv.lock`, `.gitignore`)
- README files and documentation

What not to commit: Large data files (use `.gitignore`), sensitive information, the `renv` cache.
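A starting point for a project `.gitignore`, following the rules above (the data paths are examples and will vary by project):

```
# Large or sensitive data files stay out of version control
data/raw/
*.rds

# renv library and caches (renv.lock itself IS committed)
renv/library/
renv/staging/

# R session artifacts
.Rhistory
.RData
.Rproj.user/
```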
9.4 Computing Resources
Most of our work is computational, so you need to know where and how to run your analyses efficiently.
Sherlock cluster: For computationally intensive work that does not use high-risk data.
- Sherlock is Stanford’s shared computing cluster with hundreds of cores.
- Use it for long-running simulations, hyperparameter tuning, or large-scale analyses.
- You have an allocation through the lab (thanks to lab funding).
- Learn to write SLURM job submission scripts — it’s worth the effort.
- Start with small test runs locally, then scale up on Sherlock.
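A minimal SLURM submission script as a sketch (the partition, resource requests, and script name are placeholders; check the Sherlock documentation for current partition names):

```bash
#!/bin/bash
#SBATCH --job-name=simulation
#SBATCH --partition=normal         # placeholder; use your group's partition
#SBATCH --time=04:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
#SBATCH --output=logs/%x-%j.out    # job name and ID in the log file name

module load R                      # load Sherlock's R module
Rscript code/run_simulation.R      # placeholder script name
```

Submit with `sbatch simulation.sbatch` and monitor with `squeue -u $USER`.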
Carina: For computationally intensive work that uses high-risk data. Otherwise, it works the same as Sherlock.
Local computing: Your laptop is right for day-to-day tasks.
- Initial exploration and code development
- Quick analyses and visualization
- Manuscript writing and manuscript edits
- Anything that runs in a few minutes
- Non-sensitive data that can be stored on your laptop (or simulated data)
Cloud resources: We occasionally use cloud computing (AWS, Google Cloud) for specific projects. If you need cloud resources, talk to Matt — there are costs involved and we may need approvals.
General principle: Start locally (sometimes on fake data), test, validate, then scale up. This saves time and prevents costly mistakes on expensive computing resources.
9.5 Publications
Publishing papers can be expensive. This is not your concern. If you’re the first author, I will pay for the publication fee on all lab papers.[^3] For advanced PhD students and postdocs, I will also pay for the open-access fee.
Instead, your concern should be on how to turn your work into a publication. This is how our work makes an impact. It is important for your career, and it is a skill you build through practice.
Target journals:
- We aim for reputable, open-access journals in epidemiology, public health, general interest, medical, and computational fields. This means we avoid journals that are predatory or quasi-predatory.
- My general rule is that your paper should get rejected at least five times. Publishing is a stochastic process and there is no way to out-strategize it. Aiming high and shooting often are the best strategies.
- Consider the scope of your work — is this foundational methods work or an application to a specific problem?
- Let’s discuss target journals early in the writing process. You should have a full “submission pathway” of 5-7 journals before you submit the first time.
Co-author review:
- Share drafts with all coauthors early and often.
- Allow at least 2 weeks for coauthor review (longer for major drafts).
- Incorporate feedback generously — coauthors often catch things you miss.
- Resolve disagreements through discussion; we don’t override coauthor concerns lightly.
Data and code availability:
- Before submission, ensure your data and code are ready to share.
- Include a statement in your manuscript about where code and data are available.
- Example: “All code and data are available at https://github.com/KiangLab/[project-name].”
- If data cannot be shared (privacy concerns), explain why and describe what’s available.
9.6 Conferences
Conferences are where we present our work, learn from others, and build our research network. Here’s how we approach them:
Presenting your work:
- We aim to present major research findings at conferences in our field or subfield.
- Conferences are good for feedback before publication.
- Both oral presentations and posters are valuable — different audiences, different strengths.
- Give yourself time to prepare a good presentation.
Choosing which conferences to attend:
- Prioritize conferences in epidemiology, public health, or related fields.
- Consider the scope and audience. A niche conference might be better for very specialized work.
- Ask the lab — we can advise on which meetings have the best impact for your work.
Oral presentations vs. posters:
- Oral talks reach more people and get more visibility.
- Posters allow deeper one-on-one discussions.
- Competitive presentations are great for early-career development (apply for them!)
- Both are valuable; we’ll discuss which fits your work best.
Funding for travel:
- The lab has funding to support travel to present our research.
- Usually covers registration, flights, and hotel.
- Let me know early if you’re planning to submit an abstract.
- Some conferences have early-bird discounts — we’ll coordinate timing.
Before you go:
- Prepare your slides or poster well in advance.
- Do a practice talk for the lab.
- Respect abstract word limits (they are tight).
- Prepare for questions and criticism.
[^1]: For example, if a data set was expensive or took us a long time to collect, we may only share the sections of the data necessary to reproduce the paper rather than the entire data set, or we may deposit the dataset online but set an embargo date on it to give students and postdocs an opportunity to wrap up their projects before it becomes publicly available.

[^2]: This is an intentional precaution to prevent us from accidentally uploading API keys or proprietary information that violates our DUA. Once those elements enter the commit history, they are hard for humans to find and easy for bots to find. We use GitHub (and commit histories) during the collaborative process, and we use it to share our manuscript code, but those are two different uses.

[^3]: It is a “lab” paper if it is funded by the lab or I am the senior/corresponding author on the paper.