Hitchhiker’s 👍 Guide To Reproducibility

Workshop at CEPII Paris

Florian Oswald

florian.oswald@sciencespo.fr

SciencesPo Paris, RES Data Editor

26 February, 2024

Agenda

10 simple rules to Reproducibility compiled by the Econ Data Editors.
The README file.
Some Reproducibility Best Practices.
Three attempts at reproducing published papers.

10 simple rules to Reproducibility

Computational Empathy
Make data accessible
Cite Data and how to access it
Describe software and hardware requirements
Provide all code

Explain how to reproduce your work
Provide a table of all things that can be reproduced
Include all supporting material
Use a permissible license. Any license is better than none.
Re-run everything!

The `README` File

A plain-text, top-level file which explains everything about your package.
We have a useful template and a template generator.
Here are the minimum requirements for a README at The Economic Journal.

Best Practices

Project Organisation (folder structure)
Code
Data
Output

Best Practices

Project Organisation

Folder Structure is a first order concern for your project.

Minimum Requirement

There should be a separation along:

Inputs: Data, parameters, etc
Outputs: Numbers, tables, figures
Code
Paper/Report etc

Example?

Best Practices

Good or Bad?

.
├── 20211107ext_2v1.do
├── 20220120ext_2v1.do
├── 20221101wave1.dta
├── james
│   └── NLSY97
│       └── nlsy97_v2.do
├── mary
│   └── NLSY97
│       └── nlsy97.do
├── matlab_fortran
│   ├── graphs
│   ├── sensitivity1
│   │   ├── data.xlsx
│   │   ├── good_version.do
│   │   └── script.m
│   └── sensitivity2
│       ├── models.f90
│       ├── models.mod
│       └── nrtype.f90
├── readme.do
├── scatter1.eps
├── scatter1_1.eps
├── scatter1_2.eps
├── ts.eps
├── wave1.dta
├── wave2.dta
├── wave2regs.dta
└── wave2regs2.dta

(scroll down! 😉)

Bad! 👎

Subdirectories are not helpful
File names are confusing
Code, data and output are not separated

Best Practices

Good 👍

.
├── README.md
├── code
│   ├── R
│   │   ├── 0-install.R
│   │   ├── 1-main.R
│   │   ├── 2-figure2.R
│   │   └── 3-table2.R
│   ├── stata
│   │   ├── 1-main.do
│   │   ├── 2-read_raw.do
│   │   ├── 3-figure1.do
│   │   ├── 4-figure3.do
│   │   └── 5-table1.do
│   └── tex
│       ├── appendix.tex
│       └── main.tex
├── data
│   ├── processed
│   └── raw
└── output
    ├── plots
    └── tables

Good.

Meaningful subdirectories
Top-level README
Code, data and output are separated

Best Practices

Example: TIER Protocol structure

Best Practices

Best Project Structure?

Note

There is no unique best way to organize your project: Make it simple, intuitive and helpful.

Important

Ideally your entire project is under version control.

Reproducible Code

Question:

How to write reproducible code?

👉 Huge question to answer. Let’s try with a few simple things first:

Provide a run script which…runs everything.
No copy and paste in your pipeline! Write results to disk.
Provide clear instructions.
Provide a clear way to create the required environment (library installation etc.)
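
A run script along the following lines is one way to implement the first point above. It is a minimal sketch under stated assumptions: the function `run_pipeline` and the list `STEPS` are hypothetical names, `Rscript` is assumed to be on the PATH, and the script names are borrowed from the example folder structure shown earlier.

```python
import subprocess
from pathlib import Path

# Hypothetical pipeline: one command per entry, listed in execution order.
# Assumes `Rscript` is available on the PATH.
STEPS = [
    ["Rscript", "code/R/0-install.R"],
    ["Rscript", "code/R/1-main.R"],
]


def run_pipeline(steps, root):
    """Run every step from the package root and stop at the first failure."""
    root = Path(root)
    for number, command in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {' '.join(command)}")
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run(command, cwd=root, check=True)
    return len(steps)
```

Calling `run_pipeline(STEPS, "/location/of/your/package")` would then execute the whole package in one go; commands for other programs can be appended to the list in the same way.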

Reproducible Code

No Manual Manipulation.

Change this parameter to 0.4, then run code again 😖
I computed this number manually 😖😖

Do This!

Use functions, ado files, programs, macros, subroutines etc
Use loops and parameters
Use placeholders for file paths
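
As an illustration of the points above, the sketch below replaces "change this parameter, then run again" with a loop over parameter values whose results are written to disk. Everything here is hypothetical: `simulate` is a stand-in for the real computation, and `run_sensitivity` and the output file name are chosen for the example only.

```python
import csv
from pathlib import Path


def simulate(parameter):
    """Hypothetical stand-in for the real computation."""
    return round(parameter ** 2, 6)


def run_sensitivity(parameters, out_file):
    """Loop over all parameter values and write the results to a CSV file."""
    out_file = Path(out_file)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["parameter", "result"])
        for parameter in parameters:
            writer.writerow([parameter, simulate(parameter)])
    return out_file
```

A call such as `run_sensitivity([0.2, 0.4], "outputs/tables/sensitivity.csv")` produces every variant in one run, so no number has to be changed or copied by hand.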

In general, take all necessary steps to ensure cross-platform compatibility of your code.

file paths are such low-hanging fruit 🍇…

Reproducible Code

File Paths

👉 Ask the user to set the root of your project, via global variable, environment variable, or other

# in my R, I do
Sys.setenv(PACKAGE_ROOT="/Users/floswald/Downloads/your_package")

# your package uses:
file.path(Sys.getenv("PACKAGE_ROOT"), "data", "wages.csv")

* in my Stata, I do
global PACKAGE_ROOT "/Users/floswald/Downloads/your_package"

* your package uses
use "$PACKAGE_ROOT/data/wages.dta"

Always use forward slashes (/) in Stata file paths, even on a Windows machine!
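
The same pattern can be sketched in Python. This is an illustration rather than a prescribed implementation: the helper names `package_root` and `data_path` are hypothetical, and the environment variable name `PACKAGE_ROOT` simply mirrors the R and Stata examples above.

```python
import os
from pathlib import Path


def package_root():
    """Return the project root that the user set in PACKAGE_ROOT."""
    root = os.environ.get("PACKAGE_ROOT")
    if root is None:
        raise RuntimeError("Please set the PACKAGE_ROOT environment variable.")
    return Path(root)


def data_path(*parts):
    """Build a path below <root>/data without hard-coding any separator."""
    return package_root().joinpath("data", *parts)
```

With `PACKAGE_ROOT` set, `data_path("wages.csv")` resolves to the file inside the user's copy of the package, and `pathlib` takes care of the path separator on each operating system.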

Reproducible Code

Safe Environments for Running Your Code

No Guarantee

Your code will yield identical results on a different computer only if certain conditions apply.

Protected Environments

👉 You should provide a mechanism which ensures that those conditions do apply.

Reproducible Code

Safe Environments for Running Your Code

At a minimum, you list your exact computing environment:
OS, software and the exact version used (R 4.1, Stata 17/MP, MATLAB 2023b, GNU Fortran (Homebrew GCC 13.2.0))
Libraries and the exact version used (ggplot2 1.3.4, outreg 2, numpy 1.26.4, boost 1.8.3)
Stata: install all libraries into your replication package.

👉 Virtual Environments can help.
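
For a Python project, the listing of the computing environment can be generated rather than typed by hand. The sketch below is one possible approach; the function name `describe_environment` is hypothetical, and the package names passed to it are up to the author of the replication package.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a plain-text summary of the OS, Python and library versions."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)
```

Printing `describe_environment(["numpy", "matplotlib"])` gives a block of text that could be pasted into the README.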

Reproducible Code

Provide a Virtual Environment

Python via Anaconda:

conda create -n py27 python=2.7 numpy=1.15.4 matplotlib
conda activate py27

There are other virtual environment managers for Python.

R via renv

# in your existing project:
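renv::init()     # creates a project-local library and the lockfile renv.lock
renv::snapshot() # records the installed package versions in renv.lock (like a commit)
renv::restore()  # reinstalls the versions recorded in renv.lock (like a checkout)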

Julia via the built-in Pkg manager:

(@v1.10) pkg> activate .
  Activating new project at `~/CEPII`
  
(CEPII) pkg> add DataFrames GLM
# created 2 files in `~/CEPII` (Project.toml and Manifest.toml)
# which track all dependencies

Docker 🐳 container. This provides a fully specified, isolated computing environment (in effect, a dedicated computer for your project).

Reproducible Code

Stata Virtual Environment

Include a `version xyz` statement in your master script.
User-contributed libraries are not versioned.
Install all libraries alongside your project code. Otherwise, running `ssc install somelib` a few years later may install a newer, incompatible version.
Here is a `_config.do` script forcing Stata to use only libraries installed in a given location.
Excellent guidance by Julian Reif.

* file run.do:
global root "/location/of/your/package"
* _config.do (from the link above) makes Stata use
* the libraries in $root/code/libraries only
do "$root/code/_config.do"
do "$root/code/runanalysis.do"

Reproducible code

Note

Such mechanisms can reduce version conflicts amongst your dependencies. To the extent that all versions of those dependencies are still available, this guarantees a stable computing environment.

Data

Always keep your raw data intact (i.e. read-only).
Generate separate analysis datasets to perform analysis.
Datasets change over time: keep a record of the date on which you obtained each dataset and of its version. The same version might be difficult to obtain in the future.
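
The points above can be partly automated. The following sketch records a checksum and the download date for a raw file and then removes its write permission. It is only one possible approach: the names `sha256_of` and `register_raw_file` and the tab-separated log format are assumptions made for this example.

```python
import hashlib
import stat
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_raw_file(path, obtained_on, log_file):
    """Log name, date obtained and checksum, then make the file read-only."""
    path = Path(path)
    checksum = sha256_of(path)
    # Clear the write bits for user, group and others.
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    path.chmod(path.stat().st_mode & ~write_bits)
    with open(log_file, "a", encoding="utf-8") as log:
        log.write(f"{path.name}\t{obtained_on.isoformat()}\t{checksum}\n")
    return checksum
```

Here `obtained_on` is expected to be a `datetime.date`. If the checksum of a raw file differs from the logged one at a later point, the file has changed since it was obtained.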

Output

Write both tables and figures to disk.
The gold standard: include this table in your README.

| Output in Paper | Output in Package | Program to execute |
|---|---|---|
| Table 1 | `outputs/tables/table1.tex` | `code/table1.do` |
| Figure 1 | `outputs/plots/figure1.pdf` | `code/figure1.do` |
| Figure 2 | `outputs/plots/figure2.pdf` | `code/figure2.do` |

Output

More Details

Run your code from the replication folder before you submit, and make sure that it runs and that all your results are reproduced, ideally on another machine!
Make sure to delete all expected output from the package before you run it, so you can be sure that all output was actually produced.
Ideally, your submitted paper (your \(\LaTeX\) file which produces it) should depend on the output of your replication package, so that if a piece of output is missing, the paper cannot be compiled (or you would quickly spot the mistake).
Help us by submitting your package without any expected output, i.e. with an empty folder outputs/.
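
The check described above (delete all expected output, re-run, confirm that everything was produced) could be scripted roughly as follows. This is a hedged sketch: `clean_outputs` and `missing_outputs` are hypothetical helper names, and `EXPECTED_OUTPUTS` reuses the file names from the example table in the previous section.

```python
import shutil
from pathlib import Path

# File names taken from the example output table; adapt to your own package.
EXPECTED_OUTPUTS = [
    "outputs/tables/table1.tex",
    "outputs/plots/figure1.pdf",
    "outputs/plots/figure2.pdf",
]


def clean_outputs(root, folder="outputs"):
    """Delete the output folder and recreate it empty."""
    target = Path(root) / folder
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True)


def missing_outputs(root, expected=EXPECTED_OUTPUTS):
    """Return the expected output files that do not exist under root."""
    return [name for name in expected if not (Path(root) / name).is_file()]
```

Calling `clean_outputs(root)` before the run and `missing_outputs(root)` after it gives an empty list only if every listed file was actually regenerated.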

Case Studies

Replications

Instructions for Case Studies
