Hitchhiker’s 👍 Guide To Reproducibility

Workshop at CEPII Paris

Florian Oswald

SciencesPo Paris, RES Data Editor

26 February, 2024

Agenda

  1. 10 simple rules to Reproducibility compiled by the Econ Data Editors.
  2. The README file.
  3. Some Reproducibility Best Practices.
  4. Three attempts at reproducing published papers.

10 simple rules to Reproducibility

  1. Computational Empathy
  2. Make data accessible
  3. Cite Data and how to access it
  4. Describe software and hardware requirements
  5. Provide all code
  6. Explain how to reproduce your work
  7. Provide a table of all things that can be reproduced
  8. Include all supporting material
  9. Use a permissive license. Any license is better than none.
  10. Re-run everything!

The README File

  1. Plain text top level file which explains everything about your package.
  2. We have a useful template and a template generator.
  3. Here are the minimum requirements for a README at The Economic Journal

Best Practices

  1. Project Organisation (folder structure)
  2. Code
  3. Data
  4. Output

Best Practices

Project Organisation

  • Folder Structure is a first order concern for your project.

Minimum Requirement

There should be a separation along:

  1. Inputs: Data, parameters, etc
  2. Outputs: Numbers, tables, figures
  3. Code
  4. Paper/Report etc

Example?

Best Practices

Good or Bad?


.
├── 20211107ext_2v1.do
├── 20220120ext_2v1.do
├── 20221101wave1.dta
├── james
│   └── NLSY97
│       └── nlsy97_v2.do
├── mary
│   └── NLSY97
│       └── nlsy97.do
├── matlab_fortran
│   ├── graphs
│   ├── sensitivity1
│   │   ├── data.xlsx
│   │   ├── good_version.do
│   │   └── script.m
│   └── sensitivity2
│       ├── models.f90
│       ├── models.mod
│       └── nrtype.f90
├── readme.do
├── scatter1.eps
├── scatter1_1.eps
├── scatter1_2.eps
├── ts.eps
├── wave1.dta
├── wave2.dta
├── wave2regs.dta
└── wave2regs2.dta

(scroll down! 😉)

Bad! 👎

  • Sub directories are not helpful
  • File names are confusing
  • code/data/output are not separated

Best Practices

Good 👍


.
├── README.md
├── code
│   ├── R
│   │   ├── 0-install.R
│   │   ├── 1-main.R
│   │   ├── 2-figure2.R
│   │   └── 3-table2.R
│   ├── stata
│   │   ├── 1-main.do
│   │   ├── 2-read_raw.do
│   │   ├── 3-figure1.do
│   │   ├── 4-figure3.do
│   │   └── 5-table1.do
│   └── tex
│       ├── appendix.tex
│       └── main.tex
├── data
│   ├── processed
│   └── raw
└── output
    ├── plots
    └── tables


Good.

  • Meaningful sub directories
  • top level README
  • code/data/output are separated
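
A minimal sketch of how such a skeleton could be created programmatically. The folder names simply mirror the good example above, and `create_skeleton` is a hypothetical helper, not part of any template:

```python
from pathlib import Path
import tempfile

# Hypothetical folder names, mirroring the "good" example above.
SKELETON = [
    "code",
    "data/raw",
    "data/processed",
    "output/plots",
    "output/tables",
]

def create_skeleton(root):
    """Create the project folders and a stub README.md under `root`."""
    root = Path(root)
    for folder in SKELETON:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Replication package\n")
    # Return everything that now exists, relative to the root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

# Demo in a throwaway directory, so nothing is written to your project.
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(tmp)
    print(created)
```

Starting every project from the same skeleton makes the separation of code, data and output the default rather than an afterthought.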

Best Practices

Example: TIER Protocol structure

Best Practices

Best Project Structure?


Note

There is no unique best way to organize your project: Make it simple, intuitive and helpful.


Important

Ideally your entire project is under version control.

Reproducible Code

Question:

How to write reproducible code?

👉 Huge question to answer. Let’s try with a few simple things first:

  1. Provide a run script which… runs everything.
  2. No copy and paste in your pipeline! Write results to disk.
  3. Clear instructions
  4. Provide a clear way to create the required environment (library installation etc)
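
A run script can be very short. The sketch below assumes a pipeline made of Python scripts; the step names `1-clean.py` and `2-table.py` and the helper `run_all` are hypothetical. Each step writes its result to disk and the next step reads it from there, so nothing is copied by hand:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_all(steps, cwd):
    """Run each script in `steps` in order; stop at the first failure."""
    completed = []
    for step in steps:
        print(f"running {step}")
        # check=True raises CalledProcessError if a step exits with an error.
        subprocess.run([sys.executable, step], cwd=cwd, check=True)
        completed.append(step)
    return completed

# Demo with two hypothetical steps in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "1-clean.py").write_text(
        "open('clean.txt', 'w').write('3')\n"
    )
    (root / "2-table.py").write_text(
        "x = int(open('clean.txt').read())\n"
        "open('table1.txt', 'w').write(str(x * 2))\n"
    )
    done = run_all(["1-clean.py", "2-table.py"], cwd=root)
    result = (root / "table1.txt").read_text()
    print(result)
```

The same idea carries over to a master .do file in Stata or a main.R in R: one entry point, a fixed order, and a hard stop on errors.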

Reproducible Code

No Manual Manipulation.

  • Change this parameter to 0.4, then run code again 😖
  • I computed this number manually 😖😖

Do This!

  • Use functions, ado files, programs, macros, subroutines etc
  • Use loops and parameters
  • Use placeholders for file paths
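
As an illustration of functions, loops and parameters replacing manual edits, here is a sketch with a toy model. `simulate` is a stand-in for a real analysis and the values 0.4 and 0.5 are arbitrary:

```python
import csv
import tempfile
from pathlib import Path

def simulate(beta, periods=3):
    """Toy model: discounted sum of a constant payoff of 1."""
    return sum(beta ** t for t in range(periods))

def run_sensitivity(betas, out_file):
    """Evaluate the model for each parameter value and write all results to disk."""
    rows = [(beta, simulate(beta)) for beta in betas]
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["beta", "value"])
        writer.writerows(rows)
    return rows

# Demo: both parameter values are handled in one run, no manual editing.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sensitivity.csv"
    rows = run_sensitivity([0.4, 0.5], out)
    lines = out.read_text().splitlines()
    print(lines)
```

Instead of "change this parameter to 0.4, then run again", the parameter values live in one list and every result lands in a file.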

In general, take all necessary steps to ensure cross-platform compatibility of your code.

File paths are such low-hanging fruit 🍇…

Reproducible Code

File Paths

👉 Ask the user to set the root of your project once, via a global variable, an environment variable, or a similar mechanism.

# in my R, I do
Sys.setenv(PACKAGE_ROOT="/Users/floswald/Downloads/your_package")

# your package uses:
file.path(Sys.getenv("PACKAGE_ROOT"), "data", "wages.csv")


* in my Stata, I do
global PACKAGE_ROOT "/Users/floswald/Downloads/your_package"

* your package uses
use "$PACKAGE_ROOT/data/wages.dta"

Always use forward slashes (/) in Stata file paths, even on a Windows machine!
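
The same pattern in Python, as a sketch. It reuses the environment variable name `PACKAGE_ROOT` from the R example; `package_path` is a hypothetical helper:

```python
import os
from pathlib import Path

def package_path(*parts):
    """Build a path below the project root given by PACKAGE_ROOT."""
    root = os.environ.get("PACKAGE_ROOT")
    if root is None:
        raise RuntimeError("Please set the PACKAGE_ROOT environment variable first.")
    return Path(root).joinpath(*parts)

# The user sets the root once (hypothetical location) ...
os.environ["PACKAGE_ROOT"] = "/Users/floswald/Downloads/your_package"

# ... and every script builds its paths relative to it.
wages = package_path("data", "wages.csv")
print(wages.as_posix())
```

Building paths from components, rather than pasting strings together, leaves the choice of separator to the library and keeps the code portable across operating systems.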

Reproducible Code

Safe Environments for Running Your Code

No Guarantee

Your code will yield identical results on a different computer only if certain conditions apply.

Protected Environments

👉 You should provide a mechanism which ensures that those conditions do apply.

Reproducible Code

Safe Environments for Running Your Code

  • At a minimum, you list your exact computing environment:

  • OS and software, with the exact version used (R 4.1, Stata 17/MP, MATLAB R2023b, GNU Fortran (Homebrew GCC 13.2.0))

  • Libraries, with the exact version used (ggplot2 3.4.4, outreg2, numpy 1.26.4, boost 1.83.0)

  • Stata: install all libraries into your replication package.

👉 Virtual Environments can help.
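
The minimum listing of OS, software and library versions does not have to be typed by hand. A sketch for a Python project, where the package names passed in are only examples and `describe_environment` is a hypothetical helper:

```python
import platform
import sys
from importlib import metadata

def describe_environment(packages):
    """Return OS, Python version and installed versions of `packages` as text lines."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines

# Example package names; replace with the libraries your project imports.
report = describe_environment(["numpy", "pip"])
print("\n".join(report))
```

The printed lines can be pasted into the README section on software requirements.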

Reproducible Code

Provide a Virtual Environment

Python via Anaconda:

conda create -n py27 python=2.7 numpy=1.15.4 matplotlib
conda activate py27

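Python has other virtual environment managers too (for example the built-in venv module).

R via renv:

# in your existing project:
renv::init()     # creates a project-local library and renv.lock
renv::snapshot() # records current package versions in renv.lock (like a commit)
renv::restore()  # reinstalls the versions recorded in renv.lock (like a checkout)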

Julia via its built-in package manager Pkg:

(@v1.10) pkg> activate .
  Activating new project at `~/CEPII`
  
(CEPII) pkg> add DataFrames GLM
# created 2 files in `~/CEPII`
# tracking all dependencies

Docker 🐳 container: this provides a fully specified, isolated environment (in effect, a dedicated computer for your project).

Reproducible Code

Stata Virtual Environment

  1. Include a version xyz statement in your master script.
  2. User-contributed libraries are not versioned.
  3. You must install all libraries next to your project code. If not, ssc install somelib may install an incompatible version a few years later.
  4. Here is a _config.do script forcing Stata to use only libraries installed in a given location.
  5. Excellent guidance by Julian Reif
* file run.do:
global root "/location/of/your/package"
* _config.do (from the link above) makes Stata use only
* the libraries in $root/code/libraries
do "$root/code/_config.do"
do "$root/code/runanalysis.do"

Reproducible code

Note

Such mechanisms can reduce version conflicts amongst your dependencies. To the extent that all versions of those dependencies are still available, this guarantees a stable computing environment.

Data

  • Always keep your raw data intact (i.e. read-only).
  • Generate separate analysis datasets to perform analysis.
  • Datasets change over time: keep a record of the date you obtained each dataset and of its version. The same version might be difficult to obtain in the future.
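
One possible way to implement these points, sketched in Python. The record layout, the file name `wave1.csv` and the helper `register_raw_file` are assumptions for illustration only:

```python
import hashlib
import os
import stat
import tempfile
from datetime import date
from pathlib import Path

def register_raw_file(path, source, obtained=None):
    """Make a raw data file read-only and return a provenance record for it."""
    path = Path(path)
    # The checksum lets anyone verify later that the file is unchanged.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    # Remove write permission so the raw file is not overwritten by accident.
    os.chmod(path, stat.S_IREAD)
    return {
        "file": path.name,
        "source": source,
        "obtained": (obtained or date.today()).isoformat(),
        "sha256": digest,
    }

# Demo on a tiny made-up file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "wave1.csv"
    raw.write_bytes(b"id,wage\n1,10\n")
    record = register_raw_file(raw, source="hypothetical provider",
                               obtained=date(2024, 2, 26))
    # Restore write permission so the demo directory can be deleted.
    os.chmod(raw, stat.S_IREAD | stat.S_IWRITE)
    print(record)
```

Storing such records next to the raw data (or in the README) documents which version of a dataset the analysis actually used.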

Output

  • Write both tables and figures to disk.
  • The gold standard: include this table in your readme.

Output in Paper    Output in Package            Program to execute
---------------    -------------------------    ------------------
Table 1            outputs/tables/table1.tex    code/table1.do
Figure 1           outputs/plots/figure1.pdf    code/figure1.do
Figure 2           outputs/plots/figure2.pdf    code/figure2.do

Output

More Details

  • Run your code from the replication folder before you submit, and make sure that it runs and that all your results are reproduced - ideally on another machine!
  • Make sure to delete all expected output from the package before you run it, so you can be sure that all output was actually produced.
  • Ideally, your submitted paper (your \(\LaTeX\) file which produces it) should depend on the output of your replication package, so that if a piece of output is missing, the paper cannot be compiled (or you would quickly spot the mistake).
  • Help us by submitting your package without any expected output, i.e. with an empty folder outputs/.
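
Checking that a clean run really produced everything can be automated. The sketch below reuses the three entries of the README table from the Output slide; `missing_outputs` is a hypothetical helper:

```python
import tempfile
from pathlib import Path

# Output map taken from the README table: paper item -> file in package.
EXPECTED = {
    "Table 1": "outputs/tables/table1.tex",
    "Figure 1": "outputs/plots/figure1.pdf",
    "Figure 2": "outputs/plots/figure2.pdf",
}

def missing_outputs(root, expected=EXPECTED):
    """Return the paper items whose output file does not exist under `root`."""
    root = Path(root)
    return [item for item, rel in expected.items() if not (root / rel).is_file()]

# Demo: only Table 1 was produced, so both figures are reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "outputs/tables/table1.tex"
    table.parent.mkdir(parents=True)
    table.write_text("\\begin{tabular}{c} 1 \\end{tabular}\n")
    missing = missing_outputs(tmp)
    print(missing)
```

Run such a check after emptying outputs/ and executing the run script: an empty list means every item in the table was regenerated.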

Case Studies

Replications

Instructions for Case Studies

End