--- title: "Using checkpoint for reproducible research" author: "Hong Ooi and Andrie de Vries" output: rmarkdown::html_vignette: toc: true number_sections: true vignette: > %\VignetteIndexEntry{Using checkpoint for reproducible research} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## The Reproducible R Toolkit (RRT) The **Reproducible R Toolkit** provides tools to ensure the results of R code are repeatable over time, by anyone. Most R scripts rely on packages, but new versions of packages are released daily. To ensure that results from R are reproducible, it's important to run R scripts using exactly the same package version in use when the script was written. The Reproducible R Toolkit provides an R function checkpoint, which ensures that all of the necessary R packages are installed with the correct version. This makes it easy to reproduce your results at a later date or on another system, and makes it easier to share your code with the confidence that others will get the same results you did. The Reproducible R Toolkit also works in conjunction with the "checkpoint-server", which makes a daily copy of all CRAN packages, to guarantee that every package version is available to all R developers thereby ensuring reproducibility. ## Components of RRT RRT is a collection of R packages and the checkpoint server that together make your work with R packages more reproducible over time by anyone. ### The checkpoint server To achieve reproducibility, daily snapshots of CRAN are stored on our checkpoint server. At midnight UTC each day, we refresh our mirror of [CRAN](https://cran.r-project.org/) is refreshed. When the rsync process is complete, the checkpoint server takes and stores a snapshot of the CRAN mirror as it was at that very moment. These daily snapshots can then be accessed on the [MRAN website](https://mran.microsoft.com/timemachine) or using the `checkpoint` package, which installs and consistently use these packages just as they existed at midnight UTC on a specified snapshot date. Daily snapshots are available as far back as `2014-09-17`. For more information, visit the [checkpoint server GitHub site](https://github.com/RevolutionAnalytics/checkpoint-server). ![checkpoint server](checkpoint-server.png) ### The checkpoint package The goal of the `checkpoint` package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, `checkpoint` allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the `checkpoint` package and the checkpoint server act as a CRAN time machine so that anyone using `checkpoint()` can ensure the reproducibility of their scripts or projects at any time. ![checkpoint package](checkpoint-pkg.png) ## Using checkpoint The `checkpoint` package has 3 main functions. The `create_checkpoint` function - Creates a checkpoint directory to install packages. This directory is located underneath `~/.checkpoint` by default, but you can change its location. - Scans your project directory for all packages used. Specifically, it looks in your code for instances of `library()` and `require()` calls, as well as the namespacing operators `::` and `:::`. - Installs these packages from the MRAN snapshot for a specified date, into your checkpoint directory. The `use_checkpoint` function - Sets the CRAN mirror for your R session to point to a MRAN snapshot, i.e. modifies `options(repos)` - Sets your library search path to point to the folder created by `create_checkpoint`, i.e. modifies `.libPaths()` This means the remainder of your script will run with the packages from your specified date. Finally, the `checkpoint` function serves as a unified interface to `create_checkpoint` and `use_checkpoint`. It looks for a pre-existing checkpoint directory, and if not found, creates it with `create_checkpoint`. It then calls `use_checkpoint` to put the checkpoint into use. ## Sharing your scripts and projects for reproducibility Sharing a script to be reproducible is as easy as placing the following snippet at the top: ```r library(checkpoint) checkpoint("2020-01-01") # replace with desired date ``` Then send this script to your collaborators. When they run this script on their machine for the first time, `checkpoint` will perform the same steps of scanning for package dependencies, creating the checkpoint directory, installing the necessary packages, and setting your session to use the checkpoint. On subsequent runs, `checkpoint` will find and use the created checkpoint, so the packages don't have to be installed again. If you have more than one script in your project, you can place the above snippet in every standalone script. Alternatively, you can put it in a script of its own, and run it before running any other script. ### Note on projects The `checkpoint` package is designed to be used with _projects_, which are directories that contain the R code and output associated with the tasks you're working on. If you use RStudio, you will probably be aware of the concept, but the same applies for many other programming editors and IDEs including Visual Studio Code, Notepad++ and Sublime Text. When it is run, `create_checkpoint` scans all R files inside a given project to determine what packages your code requires. The default project is the current directory `"."`. If you do not have an actual project open, this will usually expand to your R user directory (`~/` on Unix/Linux and MacOS, or `C:\Users\\Documents` on Windows). For most people, this means that the function will scan through _all_ the projects they have on their machine, which can lead to checkpointing a very large number of packages. Because of this, you should ensure that you are not in your user directory when you run `checkpoint`. A mitigating factor is that this should happen only once, as long as the checkpoint directory remains intact. ### Checkpointing the R version For an even _more stringent_ form of reproducibility, you can use the following: ```r library(checkpoint) checkpoint("2020-01-01", r_version="3.6.2") # replace with desired date and R version ``` This requires that anyone running the script must be using the specified version of R. The benefit of this is because changes in R over time can affect reproducibility just like changes in third-party packages, so by restricting the script to only one R version, we remove another possible source of variation. However, R itself is usually very stable, and requiring a specific version can be excessively demanding especially in locked-down IT environments. For this reason, specifying the R version is optional. ### Using knitr and rmarkdown with checkpoint `checkpoint` will automatically add the `rmarkdown` package as a dependency if it finds any Rmarkdown-based files (those with extension `.Rmd`, `.Rpres` or `.Rhtml`) in your project. This allows you to continue working with such documents after checkpointing. ## Resetting your session To reset your session to the way it was before checkpointing, call `uncheckpoint()`. Alternatively, you can simply restart R. ## Managing checkpoints To update an existing checkpoint, for example if you need new packages installed, call `create_checkpoint()` again. Any existing packages will remain untouched. The functions `delete_checkpoint()` and `delete_all_checkpoints()` allow you to remove checkpoint directories that are no longer required. They check that the checkpoint(s) in question are not actually in use before deleting. Each time `create_checkpoint()` is run, it saves a series of json files in the main checkpoint directory. These are outputs from the `pkgdepends` package, which `checkpoint` uses to perform the actual package installation, and can help you debug any problems that may occur. 1. `_