Core concepts

This article explains the key concepts behind projr’s approach to reproducible research.

Single-purpose directories

The problem

Research projects often have an ad-hoc structure where:

Data files are scattered across multiple locations
Output files mix with source code
Temporary files clutter the repository
It’s unclear what each directory contains

This makes it difficult to:

Share specific parts of the project
Restore the project on a new machine
Understand the project structure at a glance

The projr solution

projr organises projects into single-purpose directories, each with a clear role:

my-project/
├── _raw_data/          # Source data (never modified)
├── _output/            # Final outputs (figures, tables)
├── _tmp/               # Temporary/cache files
├── docs/               # Rendered documents (HTML, PDF)
├── R/                  # Source code
├── analysis.Rmd        # Analysis documents
└── _projr.yml          # Configuration

Benefits:

Clarity: Anyone can understand the structure from _projr.yml
Selective sharing: Archive only what’s needed (e.g., data + outputs)
Restoration: Easily restore projects on new machines
Automation: projr can manage these directories automatically

Directory labels

Each directory has a label that describes its purpose:

raw-data → source data
cache → temporary files
output → final outputs
docs → rendered documents

Labels have prefixes that determine their role:

raw-* → source inputs
cache-* → temporary storage
output-* → final outputs
docs-* → documentation

Custom labels:

You can create custom labels following the prefix rules:

directories:
  raw-data-public:
    path: _raw_data_public
  raw-data-sensitive:
    path: _raw_data_sensitive
  output-figures:
    path: _output/figures
  output-tables:
    path: _output/tables
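The prefix rule can be sketched in a few lines of base R. The helper `label_role()` below is hypothetical and not part of projr; it only illustrates how a label such as `raw-data-sensitive` maps to a role through its prefix.

```r
# Hypothetical helper (not a projr function): map a directory label to its
# role using the prefixes listed above.
label_role <- function(label) {
  roles <- c(
    raw = "source inputs",
    cache = "temporary storage",
    output = "final outputs",
    docs = "documentation"
  )
  # The prefix is everything before the first hyphen (or the whole label).
  prefix <- sub("-.*$", "", label)
  if (!prefix %in% names(roles)) {
    stop("Label '", label, "' does not start with a recognised prefix")
  }
  unname(roles[prefix])
}

label_role("raw-data-sensitive")
label_role("output-figures")
label_role("cache")
```

A label without a recognised prefix, such as `figures`, raises an error in this sketch; how projr itself reacts to such a label is not covered here.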

Versioned builds

The problem

In typical research projects:

It’s unclear which version of data produced which outputs
Reproducing an earlier analysis requires Git archaeology
Dependencies (R packages) change over time
No formal link between code version and outputs

The projr solution

projr implements versioned builds that:

Assign a version to the entire project (e.g., v0.1.0)
Version individual components (code, data, outputs)
Link outputs to specific input versions via a manifest
Optionally capture R dependencies with renv

Version numbers:

projr uses semantic versioning (x.y.z):

Major (x): Breaking changes or major milestones
Minor (y): New features or analyses
Patch (z): Small changes or bug fixes

Example progression:

v0.1.0  Initial analysis
  ↓
v0.1.1  Fix typo in figure
  ↓
v0.2.0  Add sensitivity analysis
  ↓
v1.0.0  Final publication version
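To make the bump rules concrete, here is a minimal base R sketch that reproduces the progression above. `bump_version()` is a hypothetical helper written for this illustration, not projr's own implementation.

```r
# Hypothetical helper (not projr's implementation): bump one component of an
# x.y.z version string and reset the components to its right.
bump_version <- function(version, component = c("patch", "minor", "major")) {
  component <- match.arg(component)
  parts <- as.integer(strsplit(sub("^v", "", version), ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  idx <- match(component, c("major", "minor", "patch"))
  parts[idx] <- parts[idx] + 1L
  if (idx < 3) parts[(idx + 1):3] <- 0L
  paste0("v", paste(parts, collapse = "."))
}

v <- "v0.1.0"
v <- bump_version(v, "patch")  # fix typo in figure
v <- bump_version(v, "minor")  # add sensitivity analysis
v <- bump_version(v, "major")  # final publication version
v
```

Resetting the lower components on a minor or major bump is what turns v0.1.1 into v0.2.0 rather than v0.2.1.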

Component versioning

Each build versions:

Code: Git commit SHA
Raw data: Directory hash (SHA-256)
Outputs: Directory hash
Documents: Directory hash
R packages: renv.lock file (optional)

These are linked in the manifest:

manifest.csv (schematic):
  version: v0.1.0
  code_sha: abc123...
  raw_data_hash: def456...
  output_hash: ghi789...
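The idea of a directory hash can be illustrated with base R alone. The sketch below is an assumption about the general technique, not projr's actual hashing code: `dir_fingerprint()` is a hypothetical helper, and MD5 from `tools::md5sum()` stands in for whichever algorithm projr uses.

```r
# Hypothetical helper: fingerprint a directory by hashing every file, then
# hashing the sorted list of "relative/path hash" lines. MD5 (base R) is a
# stand-in here for whichever algorithm projr actually uses.
dir_fingerprint <- function(path) {
  files <- sort(list.files(path, recursive = TRUE))
  hashes <- unname(tools::md5sum(file.path(path, files)))
  listing <- tempfile()
  on.exit(unlink(listing))
  writeLines(paste(files, hashes), listing)
  unname(tools::md5sum(listing))
}

# Demonstration in a temporary directory
raw_dir <- file.path(tempfile(), "_raw_data")
dir.create(raw_dir, recursive = TRUE)
writeLines("id,value\n1,10", file.path(raw_dir, "data.csv"))

before <- dir_fingerprint(raw_dir)
unchanged <- dir_fingerprint(raw_dir)
writeLines("id,value\n1,11", file.path(raw_dir, "data.csv"))  # edit the data
after <- dir_fingerprint(raw_dir)

identical(before, unchanged)  # same content, same fingerprint
identical(before, after)      # content changed, fingerprint differs
```

Because the fingerprint depends only on relative paths and file contents, the same data restored on another machine yields the same value.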

Traceability:

Given an output file, you can trace back to:

Which project version created it
Which raw data was used
Which code (Git commit) generated it
Which R packages were installed

Development vs final builds

Development builds

Purpose: Safe iteration whilst coding

projr_build_dev()

Characteristics:

Routes outputs to cache (_tmp/projr/v<version>/)
Doesn’t modify _output or docs
No version bump
No archiving
Fast feedback loop

Use when:

Testing code changes
Debugging analysis
Iterating on documents
Checking output before committing

Final builds

Purpose: Create versioned releases

projr_build()        # patch bump
projr_build_minor()  # minor bump
projr_build_major()  # major bump

Characteristics:

Clears and populates _output and docs
Bumps version number
Creates manifest
Archives to GitHub/OSF/local
Commits to Git (if configured)

Use when:

Ready to share results
Creating a milestone
Publishing results
Archiving for posterity
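The difference in where the two build types write can be summarised in a small sketch. `build_output_path()` is a hypothetical helper that only mirrors the paths described in this article; it does not call projr.

```r
# Hypothetical helper (not a projr function): where a build writes the
# contents of the `output` directory, following the layout described here.
build_output_path <- function(type = c("dev", "final"), version) {
  type <- match.arg(type)
  if (type == "dev") {
    # Dev builds stay inside the versioned cache directory
    file.path("_tmp", "projr", paste0("v", version), "output")
  } else {
    # Final builds populate the real output directory
    "_output"
  }
}

build_output_path("dev", "0.1.0")
build_output_path("final", "0.1.0")
```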

Build clearing behaviour

Overview

projr manages when and how output directories are cleared during builds to ensure clean, reproducible results while preventing accidental data loss.

Clearing modes

The clear_output parameter controls when directories are cleared:

# Clear before build (default)
projr_build_patch(clear_output = "pre")

# Clear after build
projr_build_patch(clear_output = "post")

# Never clear automatically
projr_build_patch(clear_output = "never")

You can also set this globally via an environment variable:

Sys.setenv(PROJR_CLEAR_OUTPUT = "post")
projr_build_patch()  # Uses "post" mode

Mode comparison

Mode	When cleared	Safe for iteration?	Use case
`"pre"`	Before build starts	✓ Yes	Default: Clean slate, save directly to final locations
`"post"`	After build completes	✓ Yes	Conservative: Preserve outputs until after successful build
`"never"`	Never	✗ Manual control needed	Advanced: User manages all clearing

What gets cleared?

In development builds (projr_build_dev()):

Cache directories are cleared
Final _output and docs directories are never touched
Safe for rapid iteration

In final builds (projr_build(), etc.):

Mode "pre":

Before build: Clears cache AND final _output directories
During build: Renders to final locations directly
After build: Final outputs contain only current build results

Mode "post":

Before build: Clears cache only (preserves final _output)
During build: Renders to cache
After build: Clears final directories, then copies from cache

Mode "never":

No automatic clearing
New files may mix with old files
Requires manual directory management
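The three modes can be compared with a small simulation that uses only temporary directories and plain file operations. `simulate_build()` is a hypothetical stand-in written for this illustration; it follows the sequence of steps described above but does not call projr.

```r
# Simulation only: `simulate_build()` is a hypothetical stand-in that mimics
# the three clearing modes with plain file operations. It does not call projr.
simulate_build <- function(clear_output = c("pre", "post", "never")) {
  clear_output <- match.arg(clear_output)
  root <- tempfile("proj")
  final <- file.path(root, "_output")
  cache <- file.path(root, "_tmp", "output")
  dir.create(final, recursive = TRUE)
  dir.create(cache, recursive = TRUE)
  writeLines("stale", file.path(final, "old_figure.txt"))  # previous build

  clear_dir <- function(d) {
    unlink(list.files(d, full.names = TRUE), recursive = TRUE)
  }

  if (clear_output == "pre") {
    clear_dir(final)                                         # clear first
    writeLines("fresh", file.path(final, "new_figure.txt"))  # render in place
  } else if (clear_output == "post") {
    writeLines("fresh", file.path(cache, "new_figure.txt"))  # render to cache
    clear_dir(final)                                         # clear on success
    file.copy(list.files(cache, full.names = TRUE), final)   # then copy over
  } else {
    writeLines("fresh", file.path(final, "new_figure.txt"))  # nothing cleared
  }
  sort(list.files(final))
}

simulate_build("pre")
simulate_build("post")
simulate_build("never")
```

In this simulation only the "never" mode leaves the stale file next to the new one, which is the mixing of old and new files noted above.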

Example workflows

Workflow 1: Default (clear before)

# Clean slate approach
projr_build_patch(clear_output = "pre")

# 1. Clears _output/ completely
# 2. Builds analysis
# 3. Saves outputs directly to _output/
# 4. Archives to remotes (if configured)

Workflow 2: Conservative (clear after)

# Safety-first approach
projr_build_patch(clear_output = "post")

# 1. Keeps existing _output/ intact
# 2. Builds to cache (_tmp/projr/v0.0.1/)
# 3. After successful build, clears _output/
# 4. Copies from cache to _output/
# 5. Archives to remotes (if configured)

Workflow 3: Manual control

# Advanced: Manual clearing
unlink("_output", recursive = TRUE)  # Clear manually
projr_build_patch(clear_output = "never")

# 1. No automatic clearing
# 2. Builds to specified locations
# 3. User responsible for cleanup

Special: “old” directory preservation

The cache clearing process preserves a special old/ subdirectory for archival purposes:

_tmp/projr/v0.0.1/
├── output/           # Cleared each build
├── docs/             # Cleared each build
└── old/              # Never cleared automatically
    └── archived_results.csv

This allows you to manually preserve important intermediate results across builds.
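A sketch of the preservation rule, assuming the cache layout shown above: `clear_cache_keep_old()` is a hypothetical helper that removes everything in a cache directory except `old/`. projr's own clearing logic may differ in detail (for example, it may empty `output/` and `docs/` rather than remove them).

```r
# Hypothetical helper: remove everything in a cache directory except its
# `old/` subfolder.
clear_cache_keep_old <- function(cache_dir) {
  entries <- list.files(cache_dir, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  to_remove <- entries[basename(entries) != "old"]
  unlink(to_remove, recursive = TRUE)
  invisible(basename(to_remove))
}

# Demonstration with a temporary cache that mirrors the tree above
cache_dir <- file.path(tempfile(), "_tmp", "projr", "v0.0.1")
for (sub in c("output", "docs", "old")) {
  dir.create(file.path(cache_dir, sub), recursive = TRUE)
}
writeLines("x", file.path(cache_dir, "output", "figure.txt"))
writeLines("y", file.path(cache_dir, "old", "archived_results.csv"))

clear_cache_keep_old(cache_dir)
list.files(cache_dir, recursive = TRUE)
```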

Manifests

What is a manifest?

A manifest is a CSV file that records:

What was built
Which inputs were used
When it was built
SHA-256 checksums for verification

Example manifest:

label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
output,_output,def456...,v0.1.0,2024-01-15T10:35:00Z
docs,docs,ghi789...,v0.1.0,2024-01-15T10:35:00Z
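Since the manifest is plain CSV, it can be inspected with base R. The snippet below reads the example rows from inline text (so it runs without a project) and looks up the recorded hash for one label and version; the column names are those of the example above.

```r
# The example manifest, read from inline text so the snippet is self-contained
manifest <- read.csv(
  text = c(
    "label,path,hash,version,timestamp",
    "raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z",
    "output,_output,def456...,v0.1.0,2024-01-15T10:35:00Z",
    "docs,docs,ghi789...,v0.1.0,2024-01-15T10:35:00Z"
  ),
  stringsAsFactors = FALSE
)

# Which hash was recorded for the raw data in v0.1.0?
recorded <- manifest[manifest$label == "raw-data" & manifest$version == "v0.1.0", ]
recorded$hash

# The timestamps are ISO 8601 in UTC, so they parse with an explicit format
built_at <- as.POSIXct(manifest$timestamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
range(built_at)
```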

Why manifests matter

Verification: Check if files have been modified

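# Compute the file's current hash. It is only comparable if the algorithm
# matches the one recorded in the manifest (md5sum yields MD5, not SHA-256).
tools::md5sum("_output/figure.png")
# Then compare the result against the hash in the manifest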

Traceability: Link outputs to inputs

“This figure was generated from raw data version abc123… using code commit def456…”

Reproducibility: Ensure the right inputs

“To reproduce this analysis, restore raw-data v0.1.0”

Manifest location

Manifests are stored in:

manifest.csv at project root (tracks all versions)
Dev builds also create temporary manifests in cache

Querying manifests

projr provides functions to query the manifest and track changes across versions:

Find what changed between versions:

# See what files changed between v0.0.1 and v0.0.2
changes <- projr_manifest_changes("0.0.1", "0.0.2")
# Returns: label, fn, change_type (added/modified/removed), hash_from, hash_to

# Filter by directory
output_changes <- projr_manifest_changes("0.0.1", "0.0.2", label = "output")

Track files across a version range:

# See all files and when they last changed from v0.0.1 to current
file_history <- projr_manifest_range("0.0.1")
# Returns: label, fn, version_first, version_last_change, hash

Check when directories last changed:

# See most recent changes for each directory
last_changes <- projr_manifest_last_change()
# Returns: label, version_last_change, n_files
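To show what the added/modified/removed classification means, here is a base R sketch of the comparison on a toy per-file manifest. Both the toy columns and `manifest_diff()` are assumptions made for illustration; they are not projr's schema or implementation.

```r
# Toy per-file manifest; the columns (label, fn, version, hash) are assumed
# for illustration and may not match projr's real manifest schema.
manifest <- data.frame(
  label   = "output",
  fn      = c("fig1.png", "table1.csv", "fig1.png", "fig2.png"),
  version = c("v0.0.1", "v0.0.1", "v0.0.2", "v0.0.2"),
  hash    = c("aaa", "bbb", "ccc", "ddd"),
  stringsAsFactors = FALSE
)

# Hypothetical re-implementation of the comparison idea in base R
manifest_diff <- function(manifest, from, to) {
  keep <- c("label", "fn", "hash")
  both <- merge(
    manifest[manifest$version == from, keep],
    manifest[manifest$version == to, keep],
    by = c("label", "fn"), all = TRUE, suffixes = c("_from", "_to")
  )
  both$change_type <- ifelse(
    is.na(both$hash_from), "added",
    ifelse(is.na(both$hash_to), "removed",
      ifelse(both$hash_from != both$hash_to, "modified", "unchanged"))
  )
  both[both$change_type != "unchanged", ]
}

changes <- manifest_diff(manifest, "v0.0.1", "v0.0.2")
changes
```

A file present in only one of the two versions has a missing hash on the other side of the full outer join, which is how the sketch tells added and removed files apart from modified ones.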

Archiving and restoration

Archive strategies

projr supports two archiving strategies:

1. Versioned archives (default)

Each build creates a new archive:

GitHub Releases:
  v0.1.0/raw-data-v0.1.0.zip
  v0.1.1/raw-data-v0.1.1.zip
  v0.2.0/raw-data-v0.2.0.zip

Preserves all versions
Enables time travel
Uses more storage

2. Latest-only archives

Each build overwrites the previous:

GitHub Releases:
  latest/raw-data-latest.zip  (always current)

Saves storage space
Loses history
Faster downloads

Configure in _projr.yml:

build:
  github:
    raw-data:
      content: [raw-data]
      strategy: "archive"  # or "latest"
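The naming difference between the two strategies can be written as a one-line rule. `archive_asset()` is a hypothetical helper that reproduces the paths shown in the listings above; the real asset names are determined by projr.

```r
# Hypothetical helper: asset path for a label under each strategy, following
# the naming shown in the listings above.
archive_asset <- function(label, version, strategy = c("archive", "latest")) {
  strategy <- match.arg(strategy)
  tag <- if (strategy == "archive") version else "latest"
  sprintf("%s/%s-%s.zip", tag, label, tag)
}

archive_asset("raw-data", "v0.1.1", "archive")
archive_asset("raw-data", "v0.1.1", "latest")
```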

Archive cues

Control when archives are uploaded:

"always" - Upload on every build
"new" - Upload only if content changed (default)
"never" - Don’t upload

build:
  github:
    raw-data:
      content: [raw-data]
      cue: "new"  # Only upload if raw data changed
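The cue logic amounts to a small decision function. The sketch below assumes that "content changed" is judged by comparing hashes; `should_upload()` is a hypothetical helper, not a projr function.

```r
# Hypothetical helper: decide whether to upload, given the cue and the
# content hashes of the current and previously archived versions.
should_upload <- function(cue = c("new", "always", "never"),
                          hash_now, hash_archived = NA_character_) {
  cue <- match.arg(cue)
  switch(cue,
    always = TRUE,
    never  = FALSE,
    # "new": upload when nothing was archived yet or the content differs
    new    = is.na(hash_archived) || !identical(hash_now, hash_archived)
  )
}

should_upload("new", hash_now = "abc", hash_archived = "abc")
should_upload("new", hash_now = "abc", hash_archived = "xyz")
should_upload("always", hash_now = "abc", hash_archived = "abc")
```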

Restoration

Restoration pulls archived content back:

# Restore from GitHub
projr_restore_repo("owner/repo")

# Restore specific version
projr_restore(label = "raw-data", version = "v0.1.0")

# Restore specific label
projr_restore(label = "output")

Restoration order:

projr checks sources in order:

1. GitHub Releases
2. OSF
3. Local archives

The first available source is used.
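The "first available source" rule is an ordered search. In the sketch below the availability checks are stubs invented for the example (no network access), and `first_available()` is a hypothetical helper rather than part of projr.

```r
# Hypothetical sketch: each source is a function returning TRUE when it holds
# the requested content. The checks here are stubs, not real network calls.
sources <- list(
  github = function(label) FALSE,               # pretend no release asset exists
  osf    = function(label) label == "raw-data", # pretend OSF holds raw data only
  local  = function(label) TRUE                 # pretend a local archive has everything
)

first_available <- function(label, sources) {
  # The list order encodes the search order
  for (name in names(sources)) {
    if (isTRUE(sources[[name]](label))) return(name)
  }
  NA_character_  # nothing found anywhere
}

first_available("raw-data", sources)
first_available("output", sources)
```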

Profiles

What are profiles?

Profiles are alternative configurations for different contexts:

Development vs production
Public vs private sharing
Individual vs collaborative workflows

How profiles work

A profile is a separate YAML file:

_projr.yml         # Default configuration
_projr-dev.yml     # Development profile
_projr-public.yml  # Public sharing profile

Activate a profile:

Sys.setenv(PROJR_PROFILE = "dev")

Or in .Renviron:

PROJR_PROFILE=dev
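The link between the environment variable and the file name can be sketched as follows. `active_profile_file()` is a hypothetical helper that only applies the `_projr-<profile>.yml` naming pattern shown above; how projr resolves profiles internally may involve more rules.

```r
# Hypothetical helper: which profile file applies on top of _projr.yml,
# based on the PROJR_PROFILE environment variable.
active_profile_file <- function() {
  profile <- Sys.getenv("PROJR_PROFILE", unset = "")
  if (!nzchar(profile)) {
    return(NULL)  # no profile set: only _projr.yml is used
  }
  paste0("_projr-", profile, ".yml")
}

Sys.setenv(PROJR_PROFILE = "dev")
active_profile_file()

Sys.unsetenv("PROJR_PROFILE")
active_profile_file()
```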

Profile inheritance

Profiles only specify differences from _projr.yml:

_projr.yml:

build:
  github:
    enabled: true
    raw-data:
      content: [raw-data]
directories:
  output:
    path: _output

_projr-dev.yml:

build:
  github:
    enabled: false  # Override: disable GitHub in dev
# All other settings inherited from _projr.yml
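Inheritance of this kind behaves like a recursive list merge. The sketch below writes the two files as nested R lists and merges them with `utils::modifyList()`, which is used here as a stand-in; it is an assumption that projr's merge behaves the same way for every key.

```r
# The two YAML files above, written as nested R lists so the example needs no
# YAML parser. utils::modifyList() is used as a stand-in for projr's merge.
default_cfg <- list(
  build = list(github = list(
    enabled = TRUE,
    `raw-data` = list(content = "raw-data")
  )),
  directories = list(output = list(path = "_output"))
)
dev_cfg <- list(build = list(github = list(enabled = FALSE)))

merged <- utils::modifyList(default_cfg, dev_cfg)

merged$build$github$enabled             # overridden by the dev profile
merged$build$github$`raw-data`$content  # inherited from _projr.yml
merged$directories$output$path          # inherited from _projr.yml
```

Only the key named in the profile changes; sibling keys at every level of nesting are carried over from the default configuration.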

Example profiles

Development profile (_projr-dev.yml):

build:
  github:
    enabled: false
  git:
    commit: false

Public sharing profile (_projr-public.yml):

directories:
  raw-data:
    path: _raw_data_public  # Only public data
build:
  github:
    raw-data:
      content: [raw-data]  # Share public data only

Environment variables

Configuration via environment

projr reads these environment variables:

PROJR_PROFILE

Activate a profile:

export PROJR_PROFILE=dev

PROJR_CLEAR_OUTPUT

Control when _output is cleared:

"pre" - Clear before build (default)
"post" - Clear after build
"never" - Never clear

export PROJR_CLEAR_OUTPUT=pre

PROJR_CACHE_CLEAR

Control cache clearing (same options as above).

Setting environment variables

In R:

Sys.setenv(PROJR_PROFILE = "dev")
Sys.setenv(PROJR_CLEAR_OUTPUT = "pre")

In .Renviron:

PROJR_PROFILE=dev
PROJR_CLEAR_OUTPUT=pre

Helper function:

projr_env_set(
  profile = "dev",
  output_clear = "pre"
)

Dependencies and renv

Why renv?

R package versions change over time, so code that works today might break in six months after a package update.

renv locks package versions:

renv.lock:
  {
    "R": {"Version": "4.3.0"},
    "Packages": {
      "dplyr": {"Version": "1.1.0"},
      "ggplot2": {"Version": "3.4.0"}
    }
  }

projr + renv

projr integrates with renv:

Initialise:

projr_init_renv()

Update lockfile:

projr_renv_update()  # Wrapper for renv::snapshot()

Restore packages:

projr_renv_restore()  # Wrapper for renv::restore()

When to use renv

Use renv when:

Long-term reproducibility is critical
Collaborating across machines/time
Submitting to journals requiring reproducibility

Skip renv when:

The project is quick and exploratory
Sharing code is a low priority
Package versions are stable

The whole game

Putting it all together:

# 1. Initialise
projr_init()

# 2. Add raw data to _raw_data/

# 3. Write analysis code in .Rmd files

# 4. Iterate with dev builds
projr_build_dev("analysis.Rmd")
# Check outputs in _tmp/projr/v0.0.1/

# 5. When ready, create first release
projr_build()
# Outputs in _output/, archived to GitHub (if configured)

# 6. Continue development
# ... edit code ...
projr_build_dev()

# 7. Create minor release with new analysis
projr_build_minor()

# 8. Share repository
# Collaborators run:
projr_restore_repo("you/your-project")

Key takeaways:

Directories: Organise by purpose (raw, cache, output, docs)
Versions: Link outputs to inputs via manifests
Dev builds: Safe iteration without overwriting releases
Final builds: Versioned, archived, traceable releases
Restoration: One command to reconstruct project