Core concepts
This article explains the key concepts behind projr’s approach to reproducible research.
Single-purpose directories
The problem
Research projects often have an ad-hoc structure where:
- Data files are scattered across multiple locations
- Output files mix with source code
- Temporary files clutter the repository
- It’s unclear what each directory contains
This makes it difficult to:
- Share specific parts of the project
- Restore the project on a new machine
- Understand the project structure at a glance
The projr solution
projr organises projects into single-purpose directories, each with a clear role:
my-project/
├── _raw_data/ # Source data (never modified)
├── _output/ # Final outputs (figures, tables)
├── _tmp/ # Temporary/cache files
├── docs/ # Rendered documents (HTML, PDF)
├── R/ # Source code
├── analysis.Rmd # Analysis documents
└── _projr.yml # Configuration
Benefits:
- Clarity: Anyone can understand the structure from _projr.yml
- Selective sharing: Archive only what’s needed (e.g., data + outputs)
- Restoration: Easily restore projects on new machines
- Automation: projr can manage these directories automatically
Directory labels
Each directory has a label that describes its purpose:
- raw-data → source data
- cache → temporary files
- output → final outputs
- docs → rendered documents
Labels have prefixes that determine their role:
- raw-* → source inputs
- cache-* → temporary storage
- output-* → final outputs
- docs-* → documentation
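The prefix rule can be sketched in base R as follows. `label_role()` is a hypothetical helper written for illustration, not a projr function, and it assumes a label either equals a prefix or continues it with a hyphen.

```r
# Hypothetical helper (not part of projr): map a directory label to its role
# by checking which of the four prefixes it starts with.
label_role <- function(label) {
  roles <- c(
    raw = "source inputs",
    cache = "temporary storage",
    output = "final outputs",
    docs = "documentation"
  )
  for (prefix in names(roles)) {
    # Assumption: a label matches if it is the prefix itself or "<prefix>-...".
    if (label == prefix || startsWith(label, paste0(prefix, "-"))) {
      return(unname(roles[prefix]))
    }
  }
  NA_character_  # no known prefix
}

label_role("raw-data-public")
label_role("output-figures")
label_role("figures")
```

Under these assumptions, a label such as `figures` that carries no recognised prefix yields `NA`.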
Custom labels:
You can create custom labels following the prefix rules:
directories:
  raw-data-public:
    path: _raw_data_public
  raw-data-sensitive:
    path: _raw_data_sensitive
  output-figures:
    path: _output/figures
  output-tables:
    path: _output/tables
Versioned builds
The problem
In typical research projects:
- It’s unclear which version of data produced which outputs
- Reproducing an earlier analysis requires Git archaeology
- Dependencies (R packages) change over time
- No formal link between code version and outputs
The projr solution
projr implements versioned builds that:
- Assign a version to the entire project (e.g., v0.1.0)
- Version individual components (code, data, outputs)
- Link outputs to specific input versions via a manifest
- Optionally capture R dependencies with renv
Version numbers:
projr uses semantic versioning (x.y.z):
- Major (x): Breaking changes or major milestones
- Minor (y): New features or analyses
- Patch (z): Small changes or bug fixes
Example progression:
v0.1.0 Initial analysis
↓
v0.1.1 Fix typo in figure
↓
v0.2.0 Add sensitivity analysis
↓
v1.0.0 Final publication version
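The bump arithmetic behind this progression can be sketched in base R. `bump_version()` is a hypothetical helper, not a projr function; it assumes versions are written as `vX.Y.Z` and that lower components reset to zero when a higher one is bumped.

```r
# Hypothetical helper (not part of projr): bump one component of "vX.Y.Z".
bump_version <- function(version, component = c("patch", "minor", "major")) {
  component <- match.arg(component)
  # Drop the leading "v" and split into integer components.
  parts <- as.integer(strsplit(sub("^v", "", version), ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  if (component == "major") {
    parts <- c(parts[1] + 1L, 0L, 0L)
  } else if (component == "minor") {
    parts <- c(parts[1], parts[2] + 1L, 0L)
  } else {
    parts[3] <- parts[3] + 1L
  }
  paste0("v", paste(parts, collapse = "."))
}

bump_version("v0.1.0")           # "v0.1.1"
bump_version("v0.1.1", "minor")  # "v0.2.0"
bump_version("v0.2.0", "major")  # "v1.0.0"
```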
Component versioning
Each build versions:
- Code: Git commit SHA
- Raw data: Directory hash (SHA-256)
- Outputs: Directory hash
- Documents: Directory hash
- R packages: renv.lock file (optional)
These are linked in the manifest:
output/manifest.csv:
version: v0.1.0
code_sha: abc123...
raw_data_hash: def456...
output_hash: ghi789...
Traceability:
Given an output file, you can trace back to:
- Which project version created it
- Which raw data was used
- Which code (Git commit) generated it
- Which R packages were installed
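To make the idea of a directory hash concrete, here is a minimal base R sketch that fingerprints a directory file by file. `dir_fingerprint()` is a hypothetical helper, and MD5 from `tools::md5sum()` is used only as a dependency-free stand-in for the SHA-256 hashes described above; it is not a claim about how projr computes its hashes.

```r
# Hypothetical helper (not part of projr): one content hash per file.
# Assumption: MD5 stands in for SHA-256 to keep the example base-R only.
dir_fingerprint <- function(path) {
  files <- sort(list.files(path, recursive = TRUE))
  hashes <- unname(tools::md5sum(file.path(path, files)))
  data.frame(fn = files, hash = hashes, stringsAsFactors = FALSE)
}

# Demonstration in a temporary directory: editing a file changes its hash.
demo <- file.path(tempdir(), "raw_data_demo")
dir.create(demo, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(demo, "data.csv"))
before <- dir_fingerprint(demo)
writeLines("a,b\n1,3", file.path(demo, "data.csv"))
after <- dir_fingerprint(demo)

identical(before$hash, after$hash)  # FALSE: the content changed
```

Comparing fingerprints taken at two points in time is what lets a build record say which exact inputs produced which outputs.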
Development vs final builds
Development builds
Purpose: Safe iteration whilst coding
Characteristics:
- Routes outputs to cache (_tmp/projr/v<version>/)
- Doesn’t modify _output or docs
- No version bump
- No archiving
- Fast feedback loop
Use when:
- Testing code changes
- Debugging analysis
- Iterating on documents
- Checking output before committing
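The routing rule can be illustrated with a small base R sketch. `build_output_dir()` is a hypothetical helper, not projr's API, and the `output` subfolder inside the versioned cache directory is an assumption made for the example.

```r
# Hypothetical helper (not part of projr): where a build would write outputs.
# Assumption: dev builds use an "output" subfolder of the versioned cache.
build_output_dir <- function(build = c("dev", "final"), version) {
  build <- match.arg(build)
  if (build == "dev") {
    file.path("_tmp", "projr", paste0("v", version), "output")
  } else {
    "_output"
  }
}

build_output_dir("dev", "0.0.1")    # "_tmp/projr/v0.0.1/output"
build_output_dir("final", "0.0.1")  # "_output"
```

Because a development build never resolves to `_output`, released results cannot be overwritten while iterating.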
Final builds
Purpose: Create versioned releases
projr_build() # patch bump
projr_build_minor() # minor bump
projr_build_major() # major bump
Characteristics:
- Clears and populates _output and docs
- Bumps version number
- Creates manifest
- Archives to GitHub/OSF/local
- Commits to Git (if configured)
Use when:
- Ready to share results
- Creating a milestone
- Publishing results
- Archiving for posterity
Build clearing behaviour
Overview
projr manages when and how output directories are cleared during builds to ensure clean, reproducible results while preventing accidental data loss.
Clearing modes
The clear_output parameter controls when directories are cleared:
# Clear before build (default)
projr_build_patch(clear_output = "pre")
# Clear after build
projr_build_patch(clear_output = "post")
# Never clear automatically
projr_build_patch(clear_output = "never")
You can also set this globally via environment variable:
Sys.setenv(PROJR_CLEAR_OUTPUT = "post")
projr_build_patch() # Uses "post" mode
Mode comparison
| Mode | When cleared | Safe for iteration? | Use case |
|---|---|---|---|
| "pre" | Before build starts | ✓ Yes | Default: Clean slate, save directly to final locations |
| "post" | After build completes | ✓ Yes | Conservative: Preserve outputs until after successful build |
| "never" | Never | ✗ Manual control needed | Advanced: User manages all clearing |
What gets cleared?
In development builds (projr_build_dev()):
- Cache directories are cleared
- Final _output and docs directories are never touched
- Safe for rapid iteration
In final builds (projr_build(), etc.):
Mode "pre":
- Before build: Clears cache AND final _output directories
- During build: Renders to final locations directly
- After build: Final outputs contain only current build results
Mode "post":
- Before build: Clears cache only (preserves final _output)
- During build: Renders to cache
- After build: Clears final directories, then copies from cache
Mode "never":
- No automatic clearing
- New files may mix with old files
- Requires manual directory management
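The three modes for final builds can be summarised as a lookup of what is cleared before and after rendering. This is only a sketch of the rules listed above; `clear_plan()` is a hypothetical helper and does not touch any files.

```r
# Hypothetical helper (not part of projr): which directories a final build
# clears before and after rendering, for each clear_output mode.
clear_plan <- function(mode = c("pre", "post", "never")) {
  mode <- match.arg(mode)
  switch(mode,
    pre   = list(clear_before = c("cache", "output"), clear_after = character()),
    post  = list(clear_before = "cache",              clear_after = "output"),
    never = list(clear_before = character(),          clear_after = character())
  )
}

clear_plan("pre")$clear_before   # "cache" "output"
clear_plan("post")$clear_after   # "output"
```

Writing the rules as data makes the trade-off visible: only "post" defers touching the final outputs until the build has succeeded.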
Example workflows
Workflow 1: Default (clear before)
# Clean slate approach
projr_build_patch(clear_output = "pre")
# 1. Clears _output/ completely
# 2. Builds analysis
# 3. Saves outputs directly to _output/
# 4. Archives to remotes (if configured)
Workflow 2: Conservative (clear after)
# Safety-first approach
projr_build_patch(clear_output = "post")
# 1. Keeps existing _output/ intact
# 2. Builds to cache (_tmp/projr/v0.0.1/)
# 3. After successful build, clears _output/
# 4. Copies from cache to _output/
# 5. Archives to remotes (if configured)
Workflow 3: Manual control
# Advanced: Manual clearing
unlink("_output", recursive = TRUE) # Clear manually
projr_build_patch(clear_output = "never")
# 1. No automatic clearing
# 2. Builds to specified locations
# 3. User responsible for cleanup
Special: “old” directory preservation
The cache clearing process preserves a special old/ subdirectory for archival purposes:
_tmp/projr/v0.0.1/
├── output/ # Cleared each build
├── docs/ # Cleared each build
└── old/ # Never cleared automatically
└── archived_results.csv
This allows you to manually preserve important intermediate results across builds.
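A minimal base R sketch of this kind of selective clearing follows. `clear_cache_dir()` is a hypothetical helper that removes every top-level entry except `old/`; it illustrates the behaviour described here and is not projr's implementation.

```r
# Hypothetical helper (not part of projr): clear a cache directory but keep
# the entries named in `keep` (by default the "old" subdirectory).
clear_cache_dir <- function(path, keep = "old") {
  entries <- list.files(path, all.files = TRUE, no.. = TRUE)
  to_remove <- file.path(path, setdiff(entries, keep))
  unlink(to_remove, recursive = TRUE)
  invisible(to_remove)
}

# Demonstration in a temporary directory.
cache <- file.path(tempdir(), "cache_demo")
dir.create(file.path(cache, "output"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(cache, "old"), showWarnings = FALSE)
writeLines("x", file.path(cache, "output", "fig.txt"))
writeLines("y", file.path(cache, "old", "archived_results.csv"))

clear_cache_dir(cache)
list.files(cache, recursive = TRUE)  # "old/archived_results.csv"
```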
Manifests
What is a manifest?
A manifest is a CSV file that records:
- What was built
- Which inputs were used
- When it was built
- SHA-256 checksums for verification
Example manifest:
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
output,_output,def456...,v0.1.0,2024-01-15T10:35:00Z
docs,docs,ghi789...,v0.1.0,2024-01-15T10:35:00Z
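Because the manifest is plain CSV, it can be inspected with base R alone. The sketch below parses the example rows above from a string (the hashes are the same placeholders) and looks up one entry; it assumes nothing about projr beyond the column names shown.

```r
# Parse the example manifest from text; in a project you would read the file.
manifest_text <- 'label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
output,_output,def456...,v0.1.0,2024-01-15T10:35:00Z
docs,docs,ghi789...,v0.1.0,2024-01-15T10:35:00Z'

manifest <- read.csv(text = manifest_text, stringsAsFactors = FALSE)

# Which hash was recorded for the raw data in v0.1.0?
raw_hash <- manifest$hash[manifest$label == "raw-data" & manifest$version == "v0.1.0"]
raw_hash
```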
Why manifests matter
Verification: Check if files have been modified
# Compare current hash with manifest
tools::md5sum("_output/figure.png")
# vs hash in manifest
Traceability: Link outputs to inputs
“This figure was generated from raw data version abc123… using code commit def456…”
Reproducibility: Ensure the right inputs
“To reproduce this analysis, restore raw-data v0.1.0”
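The verification idea amounts to joining recorded hashes with current ones and labelling each file. Here is a sketch in base R; `compare_hashes()`, the file names, and the hash values are all hypothetical.

```r
# Hypothetical helper (not part of projr): label each file by comparing the
# hash recorded in a manifest with the hash computed now.
compare_hashes <- function(recorded, current) {
  both <- merge(recorded, current, by = "fn", all = TRUE,
                suffixes = c("_recorded", "_current"))
  both$status <- ifelse(
    is.na(both$hash_recorded), "added",
    ifelse(
      is.na(both$hash_current), "removed",
      ifelse(both$hash_recorded == both$hash_current, "unchanged", "modified")
    )
  )
  both[, c("fn", "status")]
}

# Made-up example data.
recorded <- data.frame(fn = c("a.csv", "b.csv", "c.csv"),
                       hash = c("h1", "h2", "h3"), stringsAsFactors = FALSE)
current <- data.frame(fn = c("a.csv", "b.csv", "d.csv"),
                      hash = c("h1", "hX", "h4"), stringsAsFactors = FALSE)

result <- compare_hashes(recorded, current)
result
```

Any row whose status is not "unchanged" signals that the directory no longer matches what the manifest recorded.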
Manifest location
Manifests are stored in:
- manifest.csv at project root (tracks all versions)
- Dev builds also create temporary manifests in cache
Querying manifests
projr provides functions to query the manifest and track changes across versions:
Find what changed between versions:
# See what files changed between v0.0.1 and v0.0.2
changes <- projr_manifest_changes("0.0.1", "0.0.2")
# Returns: label, fn, change_type (added/modified/removed), hash_from, hash_to
# Filter by directory
output_changes <- projr_manifest_changes("0.0.1", "0.0.2", label = "output")
Track files across a version range:
# See all files and when they last changed from v0.0.1 to current
file_history <- projr_manifest_range("0.0.1")
# Returns: label, fn, version_first, version_last_change, hash
Check when directories last changed:
# See most recent changes for each directory
last_changes <- projr_manifest_last_change()
# Returns: label, version_last_change, n_files
Archiving and restoration
Archive strategies
projr supports two archiving strategies:
1. Versioned archives (default)
Each build creates a new archive:
GitHub Releases:
v0.1.0/raw-data-v0.1.0.zip
v0.1.1/raw-data-v0.1.1.zip
v0.2.0/raw-data-v0.2.0.zip
- Preserves all versions
- Enables time travel
- Uses more storage
2. Latest-only archives
Each build overwrites the previous:
GitHub Releases:
latest/raw-data-latest.zip (always current)
- Saves storage space
- Loses history
- Faster downloads
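The difference between the two strategies shows up in how an archive is named. The sketch below reproduces the file names from the examples above; `archive_name()` and the strategy names "versioned" and "latest" are hypothetical, chosen for illustration.

```r
# Hypothetical helper (not part of projr): archive file name per strategy.
archive_name <- function(label, version, strategy = c("versioned", "latest")) {
  strategy <- match.arg(strategy)
  suffix <- if (strategy == "versioned") version else "latest"
  paste0(label, "-", suffix, ".zip")
}

archive_name("raw-data", "v0.1.0")            # "raw-data-v0.1.0.zip"
archive_name("raw-data", "v0.1.1")            # "raw-data-v0.1.1.zip"
archive_name("raw-data", "v0.1.1", "latest")  # "raw-data-latest.zip"
```

A versioned name is unique per build, so nothing is overwritten; the latest-only name is constant, so each upload replaces the previous one.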
Configure in _projr.yml:
Archive cues
Control when archives are uploaded:
- "always" - Upload on every build
- "new" - Upload only if content changed (default)
- "never" - Don’t upload
Restoration
Restoration pulls archived content back:
# Restore from GitHub
projr_restore_repo("owner/repo")
# Restore specific version
projr_restore(label = "raw-data", version = "v0.1.0")
# Restore specific label
projr_restore(label = "output")
Restoration order:
projr checks sources in order:
- GitHub Releases
- OSF
- Local archives
The first available source is used.
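The first-available rule can be sketched as an ordered search. `first_available()` and the availability flags are hypothetical; a real check would query each remote rather than look up a stored value.

```r
# Hypothetical helper (not part of projr): return the first source, in order,
# for which the availability check succeeds; NULL if none does.
first_available <- function(sources, is_available) {
  for (src in sources) {
    if (is_available(src)) return(src)
  }
  NULL
}

sources <- c("github", "osf", "local")
# Made-up availability: the GitHub release is missing, the others exist.
available <- c(github = FALSE, osf = TRUE, local = TRUE)

chosen <- first_available(sources, function(src) available[[src]])
chosen  # "osf"
```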
Profiles
What are profiles?
Profiles are alternative configurations for different contexts:
- Development vs production
- Public vs private sharing
- Individual vs collaborative workflows
How profiles work
A profile is a separate YAML file:
_projr.yml # Default configuration
_projr-dev.yml # Development profile
_projr-public.yml # Public sharing profile
Activate a profile:
Sys.setenv(PROJR_PROFILE = "dev")
Or in .Renviron:
PROJR_PROFILE=dev
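The mapping from profile name to file name can be sketched as follows. `profile_config_file()` is a hypothetical helper that follows the naming pattern shown above. How projr combines a profile file with the default configuration is not covered here.

```r
# Hypothetical helper (not part of projr): the configuration file implied by
# a profile name, falling back to the default when no profile is set.
profile_config_file <- function(profile = Sys.getenv("PROJR_PROFILE")) {
  if (!nzchar(profile)) {
    "_projr.yml"
  } else {
    paste0("_projr-", profile, ".yml")
  }
}

profile_config_file("dev")     # "_projr-dev.yml"
profile_config_file("public")  # "_projr-public.yml"
profile_config_file("")        # "_projr.yml"
```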
Environment variables
Configuration via environment
projr reads these environment variables:
PROJR_PROFILE
Activate a profile:
PROJR_OUTPUT_CLEAR
Control when _output is cleared:
- "pre" - Clear before build (default)
- "post" - Clear after build
- "none" - Never clear
PROJR_CACHE_CLEAR
Control cache clearing (same options as above).
Setting environment variables
In R:
Sys.setenv(PROJR_PROFILE = "dev")
Sys.setenv(PROJR_OUTPUT_CLEAR = "pre")
In .Renviron:
PROJR_PROFILE=dev
PROJR_OUTPUT_CLEAR=pre
Helper function:
projr_env_set(
  profile = "dev",
  output_clear = "pre"
)
Dependencies and renv
Why renv?
R package versions change over time. Code that works today might break in 6 months due to package updates.
renv locks package versions:
renv.lock:
{
  "R": {"Version": "4.3.0"},
  "Packages": {
    "dplyr": {"Version": "1.1.0"},
    "ggplot2": {"Version": "3.4.0"}
  }
}
projr + renv
projr integrates with renv:
Initialise:
Update lockfile:
projr_renv_update() # Wrapper for renv::snapshot()
Restore packages:
projr_renv_restore() # Wrapper for renv::restore()
The whole game
Putting it all together:
# 1. Initialise
projr_init()
# 2. Add raw data to _raw_data/
# 3. Write analysis code in .Rmd files
# 4. Iterate with dev builds
projr_build_dev("analysis.Rmd")
# Check outputs in _tmp/projr/v0.0.1/
# 5. When ready, create first release
projr_build()
# Outputs in _output/, archived to GitHub
# 6. Continue development
# ... edit code ...
projr_build_dev()
# 7. Create minor release with new analysis
projr_build_minor()
# 8. Share repository
# Collaborators run:
projr_restore_repo("you/your-project")
Key takeaways:
- Directories: Organise by purpose (raw, cache, output, docs)
- Versions: Link outputs to inputs via manifests
- Dev builds: Safe iteration without overwriting releases
- Final builds: Versioned, archived, traceable releases
- Restoration: One command to reconstruct project