Design
Design philosophy
This article explains the design decisions behind projr and why it works the way it does.
Design goals
1. Minimal cognitive overhead
Goal: Reduce the mental burden of maintaining reproducible research.
How projr achieves this:
- One function to remember: projr_build() does everything
- Sensible defaults: Most projects work out-of-the-box
- Convention over configuration: Standard directory structure
- Gradual complexity: Start simple, customise as needed
Anti-pattern: Complex build systems requiring extensive configuration, multiple commands, and deep understanding of internals.
Example: Compare a typical make-based workflow:
# Traditional approach
make clean
make data
make analysis
make figures
make paper
make deploy-to-osf
git add .
git commit -m "Update"
git push
gh release create v0.1.0
# ... upload files manually ...
With projr:
projr_build() # That's it
2. Fail-safe iteration
Goal: Make it safe to experiment without losing work.
How projr achieves this:
- Dev builds: Route outputs to cache, never touch _output
- Manifests: Always know what inputs created what outputs
- Git integration: Automatic commits preserve history
- Reversible versioning: Can always access previous versions via Git + archives
Anti-pattern: In-place modification of output directories leading to lost results.
Example: Without projr, you might:
# Accidentally overwrite yesterday's figures
rmarkdown::render("analysis.Rmd") # Oh no, the new plot is worse!
# Now you've lost the good version
With projr:
# Safe iteration
projr_build_dev() # Outputs to _tmp/
# Check results, if bad, just run again
# If good:
projr_build() # Now commit to _output
3. Automation without magic
Goal: Automate tedious tasks whilst maintaining transparency.
How projr achieves this:
- Explicit configuration: _projr.yml makes everything visible
- Predictable behaviour: Same inputs → same outputs
- Inspectable artefacts: Manifests, build logs, Git history
- No hidden state: All configuration in version-controlled files
Anti-pattern: Build systems with hidden state, implicit dependencies, or configuration scattered across multiple locations.
Example: projr’s manifest system:
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15
You can inspect, version-control, and audit this. Compare to a system that tracks dependencies in a binary database or in-memory cache.
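To make the idea concrete, here is a minimal sketch of how a manifest like the one above could be produced with base R. It is not projr's implementation: `build_manifest()` is a hypothetical helper, and the one-row-per-file granularity and the use of MD5 hashes are assumptions made for illustration. Only the column names are taken from the example above.

```r
# Hypothetical helper (not part of projr): one manifest row per file in `dir`.
# Assumptions: per-file rows and MD5 hashes; column names follow the example.
build_manifest <- function(label, dir, version) {
  files <- sort(list.files(dir, recursive = TRUE))
  stamp <- format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  data.frame(
    label = rep(label, length(files)),
    path = file.path(basename(dir), files),
    hash = unname(tools::md5sum(file.path(dir, files))),
    version = rep(version, length(files)),
    timestamp = rep(stamp, length(files)),
    stringsAsFactors = FALSE
  )
}

# Usage: hash a throwaway directory and write the manifest as plain CSV.
raw_dir <- file.path(tempfile("projr_demo_"), "_raw_data")
dir.create(raw_dir, recursive = TRUE)
writeLines(c("id,value", "1,10"), file.path(raw_dir, "measurements.csv"))

manifest <- build_manifest("raw-data", raw_dir, "v0.1.0")
manifest_file <- file.path(dirname(raw_dir), "manifest.csv")
write.csv(manifest, manifest_file, row.names = FALSE)

# Any tool that reads CSV can audit the result.
manifest_read <- read.csv(manifest_file, colClasses = "character")
```

Because the result is an ordinary text file, it can be committed to Git and reviewed like any other change.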
4. Reproducibility by default
Goal: Make it easier to be reproducible than not.
How projr achieves this:
- Automatic versioning: Every build is versioned
- Manifests: Input-output linkage is automatic
- renv integration: Optional package version locking
- Archiving: Automatic upload to GitHub/OSF
- Restoration: One command to reconstruct
Anti-pattern: Reproducibility as an afterthought requiring manual effort.
Example: Without thinking about it, projr users get:
v0.1.0:
- code: Git SHA abc123
- data: Hash def456
- outputs: Hash ghi789
- packages: renv.lock
- archived: GitHub Release v0.1.0
All automatically. To reproduce:
projr_restore_repo("owner/repo")
renv::restore()
projr_build()
Core design principles
Single-purpose directories
Principle: Each directory has exactly one purpose.
Rationale:
- Clarity: No ambiguity about where things go
- Selective sharing: Archive only what’s needed
- Automation: Tools can act on directories knowing their contents
- Restoration: Simple mapping from label to path
Trade-offs:
- ✅ Structure is immediately obvious
- ✅ Easy to share/archive specific parts
- ❌ More directories than minimal structure
- ❌ Some redundancy (e.g., separate output-figures and output-tables)
Why this trade-off is worth it:
The clarity and automation benefits outweigh the slight increase in directory count. Modern file systems handle many directories efficiently.
Versioned builds, not versioned files
Principle: Version the entire project state, not individual files.
Rationale:
- Coherence: All files at version X are consistent with each other
- Simplicity: One version number, not per-file versions
- Traceability: Know exactly what produced what
- Restoration: Restore entire consistent state
Trade-offs:
- ✅ Simpler mental model (one version)
- ✅ Guarantees consistency
- ❌ Version bumps even for small changes
- ❌ Can’t mix versions of different components
Why this trade-off is worth it:
Scientific outputs depend on multiple inputs. Versioning the whole project ensures you can always reconstruct a consistent state. Per-file versioning leads to combinatorial explosion of possible states.
Configuration in YAML, not code
Principle: Project structure and build behaviour live in _projr.yml, not scattered across code.
Rationale:
- Centralised: One place to understand project configuration
- Readable: YAML is human-readable
- Version-controlled: Configuration changes are tracked
- Shareable: Easy to share configuration across projects
Trade-offs:
- ✅ Configuration is explicit and visible
- ✅ Easy to diff and merge configuration changes
- ❌ Less flexible than code-based configuration
- ❌ YAML syntax can be tricky
Why this trade-off is worth it:
Most research projects don’t need the flexibility of code-based configuration. The benefits of having a single, readable, version-controlled configuration file outweigh the limitations.
Dev builds vs final builds
Principle: Separate safe iteration from committed releases.
Rationale:
- Safety: Dev builds can’t accidentally overwrite released outputs
- Speed: Dev builds skip versioning and archiving
- Clarity: Explicit distinction between “testing” and “committing”
Trade-offs:
- ✅ Safe experimentation
- ✅ Fast feedback loop
- ❌ Two commands to remember (dev vs final)
- ❌ Cache directory can grow large
Why this trade-off is worth it:
The safety and speed benefits are critical for iterative research. The cost of remembering two commands is minimal.
Git integration, not Git dependency
Principle: projr works with or without Git, but works better with it.
Rationale:
- Accessibility: Beginners can use projr without learning Git
- Power: Advanced users get automatic Git integration
- Flexibility: Use Git features without learning them
Trade-offs:
- ✅ Low barrier to entry
- ✅ Automatic Git for those who want it
- ❌ More complex codebase (supporting both paths)
- ❌ Some features require Git (versioning)
Why this trade-off is worth it:
Git is powerful but intimidating. By making it optional, projr reaches more users whilst still offering Git benefits to those who want them.
Architecture
Layered design
projr is organised into layers:
User-facing API (projr_build, projr_init, ...)
↓
Configuration layer (YAML parsing, validation)
↓
Build engine (rendering, versioning, archiving)
↓
Backend services (Git, GitHub, OSF, file system)
Benefits:
- Modularity: Each layer can be tested independently
- Extensibility: New backends (e.g., Zenodo) can be added
- Clarity: Separation of concerns
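One way to picture the extensibility claim is a small backend registry, sketched below in base R. This is an illustration of the layering idea only, not projr's internal code: `register_backend()`, `archive_with()` and the "local" backend are hypothetical names invented for this example.

```r
# Hypothetical registry: the build engine only knows backend names,
# not how each backend stores files.
backends <- new.env()

register_backend <- function(name, upload) {
  stopifnot(is.character(name), length(name) == 1, is.function(upload))
  assign(name, upload, envir = backends)
  invisible(name)
}

archive_with <- function(name, files, ...) {
  if (!exists(name, envir = backends, inherits = FALSE)) {
    stop("Unknown backend: ", name)
  }
  upload <- get(name, envir = backends, inherits = FALSE)
  upload(files, ...)
}

# A toy "local" backend that copies files into a destination directory.
register_backend("local", function(files, dest) {
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  file.copy(files, dest, overwrite = TRUE)
})

# Usage: archive one file through the registry.
src_file <- tempfile(fileext = ".txt")
writeLines("result", src_file)
dest_dir <- tempfile("archive_")
copied <- archive_with("local", src_file, dest = dest_dir)
```

Under this sketch, adding a new destination means registering one more function; the calling code does not change.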
Function naming conventions
projr uses systematic naming:
- projr_* - All exported functions
- .projr_* - Internal functions (not exported)
- projr_build* - Build-related functions
- projr_init* - Initialisation functions
- projr_yml_* - YAML configuration functions
- projr_path_* - Path helper functions
Benefits:
- Discoverability: Autocomplete groups related functions
- Clarity: Function purpose is obvious from name
- Namespace: All public functions prefixed to avoid conflicts
Configuration precedence
projr uses this precedence for configuration:
- Environment variables (highest)
- Profile YAML (_projr-{profile}.yml)
- Default YAML (_projr.yml)
- Built-in defaults (lowest)
Example:
# Built-in default
output: _output
# Overridden in _projr.yml
output: _my_output
# Overridden in _projr-dev.yml (if PROJR_PROFILE=dev)
output: _dev_output
# Overridden by environment variable (if set)
PROJR_OUTPUT_DIR=_temp_output
Final value: _temp_output
Benefits:
- Flexibility: Different contexts without editing files
- Explicitness: Clear hierarchy of precedence
- Debuggability: Easy to trace where a setting comes from
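The precedence order can be expressed as a short lookup that returns the first layer with a value. The sketch below is an assumption-laden illustration, not projr's code: `resolve_setting()` is a hypothetical helper, and the YAML layers are represented as plain R lists standing in for parsed files.

```r
# Hypothetical resolver: the first layer that defines `key` wins.
# Order: environment variable, profile YAML, default YAML, built-in default.
resolve_setting <- function(key, env_var, profile_yml = list(),
                            default_yml = list(), builtin = list()) {
  env_value <- Sys.getenv(env_var, unset = NA_character_)
  if (!is.na(env_value) && nzchar(env_value)) {
    return(env_value)
  }
  for (layer in list(profile_yml, default_yml, builtin)) {
    if (!is.null(layer[[key]])) {
      return(layer[[key]])
    }
  }
  NULL
}

# Usage, mirroring the example above (lists stand in for parsed YAML).
builtin <- list(output = "_output")
default_yml <- list(output = "_my_output")
profile_yml <- list(output = "_dev_output")

Sys.unsetenv("PROJR_OUTPUT_DIR")
without_env <- resolve_setting("output", "PROJR_OUTPUT_DIR",
                               profile_yml, default_yml, builtin)

Sys.setenv(PROJR_OUTPUT_DIR = "_temp_output")
with_env <- resolve_setting("output", "PROJR_OUTPUT_DIR",
                            profile_yml, default_yml, builtin)
Sys.unsetenv("PROJR_OUTPUT_DIR")
```

With the environment variable unset, the profile value is returned; once it is set, it overrides every YAML layer.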
Manifest format
Manifests use CSV for simplicity and compatibility:
label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z
Why CSV?
- Universal: Readable by any tool (R, Python, Excel)
- Simple: No complex parsing
- Diff-friendly: Git can show line-by-line changes
- Human-readable: Open in text editor or spreadsheet
Alternative considered: JSON
- ✅ More structured
- ❌ Less human-readable
- ❌ Harder to diff
- ❌ Overkill for simple tabular data
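As a sketch of why a tabular format is convenient, two manifest versions can be compared with a plain `merge()`. The helper `manifest_changes()` and the file-level rows below are hypothetical and use made-up hashes; they only illustrate the kind of audit a CSV manifest allows.

```r
# Hypothetical helper: classify each path as added, removed, modified
# or unchanged between two manifests.
manifest_changes <- function(old, new) {
  keys <- c("label", "path")
  merged <- merge(old[c(keys, "hash")], new[c(keys, "hash")],
                  by = keys, all = TRUE, suffixes = c("_old", "_new"))
  merged$status <- ifelse(
    is.na(merged$hash_old), "added",
    ifelse(is.na(merged$hash_new), "removed",
           ifelse(merged$hash_old == merged$hash_new, "unchanged", "modified"))
  )
  merged
}

# Usage with two small, made-up manifests.
old <- read.csv(text = "label,path,hash,version
raw-data,_raw_data/a.csv,aaa111,v0.1.0
raw-data,_raw_data/b.csv,bbb222,v0.1.0", colClasses = "character")

new <- read.csv(text = "label,path,hash,version
raw-data,_raw_data/a.csv,aaa111,v0.2.0
raw-data,_raw_data/b.csv,bbb999,v0.2.0
raw-data,_raw_data/c.csv,ccc333,v0.2.0", colClasses = "character")

changes <- manifest_changes(old, new)
status_by_path <- setNames(changes$status, changes$path)
```

Here `a.csv` is reported as unchanged, `b.csv` as modified and `c.csv` as added.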
Design decisions
Why semantic versioning?
Decision: Use x.y.z versioning (major.minor.patch)
Rationale:
- Familiar: Most developers know SemVer
- Expressive: Can communicate scale of changes
- Tooling: Many tools understand SemVer
Alternative considered: Date-based versioning (2024-01-15)
- ✅ Chronological ordering
- ❌ Doesn’t communicate significance of changes
- ❌ Multiple versions per day require disambiguation
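For readers less familiar with x.y.z versioning, the sketch below shows the usual bump rule: incrementing one component resets the components to its right. `bump_version()` is a hypothetical helper written for this article, not a projr function.

```r
# Hypothetical helper: bump one component of an x.y.z version string
# and reset the lower components to zero. A leading "v" is accepted.
bump_version <- function(version, component = c("patch", "minor", "major")) {
  component <- match.arg(component)
  parts <- as.integer(strsplit(sub("^v", "", version), ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  idx <- match(component, c("major", "minor", "patch"))
  parts[idx] <- parts[idx] + 1L
  if (idx < 3) {
    parts[(idx + 1):3] <- 0L
  }
  paste(parts, collapse = ".")
}

# Usage
patch_bump <- bump_version("0.1.9", "patch")
minor_bump <- bump_version("v0.1.9", "minor")
major_bump <- bump_version("1.2.3", "major")

# Base R compares versions numerically, not as text.
is_newer <- numeric_version(patch_bump) > numeric_version("0.1.9")
```

The numeric comparison matters because sorting version strings as text would place 0.1.10 before 0.1.9.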
Why default to GitHub Releases?
Decision: GitHub Releases is the default archive destination
Rationale:
- Ubiquity: Most R projects already use GitHub
- Free: Unlimited public releases, generous private quotas
- Integrated: Works with existing Git workflow
- Accessible: Web interface for downloads
Alternative considered: OSF as primary
- ✅ Designed for research
- ✅ Better for large datasets
- ❌ Separate account/authentication
- ❌ Less familiar to R developers
Solution: Support both; default to GitHub for familiarity.
Why clear _output before builds?
Decision: Default to clearing _output before final builds
Rationale:
- Correctness: Ensures outputs match current code
- No cruft: Old outputs don’t linger
- Idempotency: Same code → same outputs
Alternative considered: Incremental updates
- ✅ Faster (only update changed files)
- ❌ Risk of stale files
- ❌ Non-deterministic (depends on previous state)
Solution: Clear by default; allow override via PROJR_OUTPUT_CLEAR.
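The clearing step itself is simple, as the base R sketch below suggests. `clear_dir()` is a hypothetical helper, and its `clear` argument stands in for whatever override the user has configured; how projr actually reads PROJR_OUTPUT_CLEAR is not shown here.

```r
# Hypothetical helper: empty a directory before a build so that no stale
# files survive. The directory itself is kept (or created if missing).
clear_dir <- function(dir, clear = TRUE) {
  if (!dir.exists(dir)) {
    dir.create(dir, recursive = TRUE)
    return(invisible(character(0)))
  }
  if (!clear) {
    return(invisible(character(0)))
  }
  stale <- list.files(dir, full.names = TRUE, all.files = TRUE, no.. = TRUE)
  unlink(stale, recursive = TRUE)
  invisible(stale)
}

# Usage: a stale figure from an earlier build is removed.
out_dir <- file.path(tempfile("projr_demo_"), "_output")
dir.create(out_dir, recursive = TRUE)
writeLines("old plot", file.path(out_dir, "figure-old.txt"))

removed <- clear_dir(out_dir)
remaining <- list.files(out_dir, all.files = TRUE, no.. = TRUE)
```

Starting from an empty directory is what makes the output depend only on the current code, not on what earlier builds left behind.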
Why route dev builds to cache?
Decision: Dev builds write to _tmp/projr/v<version>/, not _output
Rationale:
- Safety: Can’t accidentally overwrite released outputs
- Isolation: Multiple dev builds don’t conflict
- Cleanup: Cache can be deleted without losing work
Alternative considered: Use _output with flag to prevent overwrites
- ✅ Simpler mental model (one output location)
- ❌ Risk of accidental overwrites
- ❌ Harder to keep dev and release outputs separate
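The routing decision amounts to choosing a different root directory per build type. The sketch below illustrates this with a hypothetical helper, `build_output_dir()`; the directory layout follows the decision stated above, but the function name and arguments are assumptions.

```r
# Hypothetical helper: dev builds go to a versioned cache directory,
# final builds go to _output.
build_output_dir <- function(version, dev = TRUE, project_root = ".") {
  if (dev) {
    file.path(project_root, "_tmp", "projr", paste0("v", version))
  } else {
    file.path(project_root, "_output")
  }
}

# Usage
dev_dir <- build_output_dir("0.1.0", dev = TRUE)
final_dir <- build_output_dir("0.1.0", dev = FALSE)
```

Because the two paths never coincide, a dev build has no way to overwrite a released output.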
Why YAML not TOML/JSON?
Decision: Use YAML for configuration
Rationale:
- Familiar: Most R users know YAML (R Markdown, pkgdown)
- Readable: Comments, no quotes on strings
- Expressive: Supports lists, nested structures
Alternatives considered:
TOML:
- ✅ Simpler syntax
- ❌ Less familiar in R ecosystem
- ❌ Harder to nest deeply
JSON:
- ✅ Strict, machine-friendly
- ❌ Less human-readable (quotes, no comments)
- ❌ Harder to hand-edit
Future directions
Potential enhancements
These are design considerations for future versions:
1. Incremental builds
Idea: Only rebuild changed documents
Pros: Faster builds, less re-rendering
Cons: More complexity, risk of stale outputs
Decision: Consider for v2.0 with careful invalidation logic
2. Dependency graphs
Idea: Track which outputs depend on which inputs
Pros: Finer-grained rebuilding, better traceability
Cons: Complexity, requires analysing code
Decision: Interesting but out-of-scope for now
3. Remote execution
Idea: Build on CI/cloud instead of locally
Pros: Reproducible environment, faster hardware
Cons: Network dependency, setup complexity
Decision: Possible via existing CI integrations (GitHub Actions)
4. Multi-language support
Idea: Support Python, Julia, etc., not just R
Pros: Broader audience, more use cases
Cons: Different ecosystems, more maintenance
Decision: Focus on R first; generalise later if demand exists
Comparison to alternatives
projr vs targets
targets: Pipeline tool for dependency tracking
Similarities:
- Both focus on reproducibility
- Both integrate with R Markdown
Differences:
- targets: Focuses on caching intermediate results
- projr: Focuses on versioning and archiving final outputs
Use together? Yes! Use targets for complex pipelines, projr for versioning and sharing.
projr vs workflowr
workflowr: Website-based project template
Similarities:
- Both provide project structure
- Both integrate with Git
Differences:
- workflowr: Focuses on website generation
- projr: Focuses on versioning and archiving
Use together? Potentially, though there’s overlap in Git integration.
Conclusion
projr’s design prioritises:
- Simplicity: One function does everything
- Safety: Dev builds can’t break releases
- Transparency: Configuration is visible and version-controlled
- Reproducibility: Automatic versioning and archiving
These principles guide every design decision, from directory structure to function naming to configuration format.
The result is a tool that makes reproducible research easier than non-reproducible research—which is exactly the point.