Design • projr

Design philosophy

This article explains the design decisions behind projr and why it works the way it does.

Design goals

1. Minimal cognitive overhead

Goal: Reduce the mental burden of maintaining reproducible research.

How projr achieves this:

One function to remember: projr_build() does everything
Sensible defaults: Most projects work out-of-the-box
Convention over configuration: Standard directory structure
Gradual complexity: Start simple, customise as needed

Anti-pattern: Complex build systems requiring extensive configuration, multiple commands, and deep understanding of internals.

Example: Compare a typical make-based workflow:

# Traditional approach
make clean
make data
make analysis
make figures
make paper
make deploy-to-osf
git add .
git commit -m "Update"
git push
gh release create v0.1.0
# ... upload files manually ...

With projr:

projr_build()  # That's it

2. Fail-safe iteration

Goal: Make it safe to experiment without losing work.

How projr achieves this:

Dev builds: Route outputs to cache, never touch _output
Manifests: Always know what inputs created what outputs
Git integration: Automatic commits preserve history
Reversible versioning: Can always access previous versions via Git + archives

Anti-pattern: In-place modification of output directories leading to lost results.

Example: Without projr, you might:

# Accidentally overwrite yesterday's figures
render("analysis.Rmd")  # Oh no, the new plot is worse!
# Now you've lost the good version

With projr:

# Safe iteration
projr_build_dev()  # Outputs to _tmp/
# Check results, if bad, just run again
# If good:
projr_build()  # Now commit to _output

3. Automation without magic

Goal: Automate tedious tasks whilst maintaining transparency.

How projr achieves this:

Explicit configuration: _projr.yml makes everything visible
Predictable behaviour: Same inputs → same outputs
Inspectable artefacts: Manifests, build logs, Git history
No hidden state: All configuration in version-controlled files

Anti-pattern: Build systems with hidden state, implicit dependencies, or configuration scattered across multiple locations.

Example: projr’s manifest system:

label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15

You can inspect, version-control, and audit this. Compare to a system that tracks dependencies in a binary database or in-memory cache.

4. Reproducibility by default

Goal: Make it easier to be reproducible than not.

How projr achieves this:

Automatic versioning: Every build is versioned
Manifests: Inputs-outputs linkage is automatic
renv integration: Optional package version locking
Archiving: Automatic upload to GitHub/OSF
Restoration: One command to reconstruct

Anti-pattern: Reproducibility as an afterthought requiring manual effort.

Example: Without thinking about it, projr users get:

v0.1.0:
  - code: Git SHA abc123
  - data: Hash def456
  - outputs: Hash ghi789
  - packages: renv.lock
  - archived: GitHub Release v0.1.0

All automatically. To reproduce:

projr_restore_repo("owner/repo")
renv::restore()
projr_build()

Core design principles

Single-purpose directories

Principle: Each directory has exactly one purpose.

Rationale:

Clarity: No ambiguity about where things go
Selective sharing: Archive only what’s needed
Automation: Tools can act on directories knowing their contents
Restoration: Simple mapping from label to path

Trade-offs:

✅ Structure is immediately obvious
✅ Easy to share/archive specific parts
❌ More directories than minimal structure
❌ Some redundancy (e.g., separate output-figures and output-tables)

Why this trade-off is worth it:

The clarity and automation benefits outweigh the slight increase in directory count. Modern file systems handle many directories efficiently.

Versioned builds, not versioned files

Principle: Version the entire project state, not individual files.

Rationale:

Coherence: All files at version X are consistent with each other
Simplicity: One version number, not per-file versions
Traceability: Know exactly what produced what
Restoration: Restore entire consistent state

Trade-offs:

✅ Simpler mental model (one version)
✅ Guarantees consistency
❌ Version bumps even for small changes
❌ Can’t mix versions of different components

Why this trade-off is worth it:

Scientific outputs depend on multiple inputs. Versioning the whole project ensures you can always reconstruct a consistent state. Per-file versioning leads to combinatorial explosion of possible states.

Configuration in YAML, not code

Principle: Project structure and build behaviour in _projr.yml, not scattered across code.

Rationale:

Centralised: One place to understand project configuration
Readable: YAML is human-readable
Version-controlled: Configuration changes are tracked
Shareable: Easy to share configuration across projects

Trade-offs:

✅ Configuration is explicit and visible
✅ Easy to diff and merge configuration changes
❌ Less flexible than code-based configuration
❌ YAML syntax can be tricky

Why this trade-off is worth it:

Most research projects don’t need the flexibility of code-based configuration. The benefits of having a single, readable, version-controlled configuration file outweigh the limitations.

Dev builds vs final builds

Principle: Separate safe iteration from committed releases.

Rationale:

Safety: Dev builds can’t accidentally overwrite released outputs
Speed: Dev builds skip versioning and archiving
Clarity: Explicit distinction between “testing” and “committing”

Trade-offs:

✅ Safe experimentation
✅ Fast feedback loop
❌ Two commands to remember (dev vs final)
❌ Cache directory can grow large

Why this trade-off is worth it:

The safety and speed benefits are critical for iterative research. The cost of remembering two commands is minimal.

Git integration, not Git dependency

Principle: projr works with or without Git, but works better with it.

Rationale:

Accessibility: Beginners can use projr without learning Git
Power: Advanced users get automatic Git integration
Flexibility: Use Git features without learning them

Trade-offs:

✅ Low barrier to entry
✅ Automatic Git for those who want it
❌ More complex codebase (supporting both paths)
❌ Some features require Git (versioning)

Why this trade-off is worth it:

Git is powerful but intimidating. By making it optional, projr reaches more users whilst still offering Git benefits to those who want them.

Architecture

Layered design

projr is organised into layers:

User-facing API (projr_build, projr_init, ...)
         ↓
Configuration layer (YAML parsing, validation)
         ↓
Build engine (rendering, versioning, archiving)
         ↓
Backend services (Git, GitHub, OSF, file system)

Benefits:

Modularity: Each layer can be tested independently
Extensibility: New backends (e.g., Zenodo) can be added
Clarity: Separation of concerns

Function naming conventions

projr uses systematic naming:

projr_* - All exported functions
.projr_* - Internal functions (not exported)
projr_build* - Build-related functions
projr_init* - Initialisation functions
projr_yml_* - YAML configuration functions
projr_path_* - Path helper functions

Benefits:

Discoverability: Autocomplete groups related functions
Clarity: Function purpose is obvious from name
Namespace: All public functions prefixed to avoid conflicts

Configuration precedence

projr uses this precedence for configuration:

Environment variables (highest)
Profile YAML (_projr-{profile}.yml)
Default YAML (_projr.yml)
Built-in defaults (lowest)

Example:

# Built-in default
output: _output

# Overridden in _projr.yml
output: _my_output

# Overridden in _projr-dev.yml (if PROJR_PROFILE=dev)
output: _dev_output

# Overridden by environment variable (if set)
PROJR_OUTPUT_DIR=_temp_output

Final value: _temp_output

Benefits:

Flexibility: Different contexts without editing files
Explicitness: Clear hierarchy of precedence
Debuggability: Easy to trace where a setting comes from

Manifest format

Manifests use CSV for simplicity and compatibility:

label,path,hash,version,timestamp
raw-data,_raw_data,abc123...,v0.1.0,2024-01-15T10:30:00Z

Why CSV?

Universal: Readable by any tool (R, Python, Excel)
Simple: No complex parsing
Diff-friendly: Git can show line-by-line changes
Human-readable: Open in text editor or spreadsheet

Alternative considered: JSON

✅ More structured
❌ Less human-readable
❌ Harder to diff
❌ Overkill for simple tabular data

Design decisions

Why semantic versioning?

Decision: Use x.y.z versioning (major.minor.patch)

Rationale:

Familiar: Most developers know SemVer
Expressive: Can communicate scale of changes
Tooling: Many tools understand SemVer

Alternative considered: Date-based versioning (2024-01-15)

✅ Chronological ordering
❌ Doesn’t communicate significance of changes
❌ Multiple versions per day require disambiguation

Why default to GitHub Releases?

Decision: GitHub Releases is the default archive destination

Rationale:

Ubiquity: Most R projects already use GitHub
Free: Unlimited public releases, generous private quotas
Integrated: Works with existing Git workflow
Accessible: Web interface for downloads

Alternative considered: OSF as primary

✅ Designed for research
✅ Better for large datasets
❌ Separate account/authentication
❌ Less familiar to R developers

Solution: Support both; default to GitHub for familiarity.

Why clear _output before builds?

Decision: Default to clearing _output before final builds

Rationale:

Correctness: Ensures outputs match current code
No cruft: Old outputs don’t linger
Idempotency: Same code → same outputs

Alternative considered: Incremental updates

✅ Faster (only update changed files)
❌ Risk of stale files
❌ Non-deterministic (depends on previous state)

Solution: Clear by default; allow override via PROJR_OUTPUT_CLEAR.

Why route dev builds to cache?

Decision: Dev builds write to _tmp/projr/v<version>/ not _output

Rationale:

Safety: Can’t accidentally overwrite released outputs
Isolation: Multiple dev builds don’t conflict
Cleanup: Cache can be deleted without losing work

Alternative considered: Use _output with flag to prevent overwrites

✅ Simpler mental model (one output location)
❌ Risk of accidental overwrites
❌ Harder to keep dev and release outputs separate

Why YAML not TOML/JSON?

Decision: Use YAML for configuration

Rationale:

Familiar: Most R users know YAML (R Markdown, pkgdown)
Readable: Comments, no quotes on strings
Expressive: Supports lists, nested structures

Alternatives considered:

TOML: - ✅ Simpler syntax - ❌ Less familiar in R ecosystem - ❌ Harder to nest deeply

JSON: - ✅ Strict, machine-friendly - ❌ Less human-readable (quotes, no comments) - ❌ Harder to hand-edit

Future directions

Potential enhancements

These are design considerations for future versions:

1. Incremental builds

Idea: Only rebuild changed documents

Pros: Faster builds, less re-rendering

Cons: More complexity, risk of stale outputs

Decision: Consider for v2.0 with careful invalidation logic

2. Dependency graphs

Idea: Track which outputs depend on which inputs

Pros: Finer-grained rebuilding, better traceability

Cons: Complexity, requires analysing code

Decision: Interesting but out-of-scope for now

3. Remote execution

Idea: Build on CI/cloud instead of locally

Pros: Reproducible environment, faster hardware

Cons: Network dependency, setup complexity

Decision: Possible via existing CI integrations (GitHub Actions)

4. Multi-language support

Idea: Support Python, Julia, etc., not just R

Pros: Broader audience, more use cases

Cons: Different ecosystems, more maintenance

Decision: Focus on R first; generalise later if demand exists

Comparison to alternatives

projr vs targets

targets: Pipeline tool for dependency tracking

Similarities: - Both focus on reproducibility - Both integrate with R Markdown

Differences: - targets: Focuses on caching intermediate results - projr: Focuses on versioning and archiving final outputs

Use together? Yes! Use targets for complex pipelines, projr for versioning and sharing.

projr vs workflowr

workflowr: Website-based project template

Similarities: - Both provide project structure - Both integrate with Git

Differences: - workflowr: Focuses on website generation - projr: Focuses on versioning and archiving

Use together? Potentially, though there’s overlap in Git integration.

projr vs usethis

usethis: Package development infrastructure

Similarities: - Both automate setup tasks - Both follow conventions

Differences: - usethis: For R packages - projr: For research projects

Use together? Yes! Use usethis for package development, projr for analysis projects.

Conclusion

projr’s design prioritises:

Simplicity: One function does everything
Safety: Dev builds can’t break releases
Transparency: Configuration is visible and version-controlled
Reproducibility: Automatic versioning and archiving

These principles guide every design decision, from directory structure to function naming to configuration format.

The result is a tool that makes reproducible research easier than non-reproducible research—which is exactly the point.