betydata R data package with BETYdb public data export #12
divine7022 wants to merge 58 commits into main from
Conversation
Pull request overview
This PR delivers the initial release (v0.1.0) of betydata, an R data package providing offline access to public data from the BETYdb (Biofuel Ecophysiological Traits and Yields) database. The package enables reproducible analyses of plant traits and crop yields without requiring database connectivity.
Changes:
- Complete R package structure with 16 datasets (traitsview + 15 support tables) totaling 43,532+ trait and yield records
- Multiple data formats: lazy-loaded .rda files, Parquet alternatives, and Frictionless metadata (datapackage.json)
- Comprehensive documentation: roxygen2 docs for all datasets, 4 vignettes (orientation, sql-analogs, pfts-priors, manuscript), and GitHub issue templates
- Quality controls: excludes checked=-1 records, public data only (access_level >= 4), full test coverage
- CI/CD infrastructure: GitHub Actions R-CMD-check workflow, testthat 3.0 test suite
Reviewed changes
Copilot reviewed 38 out of 71 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| DESCRIPTION | Package metadata and dependencies; minor email format issue |
| CITATION.cff | Citation metadata; email and missing preferred-citation issues |
| LICENSE | BSD-3-Clause license file |
| README.md | Comprehensive package documentation; table formatting issue |
| NEWS.md | Release notes documenting v0.1.0 |
| R/betydata-package.R | Package-level documentation |
| R/data.R | Roxygen2 documentation for all 16 datasets |
| man/*.Rd | Generated documentation files for datasets |
| vignettes/*.Rmd | Four tutorial vignettes; minor issues in manuscript.Rmd and pfts-priors.Rmd |
| tests/testthat/*.R | Test suite for data and metadata validation; deprecated context() calls |
| data-raw/make-data.R | Data build script for generating .rda and Parquet files |
| inst/metadata/datapackage.json | Frictionless Data package metadata |
| inst/extdata/parquet/*.parquet | Sample Parquet data files |
| data/*.rda | Binary R data files (compressed with xz) |
| .github/workflows/*.yaml | GitHub Actions CI configuration |
| .github/ISSUE_TEMPLATE/*.md | Issue templates for data corrections and verifications |
| .gitignore, .Rbuildignore | Build and version control configuration; CSV exclusion concern |
Comments suppressed due to low confidence (2)
tests/testthat/test-metadata.R:3
The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed, as they are no longer needed and will generate warnings.
tests/testthat/test-data.R:3
The context() function on line 3 is deprecated in testthat 3.0.0 and later. As above, the context() call should be removed.
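The fix is a pure deletion. For illustration, a minimal edition-3 test file, with the deprecated call shown as a comment (the test name and assertion are hypothetical, not taken from this package's suite):

```r
library(testthat)

# context("data")  # deprecated under edition 3 -- delete this line;
# the file name (e.g. test-data.R) now supplies the context automatically.

test_that("toy data check runs without context()", {
  # stand-in assertion for the package's real data validations
  expect_gt(nrow(data.frame(x = 1:3)), 0)
})
```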
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dlebauer left a comment
I've done a quick first review. On a future review I will go through all of the vignettes and explore the tables as they exist.
I am now wondering 1) whether we should store the data in CSV files to allow text-based version control, and 2) whether we can reconstruct traitsview on the fly from the component datasets (i.e., traitsview should not be in data-raw).
…re-install the package into the standard library before R CMD check runs
…y skip code execution if the package is unavailable (e.g. someone running R CMD check locally without quarto's library path fix), instead of crashing with an error
For v0.1.0, I kept the conventional approach: CSVs in data-raw/csv/. For text-based change tracking, one option is to start version controlling those CSVs; happy to implement this if you prefer it for v0.2.0.
Currently no -- the core trait/yield records (mean, n, stat, checked, etc.) exist only within the denormalized traitsview export. To reconstruct traitsview on the fly, we would need to also export the raw traits and yields tables from BETYdb (with their foreign keys), then join them to the dimension tables in R. For now, shipping the pre-built traitsview is the most practical approach. If we want to move toward a normalized structure in a future version, we could export traits and yields as separate tables and add a helper function to join them. Could be a good goal for v0.2.0.
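A rough sketch of what that future normalized approach could look like, with toy tables; the table and column names are illustrative, since the raw traits/yields schemas are not shipped in this PR:

```r
# Toy stand-ins for hypothetical normalized exports from BETYdb.
traits <- data.frame(id = 1:2, site_id = c(10L, 11L),
                     mean = c(1.5, 2.0), n = c(3L, 5L))
sites  <- data.frame(id = c(10L, 11L), sitename = c("Site A", "Site B"))

# Rebuild a denormalized view by joining fact rows to a dimension table.
traitsview_rebuilt <- merge(traits, sites,
                            by.x = "site_id", by.y = "id")
```

A helper function wrapping joins like this against all 15 dimension tables is the kind of thing the v0.2.0 idea above would entail.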
Heads-up: https://github.com/PecanProject/betydata/actions/runs/22558371999/job/65339932874?pr=12 Just drafting a note on Windows CI: the Windows R CMD check was failing with "there is no package called 'betydata'" during vignette rendering. This is a known quarto vignette engine issue (tracked at quarto-dev issue -- #217): quarto spawns a separate R subprocess that doesn't inherit the temporary library path used during R CMD check, so library(betydata) fails in that subprocess. Workaround applied:
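A minimal sketch of the first workaround as it might appear in a CI step; the tarball name is hypothetical, and the actual change lives in the .github/workflows/ YAML:

```shell
# Install the built package into the default library first, so the
# separate R subprocess that quarto spawns can find library(betydata).
R CMD INSTALL betydata_0.1.0.tar.gz
# Then run the check; the vignette subprocess no longer depends on the
# temporary check library to resolve the package.
R CMD check --no-manual betydata_0.1.0.tar.gz
```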
Pull request overview
Copilot reviewed 43 out of 76 changed files in this pull request and generated 7 comments.
data-raw/make-data.R
Outdated
df <- get(nm)
base <- list(
  name = nm,
  path = paste0("data/", nm, ".rda"),
In datapackage.json, resource path values are generated as data/<name>.rda, but the JSON file is located under inst/metadata/. Relative paths like data/traitsview.rda won’t resolve from that directory (they would point to inst/metadata/data/...). Consider either (a) generating paths relative to inst/metadata (e.g., ../data/<name>.rda), or (b) moving datapackage.json to the package root / a location where data/ is a sibling per the Frictionless spec.
- path = paste0("data/", nm, ".rda"),
+ path = paste0("../data/", nm, ".rda"),
Should we use datapackage.json to document the CSV files in data-raw? Those are the files that will be edited, etc.
Good call -- updated datapackage.json to document the CSV source files in data-raw/csv/ instead of the .rda files.
data-raw/make-data.R
Outdated
# Filter out checked = -1
traitsview <- traitsview[is.na(traitsview$checked) | traitsview$checked != -1, ]

# Drop access_level column (all records are public, access_level = 4)
The script drops access_level with the assumption that the input CSV is already filtered to public records, but it never enforces or asserts this. To prevent accidental inclusion of non-public data, filter traitsview to the intended access level(s) (e.g., access_level == 4) or add a hard check that fails the build if any non-public access_level values are present before removing the column.
- # Drop access_level column (all records are public, access_level = 4)
+ # Enforce public-only records: keep only access_level = 4, then drop the column
+ traitsview <- traitsview[traitsview$access_level == 4, ]
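The hard-check alternative mentioned in the comment above could look like this sketch; the data frame here is a toy stand-in, since the real script operates on the data-raw CSV export:

```r
# Toy stand-in for the exported table.
traitsview <- data.frame(mean = c(1.2, 3.4), access_level = c(4L, 4L))

# Fail the build loudly if any record is not public (access_level 4),
# instead of silently assuming the export was pre-filtered.
stopifnot(!anyNA(traitsview$access_level),
          all(traitsview$access_level == 4))

# Only after the invariant holds is it safe to drop the column.
traitsview$access_level <- NULL
```

The advantage over silent filtering is that an unexpected non-public record stops the build rather than being quietly discarded.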
Let's summarize all data with access_level < 4. It is possible that there is relatively little, and that it can now be released (10+ years later).
Ran the summary: all 43,532 records in the CSV already have access_level = 4, with no records below that level; the export came from BETYdb's public traitsview, which filters to public records by default.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
Initial release of betydata, an R data package providing offline access to public data from BETYdb.
- traitsview (43,532 rows) + 15 reference tables
- Public records only (access_level = 4, checked >= 0)
Vignettes
- orientation: Package overview and data relationships
- sql-analogs: Migrate BETYdb SQL queries to dplyr
- pfts-priors: Working with PFTs and Bayesian priors
- manuscript: Reproduce LeBauer et al. (2018) analyses
Datasets
implements #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11
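As a flavor of the sql-analogs vignette's approach, here is one hypothetical SQL-to-dplyr translation on toy data; the column names are assumed for illustration, not the shipped schema:

```r
library(dplyr)

# SQL analog: SELECT genus, AVG(mean) AS avg_mean
#             FROM traitsview GROUP BY genus;
toy <- data.frame(genus = c("Panicum", "Panicum", "Miscanthus"),
                  mean  = c(1.0, 2.0, 3.0))
avg_by_genus <- toy |>
  group_by(genus) |>
  summarise(avg_mean = mean(mean), .groups = "drop")
avg_by_genus
```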