betydata R data package with BETYdb public data export by divine7022 · Pull Request #12 · PecanProject/betydata

divine7022 · 2026-02-11T20:24:15Z

Summary

Initial release of betydata, an R data package providing offline access to
public data from BETYdb

16 datasets: traitsview (43,532 rows) + 15 reference tables
Multiple formats: .rda (lazy-loaded), Parquet, Frictionless datapackage.json
Filtered to public data only (access_level = 4, checked >= 0)
Complete roxygen2 documentation for all datasets
Package-level documentation with BETYdb context
Data quality policy in README (checked column, access levels)

Vignettes

orientation: Package overview and data relationships
sql-analogs: Migrate BETYdb SQL queries to dplyr
pfts-priors: Working with PFTs and Bayesian priors
manuscript: Reproduce LeBauer et al. (2018) analyses

Datasets

Dataset	Description
traitsview	Primary trait/yield observations (43,532 × 36)
species	Plant taxonomy
sites	Research site locations
variables	Trait definitions and units
citations	Literature references
pfts	Plant functional types
priors	Bayesian prior distributions
+ 9 more	Support and relationship tables

implements #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11

Copilot

Pull request overview

This PR delivers the initial release (v0.1.0) of betydata, an R data package providing offline access to public data from the BETYdb (Biofuel Ecophysiological Traits and Yields) database. The package enables reproducible analyses of plant traits and crop yields without requiring database connectivity.

Changes:

Complete R package structure with 16 datasets (traitsview + 15 support tables) totaling 43,532+ trait and yield records
Multiple data formats: lazy-loaded .rda files, Parquet alternatives, and Frictionless metadata (datapackage.json)
Comprehensive documentation: roxygen2 docs for all datasets, 4 vignettes (orientation, sql-analogs, pfts-priors, manuscript), and GitHub issue templates
Quality controls: excludes checked=-1 records, public data only (access_level = 4), full test coverage
CI/CD infrastructure: GitHub Actions R-CMD-check workflow, testthat 3.0 test suite

Reviewed changes

Copilot reviewed 38 out of 71 changed files in this pull request and generated 6 comments.

File	Description
DESCRIPTION	Package metadata and dependencies; minor email format issue
CITATION.cff	Citation metadata; email and missing preferred-citation issues
LICENSE	BSD-3-Clause license file
README.md	Comprehensive package documentation; table formatting issue
NEWS.md	Release notes documenting v0.1.0
R/betydata-package.R	Package-level documentation
R/data.R	Roxygen2 documentation for all 16 datasets
man/*.Rd	Generated documentation files for datasets
vignettes/*.Rmd	Four tutorial vignettes; minor issues in manuscript.Rmd and pfts-priors.Rmd
tests/testthat/*.R	Test suite for data and metadata validation; deprecated context() calls
data-raw/make-data.R	Data build script for generating .rda and Parquet files
inst/metadata/datapackage.json	Frictionless Data package metadata
inst/extdata/parquet/*.parquet	Sample Parquet data files
data/*.rda	Binary R data files (compressed with xz)
.github/workflows/*.yaml	GitHub Actions CI configuration
.github/ISSUE_TEMPLATE/*.md	Issue templates for data corrections and verifications
.gitignore, .Rbuildignore	Build and version control configuration; CSV exclusion concern

Comments suppressed due to low confidence (2)

tests/testthat/test-metadata.R:3

The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.

tests/testthat/test-data.R:3

The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.

vignettes/manuscript.Rmd

CITATION.cff

DESCRIPTION

README.md

vignettes/pfts-priors.Rmd

dlebauer

I've done a quick first review. On a future review I will go through all of the vignettes and explore the tables as they exist.

I am now wondering if we should 1) store the data in CSV files to allow text-based version control and 2) if we can reconstruct traitsview on the fly from the component datasets (i.e. traitsview should not be in data_raw)

inst/metadata/datapackage.json

vignettes/orientation.Rmd

README.md

divine7022 · 2026-03-02T03:31:35Z

store the data in CSV files to allow text-based version control

For v0.1.0, I kept the conventional approach: CSVs in data-raw/csv/ (build source, gitignored) and .rda in data/ (shipped format). This matches R data package conventions and keeps the repo size manageable.

For text-based change tracking, one option is to start version controlling data-raw/csv/ (remove it from .gitignore). This would give diffable change visibility without breaking R package conventions. The .rda files would still be the shipped format for LazyData.

Happy to implement this if you prefer it for v0.2.0.
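
A minimal sketch of how version-controlled CSVs could be kept diff-friendly, assuming each table has an id column to sort on. The helper write_stable_csv() and the toy table are hypothetical and are not part of make-data.R:

```r
# Hypothetical helper: write a table with a stable row order so that
# re-exports produce small, readable diffs under version control.
write_stable_csv <- function(df, path, key = "id") {
  df <- df[order(df[[key]]), , drop = FALSE]
  write.csv(df, path, row.names = FALSE)
  invisible(path)
}

# Toy stand-in for a reference table (not real BETYdb data)
species_demo <- data.frame(
  id = c(3L, 1L, 2L),
  scientificname = c("Panicum virgatum", "Miscanthus giganteus", "Zea mays"),
  stringsAsFactors = FALSE
)

csv_path <- file.path(tempdir(), "species_demo.csv")
write_stable_csv(species_demo, csv_path)

# Read it back to confirm the rows were written in key order
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Sorting on a key before writing means a re-export that only changes a few records touches only those lines in the diff, which is the main benefit of tracking the CSVs as text.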

if we can reconstruct traitsview on the fly from the component datasets

Currently no: the core trait/yield records (mean, n, stat, checked, etc.) exist only within the denormalized traitsview.csv. The support tables (species, sites, citations, ...) are reference/lookup tables and don't contain the measurements themselves.

To reconstruct traitsview on the fly, we would need to also export the raw traits and yields tables from BETYdb (with their foreign keys), then join them to the dimension tables in R.

For now, shipping the pre-built traitsview is the most practical approach. If we want to move toward a normalized structure in a future version, we could export traits and yields as separate tables and add a helper function to join them. Could be a good goal for v0.2.0.
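
A rough sketch of what such a join helper might look like, using base R merge(). The tables here are toy stand-ins, and the foreign-key names (species_id, site_id) and the helper build_traitsview() are assumptions for illustration; the real BETYdb export would determine the actual columns:

```r
# Toy stand-ins for a normalized export (not real BETYdb data).
# species_id and site_id are assumed foreign keys.
traits_demo <- data.frame(
  id = 1:3,
  species_id = c(10L, 11L, 10L),
  site_id = c(100L, 100L, 101L),
  mean = c(12.5, 8.1, 14.0),
  n = c(4L, 3L, 5L)
)
species_lookup <- data.frame(
  id = c(10L, 11L),
  scientificname = c("Panicum virgatum", "Zea mays"),
  stringsAsFactors = FALSE
)
sites_lookup <- data.frame(
  id = c(100L, 101L),
  sitename = c("Site A", "Site B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: left-join the measurement table to its lookup tables
build_traitsview <- function(traits, species, sites) {
  out <- merge(traits, species, by.x = "species_id", by.y = "id", all.x = TRUE)
  out <- merge(out, sites, by.x = "site_id", by.y = "id", all.x = TRUE)
  out[order(out$id), , drop = FALSE]
}

traitsview_joined <- build_traitsview(traits_demo, species_lookup, sites_lookup)
```

Using all.x = TRUE keeps every measurement even when a lookup row is missing, so a broken foreign key shows up as NA rather than as a silently dropped record.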

divine7022 · 2026-03-02T03:39:40Z

Heads-up: https://github.com/PecanProject/betydata/actions/runs/22558371999/job/65339932874?pr=12

A note on Windows CI: the Windows R CMD check was failing with "there is no package called 'betydata'" during vignette rendering. This is a known quarto vignette engine issue (tracked at quarto-dev issue #217): quarto spawns a separate R subprocess that doesn't inherit the temporary library path used during R CMD check, so library(betydata) fails in that subprocess.

Workaround applied:

Added local::. to extra-packages in R-CMD-check.yaml to pre-install the package before the check runs
Added a requireNamespace("betydata") eval guard in each vignette as a fallback
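
For reference, the eval guard could look roughly like the following in a hidden setup chunk. This is a sketch of the general pattern, assuming the vignettes are rendered through knitr; the variable name has_betydata is illustrative and the exact chunk in the vignettes may differ:

```r
# Intended for a hidden setup chunk at the top of a vignette.
# TRUE only if the package can be loaded from the current library paths.
has_betydata <- requireNamespace("betydata", quietly = TRUE)

# If knitr is available, switch off evaluation of all later chunks when
# the package is missing, so rendering skips code instead of erroring.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(eval = has_betydata)
}
```

With this in place the vignette text still renders when the package is unavailable; only the code output is omitted.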

dlebauer · 2026-03-06T21:01:40Z

In the current state, this PR will close:

It does not close:

Remaining work to close #1, #2, #3, #11. Note: it is okay to either a) leave these open or b) create one or more new issues to track remaining work.

inst/STYLE_GUIDE.md (Scaffold betydata R package (usethis + renv + CI) #1)
remove parquet and commit CSV files; update make-data.R script as needed (Build MVP datasets: CSV sources in data-raw → exported data objects + Parquet #2)
- tracked in Put data-raw/csv/ under version control and include traitsview as the primary table #13
update README to tell users to file issues to report data errors and verification. (Lightweight data policy: checked defaults + correction workflow #3)
- could also add PR template for corrections
citations (Release + citation: GitHub release + Zenodo DOI + CITATION.cff #11)
- add inst/CITATION
- citation('betydata') should print valid citation
- add doi (will require post merge release)
update NEWS.md

Copilot

Pull request overview

Copilot reviewed 43 out of 76 changed files in this pull request and generated 7 comments.

data-raw/make-data.R

Copilot · 2026-03-06T21:08:07Z

data-raw/make-data.R

+  df <- get(nm)
+  base <- list(
+    name = nm,
+    path = paste0("data/", nm, ".rda"),


In datapackage.json, resource path values are generated as data/<name>.rda, but the JSON file is located under inst/metadata/. Relative paths like data/traitsview.rda won’t resolve from that directory (they would point to inst/metadata/data/...). Consider either (a) generating paths relative to inst/metadata (e.g., ../data/<name>.rda), or (b) moving datapackage.json to the package root / a location where data/ is a sibling per the Frictionless spec.

Suggested change

-    path = paste0("data/", nm, ".rda"),
+    path = paste0("../data/", nm, ".rda"),

dlebauer · 2026-03-07

Should we use datapackage.json to document the CSV files in data-raw? Those are the files that will be edited.

divine7022 · 2026-03-07

Good call. Updated datapackage.json to document the CSV source files in data-raw/csv/ instead of the .rda files.
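
A sketch of how resource entries pointing at the CSV sources might be generated. The helper make_resource(), the three-table subset, and the assumption that paths are resolved from the package root are all illustrative; the Data Package spec resolves path relative to the descriptor and, as far as I know, disallows ../ parent paths, so the real layout depends on where datapackage.json ends up:

```r
# Hypothetical helper: describe one CSV source file as a Data Package resource.
# csv_dir is assumed to be reachable from the descriptor without "../".
make_resource <- function(nm, csv_dir = "data-raw/csv") {
  list(
    name = nm,
    path = paste0(csv_dir, "/", nm, ".csv"),
    format = "csv",
    mediatype = "text/csv"
  )
}

# Subset of table names, for illustration only
table_names <- c("traitsview", "species", "sites")

descriptor <- list(
  name = "betydata",
  resources = lapply(table_names, make_resource)
)

# Serialize only if jsonlite is installed; auto_unbox writes scalars
# as JSON strings rather than length-one arrays.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  descriptor_json <- jsonlite::toJSON(descriptor, auto_unbox = TRUE, pretty = TRUE)
}
```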

.github/workflows/quarto-publish.yml

index.qmd

NEWS.md

DESCRIPTION

Copilot · 2026-03-06T21:08:08Z

data-raw/make-data.R

+# Filter out checked = -1
+traitsview <- traitsview[is.na(traitsview$checked) | traitsview$checked != -1, ]
+
+# Drop access_level column (all records are public, access_level = 4)


The script drops access_level with the assumption that the input CSV is already filtered to public records, but it never enforces or asserts this. To prevent accidental inclusion of non-public data, filter traitsview to the intended access level(s) (e.g., access_level == 4) or add a hard check that fails the build if any non-public access_level values are present before removing the column.

Suggested change

- # Drop access_level column (all records are public, access_level = 4)
+ # Enforce public-only records: keep only access_level = 4, then drop the column
+ traitsview <- traitsview[traitsview$access_level == 4, ]

dlebauer · 2026-03-07

Let's summarize all data with access_level < 4. It is possible that there is relatively little, and that it can now be released (10+ years later).

divine7022 · 2026-03-07

Ran the summary: all 43,532 records in the CSV already have access_level = 4, and there are no records with access_level < 4. The export came from BETYdb's public traitsview, which filters to public records by default.
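
The "hard check" option from the review comment, combined with the checked handling discussed in this PR, could be sketched as below. The toy data frame and the helper prepare_public() are hypothetical; this is not the code in make-data.R:

```r
# Toy stand-in for the exported CSV (not real BETYdb data)
traitsview_raw <- data.frame(
  id = 1:5,
  mean = c(1.2, 3.4, 5.6, 7.8, 9.0),
  checked = c(1L, 0L, NA, -1L, 0L),
  access_level = c(4L, 4L, 4L, 4L, 4L)
)

# Hypothetical helper mirroring the steps discussed in this thread
prepare_public <- function(df) {
  # Hard check: abort the build if any access_level is missing or not 4
  if (anyNA(df$access_level) || any(df$access_level != 4)) {
    stop("non-public or missing access_level values found; refusing to build")
  }
  # Treat a missing `checked` value as 0, then drop records with checked = -1
  df$checked[is.na(df$checked)] <- 0L
  df <- df[df$checked != -1, , drop = FALSE]
  # Safe to drop the column now that every row is known to be public
  df$access_level <- NULL
  df
}

traitsview_demo <- prepare_public(traitsview_raw)
```

Failing the build, rather than silently filtering, makes an unexpected non-public row visible at export time, which fits the case here where the input is expected to be public-only already.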

divine7022 added 18 commits February 11, 2026 14:22

update DESCRIPTION

1866c45

add .Rbuildignore

ff1485f

add git workflows

4c1bada

ad gitignore

c8d41b5

add citation

86bee56

add NEWS.md

dce94e3

add dataset doc

0a1b59b

add readme

164df97

add csv data

6d69f73

add .rda files

b3f17cb

add inst/

f632581

add .Rd files

29e0fd0

add unit tests

41362a5

add vignettes

cdd8d18

add .html and csv to ignore

69ce06a

update r cmd check

afa3163

add PULL_REQUEST_TEMPLATE.md

dbdf052

add maps to DESCRIPTION

f4444ac

divine7022 requested a review from dlebauer February 11, 2026 21:13

divine7022 added 5 commits February 11, 2026 16:48

add license

b6f9ada

udpate .Rd

11bc4ec

add citation to .Rbuildignore

25653dc

add jsonlit to DESCRIPTION

3faa298

update roxy

0265a5c

dlebauer requested a review from Copilot February 19, 2026 18:42

Copilot started reviewing on behalf of dlebauer February 19, 2026 18:42

Copilot AI reviewed Feb 19, 2026

clarify citation

4e0d685

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

dlebauer requested changes Feb 21, 2026

update build and git ignore rules

cf2a620

divine7022 added 2 commits March 1, 2026 21:50

fix know quarto-dev upstream bug #217 on windows; added local::. to p…

dfef334

…re-install the package into the standard library before R CMD check runs

added a hidden setup chunk at the top; this makes vignettes gracefull…

a4fd3f4

…y skip code execution if the package is unavailable (e.g. someone running R CMD check locally without quarto's library path fix), instead of crashing with an error

dlebauer mentioned this pull request Mar 6, 2026

Put data-raw/csv/ under version control and include traitsview as the primary table #13

Open

dlebauer requested a review from Copilot March 6, 2026 21:01

Copilot started reviewing on behalf of dlebauer March 6, 2026 21:02

Copilot AI reviewed Mar 6, 2026

dlebauer and others added 14 commits March 6, 2026 17:03

require dplyr >= 1.1.0

eadd8d1

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

convert checked=NA to 0 and enforce access_level==4 filter

eb62be5

remove arrow from description suggests

aa65e72

update description

179678d

add CITATION

b07b138

add preferred-citation block, DOI comment note

17660bc

documenting package conventions

fb95640

add data reporting section to readme

0730d10

PR template for corrections

701156f

update NEWS with correct vignette names and all changes

3968994

fix formatting

1256fc8

fix dependencies

c500a62

update resource paths to reference csv source files

e59900f

add csv files

6fc317f

