Add Synthetic EHR Generation Support -- GPT Baseline #879
Open
ethanrasmussen wants to merge 24 commits into sunlabuiuc:master from
Overview
This PR introduces comprehensive synthetic EHR generation capabilities to PyHealth, enabling researchers to train generative models that create realistic synthetic patient histories. The implementation follows PyHealth conventions and provides a complete pipeline from data processing to model training and evaluation.
Changes
Core Functionality
1. New Task: SyntheticEHRGenerationMIMIC3 and SyntheticEHRGenerationMIMIC4
File: pyhealth/tasks/synthetic_ehr_generation.py
- Based on BaseTask; prepares patient visit sequences for generative modeling
- Visit-count bounds (min_visits, max_visits)
2. New Model: TransformerEHRGenerator
File: pyhealth/models/synthetic_ehr.py
3. Utility Module: synthetic_ehr_utils
File: pyhealth/synthetic_ehr_utils/synthetic_ehr_utils.py
Provides data conversion utilities for working with different EHR representations:
Core Functions:
- tabular_to_sequences(): Converts long-form DataFrames to text sequences
- sequences_to_tabular(): Converts text sequences back to DataFrames
- nested_codes_to_sequences(): Converts PyHealth nested structure to text
- sequences_to_nested_codes(): Converts text sequences to nested structure
- create_flattened_representation(): Creates patient-level count matrices
- process_mimic_for_generation(): End-to-end MIMIC data processing
Example Scripts
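As a small illustration of the round trip that tabular_to_sequences() and sequences_to_tabular() describe, here is a minimal standalone sketch. It does not call the PR's functions, whose signatures are not shown here. The helper names rows_to_sequences and sequences_to_rows, the VISIT_DELIM token, and the sample codes are all hypothetical, and plain tuples stand in for DataFrame rows.

```python
from collections import defaultdict

# Hypothetical visit delimiter; the PR's actual text format is not shown here.
VISIT_DELIM = "VISIT_DELIM"


def rows_to_sequences(rows):
    """Group long-form (patient_id, visit_id, code) rows into one text
    sequence per patient.

    Visits keep first-appearance order; codes keep row order within a visit.
    """
    visits = defaultdict(dict)  # patient_id -> {visit_id: [codes]}
    for patient_id, visit_id, code in rows:
        visits[patient_id].setdefault(visit_id, []).append(code)
    return {
        pid: f" {VISIT_DELIM} ".join(" ".join(codes) for codes in by_visit.values())
        for pid, by_visit in visits.items()
    }


def sequences_to_rows(sequences):
    """Invert rows_to_sequences, numbering visits 0, 1, 2, ... per patient."""
    rows = []
    for pid, text in sequences.items():
        for visit_idx, chunk in enumerate(text.split(VISIT_DELIM)):
            for code in chunk.split():
                rows.append((pid, visit_idx, code))
    return rows


# Illustrative sample data only (not taken from MIMIC).
rows = [
    ("p1", 0, "4019"),
    ("p1", 0, "25000"),
    ("p1", 1, "4280"),
    ("p2", 0, "5849"),
]
sequences = rows_to_sequences(rows)
print(sequences["p1"])
```

The round trip is lossless in this sketch only because visit IDs are already consecutive integers starting at 0. With real admission IDs, the inverse would have to assign new visit indices.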
4. Baseline Models Script
File: examples/synthetic_ehr_generation/synthetic_ehr_baselines.py
Demonstrates integration with popular generative model baselines:
Supported Models:
Features:
5. Transformer Example
File: examples/synthetic_ehr_generation/synthetic_ehr_mimic3_transformer.py
Complete end-to-end example demonstrating:
- SyntheticEHRGenerationMIMIC3 task
- TransformerEHRGenerator model
Configurable Parameters:
Integration Updates
6. Module Exports
Files Modified:
- pyhealth/models/__init__.py: Added TransformerEHRGenerator import
- pyhealth/tasks/__init__.py: Added SyntheticEHRGenerationMIMIC3 and SyntheticEHRGenerationMIMIC4 imports
- pyhealth/synthetic_ehr_utils/__init__.py: Added utility function exports
Examples
Training a Transformer Model
    python synthetic_ehr_baselines.py \
        --mimic_root /path/to/mimic3 \
        --output_dir ./output \
        --mode transformer_baseline \
        --epochs 50 \
        --batch_size 64
Generating Synthetic Data
Dependencies
New optional dependencies for baseline models:
- be-great: For GReaT model support
- sdv: For CTGAN and TVAE model support
Core PyHealth dependencies remain unchanged.
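Because these packages are optional, a script might check which baselines are usable before running them. The sketch below is one possible approach, not the PR's code. The helper available_backends and the OPTIONAL_BACKENDS mapping are hypothetical, and the import names (be_great for be-great, sdv for sdv) are assumptions.

```python
import importlib.util

# Import names are assumptions: be-great is assumed to import as `be_great`,
# and sdv as `sdv`.
OPTIONAL_BACKENDS = {
    "GReaT": "be_great",
    "CTGAN/TVAE": "sdv",
}


def available_backends(backends=OPTIONAL_BACKENDS):
    """Return the names of baseline backends whose top-level package is installed."""
    return [
        name
        for name, module in backends.items()
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    usable = available_backends()
    print("Usable optional baselines:", usable or "none installed")
```

Checking with importlib.util.find_spec avoids importing the heavy packages just to see whether they are present.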
Breaking Changes
None. This PR only adds new functionality.