# aplusml

**Repository Path**: supresong/aplusml

## Basic Information

- **Project Name**: aplusml
- **Description**: 1111111111111111111
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-10-06
- **Last Updated**: 2023-10-06

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# APLUS ML

**A** **P**ython **L**ibrary for **U**sefulness **S**imulations of **M**achine **L**earning Models

----

![Graphical Abstract](assets/graphical%20abstract.png)

Corresponding paper: [APLUS - Journal of Biomedical Informatics](https://www.sciencedirect.com/science/article/pii/S1532046423000400?via%3Dihub)

Citation:

```
@article{wornow2023aplus,
    title={APLUS: A Python Library for Usefulness Simulations of Machine Learning Models in Healthcare},
    author={Wornow, Michael and Ross, Elsie Gyang and Callahan, Alison and Shah, Nigam H},
    journal={Journal of Biomedical Informatics},
    pages={104319},
    year={2023},
    publisher={Elsevier}
}
```

## Installation

1. Run the following command to install **APLUS ML**:

```bash
pip install aplusml
```

2. Install **graphviz** by [downloading it here](https://graphviz.org/download/). If you're on Mac with `homebrew`, simply run:

```
brew install graphviz
```

## Motivation

APLUS ML is a simulation framework for conducting usefulness assessments of machine learning models in workflows.

It aims to quantitatively answer the question: *If I use this ML model within this workflow, will the benefits outweigh the costs, and by how much?*

APLUS was originally developed for clinical workflows in healthcare settings, so all of our examples are healthcare workflows. However, APLUS ML is broadly applicable to any workflow that involves a machine learning model making decisions on a stream of datapoints, and we encourage contributors from any domain to use and extend APLUS ML.
## Tutorials

We showcase APLUS on two clinical workflows:

1. Early detection of peripheral artery disease (PAD)
2. Triaging patients for advanced care planning (ACP) consults

Jupyter notebooks for these use cases can be found in the `tutorials/` folder.

### Early Detection of PAD

The code used to generate the figures in our paper is located in the `tutorials/` directory in `pad.ipynb`. This notebook loads de-identified patient data from Stanford Hospital, which can be provided upon request. The workflows analyzed can be found in the `workflows/` folder. The doctor-driven workflow is in `pad_doctor.yaml`, while the nurse-driven workflow is in `pad_nurse.yaml`.

The `tutorials/pad.ipynb` notebook was used to generate the following figures from the APLUS paper:

![PAD Figure 1](assets/pad%20figure%201.png)

![PAD Figure 2](assets/pad%20heatmap.png)

### Triaging Patients for ACP Consults

The code used to replicate the findings of [Jung et al. 2021](https://pubmed.ncbi.nlm.nih.gov/33355350/) can be found in the `tutorials/` directory in `acp_jung_replication.ipynb`. This notebook loads de-identified patient data from Stanford Hospital, which can be provided upon request. The workflow analyzed can be found in the `workflows/` folder in `acp_jung_replication.yaml`.

![ACP Figure](assets/acp%20figure.png)

## Plot Gallery

Some additional example plots that can be generated by APLUS are included below:

![Additional Plots](assets/additional%20plots.png)

## Conceptual Overview

### Simulation

We use discrete event simulation to simulate our workflow $W$. In other words, we represent the world as occurring through a set of discrete, evenly spaced timesteps $\lambda = 0, 1,...,N$.

Each timestep $\lambda$ could represent a second, minute, hour, day, etc.; the interpretation is up to the user.

Events $A$ and $B$ that occur within the same timestep $\lambda$ can have arbitrary ordering if there does not exist a strict $A \rightarrow B$ or $B \rightarrow A$ dependency between these events. For example, if 3 patients have an MRI and 2 patients have a blood test on timestep $\lambda = 3$, then assuming none of these events are dependent on each other, the order in which the blood tests and MRIs occur will be random.

A **"duration"** refers to a number of timesteps (i.e. a length of time).

### Workflow

A workflow $W$ is simply a set of states $S$.

### States

Each state $s \in S$ has associated with it:

1. A duration $\lambda_s$ representing how many timesteps an agent will wait in this state before transitioning to another state
2. A set of utilities $U_s$
3. A set of resource deltas $R_s \subseteq R$ that specify how various resources $r \in R_s$ change when an agent arrives at this state
4. A set of transitions $T_s \subseteq T$
5. A type $\tau_s \in \{\text{start}, \text{normal}, \text{end}\}$

Invariants:

* $|\{ s \in S \mid \tau_s = \text{start} \}| = 1$
* $\forall s \in S$ such that $\tau_s = \text{start}, |T_s| > 0$
* $\forall s \in S$ such that $\tau_s = \text{normal}, |T_s| > 0$
* $\forall s \in S$ such that $\tau_s = \text{end}, |T_s| = 0$

### Transitions

Given the set of all transitions $T$, each transition $t \in T_s \subseteq T$ has associated with it:

1. A source state $s \in S$
2. A destination state $s' \in S$ (where $s'$ could be the same as $s$)
3. A duration $\lambda_t$ representing how many timesteps an agent will wait, after having chosen this transition $t$, before moving to state $s'$
4. A condition $c_t \in C$ that, only when TRUE, allows the agent to take this transition $t$ to state $s'$
5. A set of utilities $U_t$
6. A set of resource deltas $R_t \subseteq R$ that specify how various resources $r \in R_t$ change after an agent takes this transition

### Utilities

Given the set of all utilities $U$, each utility $u \in U$ has associated with it:

1. A value $u_v \in \mathbb{R}$ representing the numeric value of this utility
2. A unit $u_u$ (e.g. QALYs, US dollars, years, etc.)
3. A condition $c_u \in C$ that, only when TRUE, has the simulation record that this utility value $u_v$ for unit $u_u$ was achieved

### Conditions

A condition $c \in C$ determines whether a utility is recorded or a transition can be taken.

A condition $c$ can take the form of either:

1. A probability (in which case $c \in \mathbb{R}$ with $0 \le c \le 1$); OR
2. An arbitrary Python expression which evaluates to TRUE or FALSE

### Resources

A resource $r \in R$ is a constrained resource that is shared across all patients. This represents a hospital-level constraint of a workflow (e.g. fiscal budget, number of nurses, MRI machine availability, etc.).

Each resource $r$ has associated with it:

1. A level $r_l \in \mathbb{N}$ which represents the current value of the resource
2. An initial amount $r_i \in \mathbb{N}$ which ensures $r_l = r_i$ when $\lambda = 0$
3. A maximum capacity $r_m \in \mathbb{N}$ which ensures that $r_l \le r_m$
4. A refill amount $r_a \in \mathbb{N}$ that represents how much this resource gets increased after $\lambda_r$ timesteps have elapsed since the last refill
5. A refill duration $\lambda_r \in \mathbb{N}$ that represents how many timesteps must elapse before the resource is increased to a value of $\min(r_l + r_a, r_m)$

**!! Important Note !!** In order to decrement a resource, you need to specify a **resource delta** on the relevant state/transition. Otherwise, if you just require that `nurse_capacity > 0` for a transition, then the simulation will not automatically decrement `nurse_capacity` by 1 when that transition is taken (which can be surprising to some users). This is often a cause of infinite loops, or situations where changing the $r_i$, $r_m$, or $r_a$ of a resource has no effect on the model's achieved utility.

### Patients

Each patient $p \in P$ has associated with it:

1. A start timestep $\lambda_p$ representing the timestep of the simulation at which the patient began progressing through the workflow (e.g. the day that the patient was admitted to the hospital)
2. A current state $s_p \in S$. The patient always starts at a state $s_p$ where $\tau_{s_p} = \text{start}$
3. A set of **properties** $\Rho_p$ which can be anything (integers, floats, strings, dictionaries, lists, etc.)
4. A **history** object $H_p$ which captures all of the past states, transitions, and utilities that the patient achieved

### Running a Simulation

Each patient $p$ starts their workflow at the state $s$, where $\tau_s = \text{start}$. Note: This is the same for all patients.

Each patient $p$ starts their workflow at timestep $\lambda_p$. Note: This varies across patients.

Each patient $p$ then progresses through the states of the workflow, according to the applicable transitions and conditions.

The patient stops their journey when either of the following conditions is met:

* The patient reaches a state $s$ where $\tau_s = \text{end}$; OR
* The simulation is terminated prematurely after a set number of timesteps have occurred

## Workflow Schema

APLUS requires you to specify the workflow that you want to simulate within a YAML file.
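To make the pieces from the Conceptual Overview concrete, here is a minimal sketch of a toy workflow written as the Python dictionary that a YAML parser would produce, together with a check of the state invariants listed under States. Everything in it is an assumption made for illustration: the state names, the `nurse_capacity` resource values, the `usd` unit, and the helper `check_state_invariants` are hypothetical and are not part of the APLUS API. The transition out of `admit` pairs its `if` condition with an explicit resource delta, since the condition alone does not decrement the resource.

```python
# Hypothetical workflow, written as the dict a YAML parser would produce.
# All state, variable, and unit names are invented for illustration.
workflow = {
    "metadata": {"name": "toy_triage"},
    "variables": {
        "nurse_capacity": {
            "type": "resource",
            "init_amount": 2,
            "max_amount": 2,
            "refill_amount": 2,
            "refill_duration": 1,
        },
    },
    "states": {
        "admit": {
            "type": "start",
            "transitions": [
                {
                    "dest": "nurse_visit",
                    "if": "nurse_capacity > 0",
                    # The condition alone does not consume a nurse;
                    # the explicit delta below does.
                    "resource_deltas": {"nurse_capacity": -1},
                },
                {"dest": "discharge"},  # fallback when no nurse is free
            ],
        },
        "nurse_visit": {
            "duration": 1,
            "utilities": [{"value": 10.0, "unit": "usd"}],
            "transitions": [{"dest": "discharge"}],
        },
        "discharge": {"type": "end"},
    },
}


def check_state_invariants(states):
    """Return a list of violations of the invariants from the States section."""
    problems = []
    # A missing 'type' is treated as a non-start, non-end state.
    types = {sid: s.get("type", "intermediate") for sid, s in states.items()}
    n_start = sum(1 for t in types.values() if t == "start")
    if n_start != 1:
        problems.append(f"expected exactly 1 start state, found {n_start}")
    for sid, state in states.items():
        transitions = state.get("transitions", [])
        if types[sid] == "end" and transitions:
            problems.append(f"end state '{sid}' must have no transitions")
        if types[sid] != "end" and not transitions:
            problems.append(f"state '{sid}' needs at least one transition")
        for t in transitions:
            if t["dest"] not in states:
                problems.append(f"'{sid}' points to unknown state '{t['dest']}'")
    return problems


print(check_state_invariants(workflow["states"]))  # prints []
```

The same structure, saved as YAML, is what the simulation would read; removing the `resource_deltas` entry from the first transition would leave `nurse_capacity` untouched no matter how many patients take it.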
The schema of this YAML configuration file is as follows:

```
(+) = optional
{a|b} = must be either string a or b

metadata:
    name (+): str
    path_to_functions (+): str
        => Path to PY file containing Python functions listed in 'variables'
    path_to_properties (+): str
        => Path to CSV file containing Patient properties listed in 'variables'
    properties_col_for_patient_id (+): str
        => Name of column in properties file corresponding to the Patient ID
    patient_sort_preference_property (+):
        variable: str
            => Name of property (must be listed in 'variables') to order patients by
        is_ascending: bool
            => If TRUE, then ascending; else descending
variables: dict
    [key]: str
        => Represents ID of state
        => Must be unique
    type: str{scalar|resource|property_dist|property_file|simulation|function} (default = "scalar")
        => 'scalar' = a constant
        => 'resource' = a hospital resource that fluctuates
        => 'property' = a per patient property (from a file or randomly sampled)
        => 'simulation' = tracked by simulation
            str{sim_current_timestep|time_left_in_sim|time_already_in_sim}
    % If type == 'scalar'...
    value: (int|float|bool|str|list|dict|set)
        NOTE: If you specify a 'set', then you need to prepend the set with the '!!set' tag
        NOTE: 'tuple' is not currently supported
    % If type == 'resource'...
    init_amount: int
    max_amount: int
    refill_amount: int
    refill_duration: int
    % If type == 'property'...
    % Either load from file...
    column: str
    % or constant...
    value: Any
    % or randomly sample...
    distribution: str{bernoulli|exponential|binomial|normal|poisson|uniform}
    mean (+): float
    std (+): float
    start (+): float
    end (+): float
states: dict
    [key]: str
        => Represents ID of state
        => Must be unique
    label (+): str (default = value of 'id')
    type (+): str{start|end|intermediate} (default = "intermediate")
    duration (+): float (default = 0.0)
        => Waiting this number of timesteps occurs AS SOON AS this state is hit (so BEFORE any transitions from it are calculated)
    utilities (+): str|float|bool|list[dict] (default = 0.0)
        => If not a list[dict], then the expression specified is evaluated as Python
        - value (+): float|str (default = 0.0)
            => If str, then assume it's a function
          if (+): str
            => String is a conditional expression
            => If TRUE, then set 'value' as utility for this 'unit'
            => NOTE - These 'if' statements are not mutually exclusive (i.e. multiple ones will simply be summed together)
          unit (+): str (default = "")
    resource_deltas (+): dict[float] (default = {})
        => [key] = resource from 'variables', [value] = how much to change each resource level AS SOON AS this state is hit (so BEFORE any transitions from it are calculated, but AFTER the duration has occurred)
    TODO: property_updates (+): dict[float] (default = {})
        => [key] = property from 'variables', [value] = new value of this property for patient AS SOON AS this state is hit
    transitions (+): list[dict] (default = [])
        dest: str
            => Must match ID of a state
        label (+): str (default = "")
        % Can either have...
        %   - All transitions have an 'if' condition (where if the last transition doesn't have an 'if', it defaults to always TRUE)
        %   - All transitions have a 'prob' condition (where if the last transition doesn't have a 'prob', it defaults to = 1 - (sum of other probs))
        %   - The first half of transitions have an 'if' condition, but the second half of transitions have a 'prob' (all 'if' transitions must have an 'if', the 'prob' are evaluated conditional on all 'if' being FALSE, and the last 'prob' transition defaults to = 1 - sum(other probs))
        if (+): bool|str (default = bool:true)
            => String is a conditional expression
            => Must come BEFORE 'prob' (if mixed)
            => Must always be at least one TRUE condition across all transitions for this state (unless mixed with 'prob')
            => Conditionals will be evaluated in order and break on first TRUE
            => If last 'if' isn't specified, defaults to TRUE
        prob (+): float|str (default = 1 - (sum of other probs))
            => String is a variable
            => Must come AFTER 'if' (if mixed)
            => If mixed, then 'prob' is conditional probability given all 'if' are FALSE
            => Must sum to 1 across all 'prob' transitions for this state
            => If last 'prob' isn't specified, defaults to = 1 - (sum of other probs)
        ================
        %
        duration (+): float (default = 0.0)
            => Waiting this number of timesteps occurs BEFORE this transition is taken
        utilities (+): str|float|bool|list[dict] (default = 0.0)
            => Same as for state
        resource_deltas (+): dict[float] (default = {})
            => [key] = resource from 'variables', [value] = how much to change each resource level by taking this transition (so AFTER any transitions from it are calculated)
```

## Development

### Installation

```bash
# Download repo
git clone https://github.com/som-shahlab/aplus.git
cd aplus

# Create environment
conda create -n aplus python=3.10 -y
conda activate aplus
pip3 install -e .
pip3 install -r requirements.txt
```

### Repository File Structure

Core APLUS module in `src/aplusml` (files listed in the order that they should be used):

1. `parse.py` - Functions to parse YAML files into Python objects for use in `sim.py`
1. `sim.py` - Core simulation engine which progresses patients through the desired clinical workflow
1. `run.py` - Wrapper functions around `sim.py` to help run/track multiple simulations
1. `plot.py` - Plotting functions
1. `models.py` - Classes for entities like patients, patient history, utilities, resources, etc.
1. `draw.py` - Functions to draw the workflow graph

Supporting files:

* `tutorials/` - Contains Jupyter notebooks that demonstrate how to use APLUS
    * `pad.ipynb` - Demonstrates how to use APLUS to simulate the novel PAD workflow described in the paper
    * `pad.py` - Helper functions for PAD-specific workflow analysis
    * `acp_jung_replication.ipynb` - Demonstrates how to use APLUS to replicate the plots of [Jung et al. 2021](https://pubmed.ncbi.nlm.nih.gov/33355350/)
* `workflows/` - Contains YAML files that define the workflows analyzed in the paper
    * `pad_doctor.yaml` - The doctor-driven PAD workflow
    * `pad_nurse.yaml` - The nurse-driven PAD workflow
    * `acp_jung_replication.yaml` - The exact same ACP workflow analyzed in [Jung et al. 2021](https://pubmed.ncbi.nlm.nih.gov/33355350/)
* `tests/` - Contains unit tests for the APLUS framework
    * `run_tests.py` - Script to run all unit tests
    * `test_*.py` - Tests for each module
    * `test*.yaml` - Workflow YAML files for each corresponding test
    * `utils.py` - Utility functions for testing
* `input/` - Contains input data fed into the simulation
* `output/` - Contains output data from the simulations (this is useful for caching results so you don't have to re-run time-consuming simulations)

### Tests

The file `tests/run_tests.py` runs all of the `test[d].py` files in the `tests/` directory. Each `test[d].py` file has a corresponding `test[d].yaml` file that serves as its input.

To run tests:

```
cd tests
python3 run_tests.py
```
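When adding a new test, it can help to confirm that every test script has its YAML input next to it. The following is a minimal sketch of such a check, assuming that the `[d]` in `test[d].py` stands for a number. The helper `pair_tests_with_workflows` is hypothetical; it is not part of APLUS and does not reflect how `run_tests.py` itself is implemented.

```python
import re
import tempfile
from pathlib import Path


def pair_tests_with_workflows(tests_dir):
    """Map each test<digits>.py file to its test<digits>.yaml input, or None if absent."""
    pairs = {}
    for py_file in sorted(Path(tests_dir).glob("test*.py")):
        # Skip helpers that do not follow the numbered pattern.
        if not re.fullmatch(r"test\d+", py_file.stem):
            continue
        yaml_file = py_file.with_suffix(".yaml")
        pairs[py_file.name] = yaml_file.name if yaml_file.exists() else None
    return pairs


# Demonstrate on a throwaway directory rather than the real tests/ folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("test1.py", "test1.yaml", "test2.py", "utils.py"):
        (Path(tmp) / name).touch()
    demo_pairs = pair_tests_with_workflows(tmp)

print(demo_pairs)  # {'test1.py': 'test1.yaml', 'test2.py': None}
```

A `None` value flags a test script whose YAML input is missing; pointing the function at the real `tests/` directory would give the same report for the repository.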