# pycox
**Repository Path**: houwei2022/pycox
## Basic Information
- **Project Name**: pycox
- **Description**: https://github.com/havakv/pycox.git
- **Primary Language**: Python
- **License**: BSD-2-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-06-09
- **Last Updated**: 2022-06-09
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
Time-to-event prediction with PyTorch
[Get Started](#get-started) • [Methods](#methods) • [Evaluation Criteria](#evaluation-criteria) • [Datasets](#datasets) • [Installation](#installation) • [References](#references)
**pycox** is a Python package for survival analysis and time-to-event prediction with [PyTorch](https://pytorch.org), built on the [torchtuples](https://github.com/havakv/torchtuples) package for training PyTorch models. An R version of this package is available at [survivalmodels](https://github.com/RaphaelS1/survivalmodels).
The package contains implementations of various [survival models](#methods), some useful [evaluation metrics](#evaluation-criteria), and a collection of [event-time datasets](#datasets).
In addition, some useful preprocessing tools are available in the `pycox.preprocessing` module.
# Get Started
To get started you first need to install [PyTorch](https://pytorch.org/get-started/locally/).
You can then install **pycox** via pip:
```sh
pip install pycox
```
Or via conda:
```sh
conda install -c conda-forge pycox
```
We recommend starting with [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb), which explains the general usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure.
The notebook uses the `LogisticHazard` method for illustration, but most of the principles generalize to the other methods.
Alternatively, there are many examples listed in the [examples folder](https://nbviewer.jupyter.org/github/havakv/pycox/tree/master/examples), or you can follow the tutorial based on the `LogisticHazard`:
- [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb): General usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure.
- [02_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/02_introduction.ipynb): Quantile-based discretization scheme, nested tuples with `tt.tuplefy`, entity embedding of categorical variables, and cyclical learning rates.
- [03_network_architectures.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/03_network_architectures.ipynb):
Extending the framework with custom networks and custom loss functions. The example combines an autoencoder with a survival network, and considers a loss that combines the autoencoder loss with the loss of the `LogisticHazard`.
- [04_mnist_dataloaders_cnn.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/04_mnist_dataloaders_cnn.ipynb):
Using dataloaders and convolutional networks for the MNIST data set. We repeat the [simulations](https://peerj.com/articles/6257/#p-41) of [\[8\]](#references) where each digit defines the scale parameter of an exponential distribution.
# Methods
The following methods are available in the `pycox.methods` module.
## Continuous-Time Models:

| Method | Description | Example |
| --- | --- | --- |
| CoxTime | Cox-Time is a relative risk model that extends Cox regression beyond the proportional hazards [1]. | notebook |
| CoxCC | Cox-CC is a proportional version of the Cox-Time model [1]. | notebook |
| CoxPH (DeepSurv) | CoxPH is a Cox proportional hazards model also referred to as DeepSurv [2]. | notebook |
| PCHazard | The Piecewise Constant Hazard (PC-Hazard) model [12] assumes that the continuous-time hazard function is constant in predefined intervals. It is similar to the Piecewise Exponential Models [11] and PEANN [14], but with a softplus activation instead of the exponential function. | notebook |
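
The piecewise constant hazard idea described for PC-Hazard can be sketched in plain Python. This is a minimal illustration, not pycox's implementation: `raw_outputs` stands in for hypothetical network outputs, `cuts` for the predefined interval boundaries, and the function names are made up for this example. A softplus keeps each hazard non-negative, and the survival function is the exponential of minus the cumulative hazard.

```python
import math

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pc_hazard_survival(raw_outputs, cuts, t):
    """Survival S(t) = exp(-H(t)) for a hazard that is constant within each interval.

    raw_outputs: unconstrained scores, one per interval (hypothetical network output).
    cuts: increasing interval boundaries, with len(cuts) == len(raw_outputs) + 1.
    """
    hazards = [softplus(r) for r in raw_outputs]
    cumulative_hazard = 0.0
    for h, start, end in zip(hazards, cuts[:-1], cuts[1:]):
        # Time spent inside [start, end) up to t, clipped to the interval length.
        exposure = min(max(t - start, 0.0), end - start)
        cumulative_hazard += h * exposure
    return math.exp(-cumulative_hazard)

# Illustrative values only (assumptions, not taken from any dataset).
cuts = [0.0, 1.0, 2.0, 4.0]
raw = [-1.0, 0.0, 0.5]
for t in (0.0, 1.5, 3.0):
    print(t, pc_hazard_survival(raw, cuts, t))
```

In this sketch the survival curve stays flat beyond the last cut point, because no hazard is defined there.
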
## Discrete-Time Models:

| Method | Description | Example |
| --- | --- | --- |
| LogisticHazard (Nnet-survival) | The Logistic-Hazard method parametrizes the discrete hazards and optimizes the survival likelihood [12] [7]. It is also called Partial Logistic Regression [13] and Nnet-survival [8]. | notebook |
| PMF | The PMF method parametrizes the probability mass function (PMF) and optimizes the survival likelihood [12]. It is the foundation of methods such as DeepHit and MTLR. | notebook |
| DeepHit, DeepHitSingle | DeepHit is a PMF method with a loss for improved ranking that can handle competing risks [3]. | single, competing |
| MTLR (N-MTLR) | The (Neural) Multi-Task Logistic Regression is a PMF method proposed by [9] and [10]. | notebook |
| BCESurv | A method representing a set of binary classifiers that remove individuals as they are censored [15]. The loss is the binary cross entropy of the survival estimates at a set of discrete times, with targets that are indicators of surviving each time. | bs_example |
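
The discrete-time methods above are linked by standard identities between the discrete hazard, the PMF, and the survival function. The sketch below shows those identities and a per-individual Logistic-Hazard negative log-likelihood in plain Python. It is an illustration under stated assumptions rather than pycox's code: the function names, the `logits` values, and the convention that `idx_duration` is the index of the interval containing the observed time are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def survival_from_hazards(hazards):
    """S(t_j) = prod_{k <= j} (1 - h_k)."""
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        surv.append(s)
    return surv

def pmf_from_hazards(hazards):
    """f(t_j) = h_j * S(t_{j-1}), with S(t_{-1}) = 1."""
    pmf, s_prev = [], 1.0
    for h in hazards:
        pmf.append(h * s_prev)
        s_prev *= 1.0 - h
    return pmf

def logistic_hazard_nll(logits, idx_duration, event):
    """Negative log-likelihood of one individual as a sum of binary cross-entropies.

    The target is 1 only at the event interval of an uncensored individual
    and 0 at every earlier interval (and at the last one if censored).
    """
    nll = 0.0
    for k in range(idx_duration + 1):
        h = sigmoid(logits[k])
        target = 1.0 if (k == idx_duration and event == 1) else 0.0
        nll -= target * math.log(h) + (1.0 - target) * math.log(1.0 - h)
    return nll

# Illustrative logits for four discrete time points (assumed values).
logits = [-2.0, -1.0, 0.0, 1.0]
hazards = [sigmoid(z) for z in logits]
surv = survival_from_hazards(hazards)
pmf = pmf_from_hazards(hazards)
print(surv)
print(pmf)
print(logistic_hazard_nll(logits, idx_duration=2, event=1))
```

For an observed event the loss equals minus the log of the PMF at the event time, and for a censored individual it equals minus the log of the survival function, which is why hazard-based and PMF-based parametrizations optimize the same likelihood.
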
# Evaluation Criteria
The following evaluation metrics are available with `pycox.evaluation.EvalSurv`.

| Metric | Description |
| --- | --- |
| `concordance_td` | The time-dependent concordance index evaluated at the event times [4]. |
| `brier_score` | The IPCW Brier score (inverse probability of censoring weighted Brier score) [5][6][15]. See Section 3.1.2 of [15] for details. |
| `nbll` | The IPCW (negative) binomial log-likelihood [5][1]. That is, minus the binomial log-likelihood, which should not be confused with the negative binomial distribution. The weighting is performed as described in Section 3.1.2 of [15]. |
| `integrated_brier_score` | The integrated IPCW Brier score. Numerical integration of the `brier_score` [5][6]. |
| `integrated_nbll` | The integrated IPCW (negative) binomial log-likelihood. Numerical integration of the `nbll` [5][1]. |
| `brier_score_admin`, `integrated_brier_score_admin` | The administrative Brier score [15]. Works well for data with administrative censoring, meaning all censoring times are observed. See this example notebook. |
| `nbll_admin`, `integrated_nbll_admin` | The administrative (negative) binomial log-likelihood [15]. Works well for data with administrative censoring, meaning all censoring times are observed. See this example notebook. |
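
To make the time-dependent concordance of [4] concrete, here is a simplified plain-Python sketch. It assumes the usual definition, the probability that S(T_i | x_i) < S(T_i | x_j) among pairs where individual i has an observed event and T_i < T_j. It ignores ties and is not the routine behind `concordance_td`; the function name and the `surv_at` callback are hypothetical.

```python
import math

def concordance_td_sketch(durations, events, surv_at):
    """Fraction of comparable pairs that are ordered correctly.

    surv_at(i, t) returns the predicted survival probability of individual i at time t.
    A pair (i, j) is comparable when i has an observed event and T_i < T_j.
    """
    concordant = 0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored individual cannot be the earlier member of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                # The individual who fails first should have the lower survival at T_i.
                if surv_at(i, durations[i]) < surv_at(j, durations[i]):
                    concordant += 1
    return concordant / comparable if comparable else float("nan")

# Toy data (assumed): exponential survival curves S(t | x_i) = exp(-rate_i * t).
durations = [1.0, 2.0, 3.0]
events = [1, 1, 0]
rates = [2.0, 1.0, 0.5]

def surv_at(i, t):
    return math.exp(-rates[i] * t)

print(concordance_td_sketch(durations, events, surv_at))
```

Because the survival estimates are compared at the event time of the earlier individual, the measure remains meaningful for models without proportional hazards, where survival curves may cross.
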
# Datasets
A collection of datasets is available through the `pycox.datasets` module.
For example, the following code downloads the `metabric` dataset and loads it as a pandas dataframe:
```python
from pycox import datasets
df = datasets.metabric.read_df()
```
The `datasets` module will store datasets under the installation directory by default. You can specify a different directory by setting the `PYCOX_DATA_DIR` environment variable.
## Real Datasets:

| Dataset | Size | Description | Data source |
| --- | --- | --- | --- |
| flchain | 6,524 | The Assay of Serum Free Light Chain (FLCHAIN) dataset. See [1] for preprocessing. | source |
| gbsg | 2,232 | The Rotterdam & German Breast Cancer Study Group. See [2] for details. | source |
| kkbox | 2,814,735 | A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017 with administrative censoring. See [1] and [15] for details. Compared to kkbox_v1, this data set has more covariates and censoring times. Note: You need Kaggle credentials to access the dataset. | source |
| kkbox_v1 | 2,646,746 | A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017. See [1] for details. This is not the preferred version of this data set. Use kkbox instead. Note: You need Kaggle credentials to access the dataset. | source |
| metabric | 1,904 | The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). See [2] for details. | source |
| nwtco | 4,028 | Data from the National Wilm's Tumor (NWTCO). | source |
| support | 8,873 | Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT). See [2] for details. | source |
## Simulated Datasets:

| Dataset | Size | Description | Data source |
| --- | --- | --- | --- |
| rr_nl_nph | 25,000 | Dataset from simulation study in [1]. This is a continuous-time simulation study with event times drawn from a relative risk non-linear non-proportional hazards model (RRNLNPH). | SimStudyNonLinearNonPH |
| sac3 | 100,000 | Dataset from simulation study in [12]. This is a discrete-time dataset with 1000 possible event-times. | SimStudySACCensorConst |
| sac_admin5 | 50,000 | Dataset from simulation study in [15]. This is a discrete-time dataset with 1000 possible event-times. Very similar to `sac3`, but with fewer survival covariates and administrative censoring determined by 5 covariates. | SimStudySACAdmin |
# Installation
**Note:** *This package is still in its early stages of development, so please don't hesitate to report any problems you may experience.*
The package requires Python 3.6 or later.
Before installing **pycox**, please install [PyTorch](https://pytorch.org/get-started/locally/) (version >= 1.1).
You can then install the package with
```sh
pip install pycox
```
For the bleeding-edge version, you can instead install directly from GitHub (consider adding `--force-reinstall`):
```sh
pip install git+https://github.com/havakv/pycox.git
```
## Install from Source
Installation from source depends on [PyTorch](https://pytorch.org/get-started/locally/), so make sure it is installed.
Next, clone and install with
```sh
git clone https://github.com/havakv/pycox.git
cd pycox
pip install .
```
# References
\[1\] Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. *Journal of Machine Learning Research*, 20(129):1–30, 2019. \[[paper](http://jmlr.org/papers/v20/18-424.html)\]
\[2\] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. *BMC Medical Research Methodology*, 18(1), 2018. \[[paper](https://doi.org/10.1186/s12874-018-0482-1)\]
\[3\] Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. DeepHit: A deep learning approach to survival analysis with competing risks. In *Thirty-Second AAAI Conference on Artificial Intelligence*, 2018. \[[paper](http://medianetlab.ee.ucla.edu/papers/AAAI_2018_DeepHit)\]
\[4\] Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. *Statistics in Medicine*, 24(24):3927–3944, 2005. \[[paper](https://doi.org/10.1002/sim.2427)\]
\[5\] Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. *Statistics in Medicine*, 18(17-18):2529–2545, 1999. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-0258%2819990915/30%2918%3A17/18%3C2529%3A%3AAID-SIM274%3E3.0.CO%3B2-5)\]
\[6\] Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected Brier score in general survival models with right-censored event times. *Biometrical Journal*, 48(6):1029–1040, 2006. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.200610301?sid=nlm%3Apubmed)\]
\[7\] Charles C. Brown. On the use of indicator variables for studying the time-dependence of parameters in a response-time model. *Biometrics*, 31(4):863–872, 1975.
\[[paper](https://www.jstor.org/stable/2529811?seq=1#metadata_info_tab_contents)\]
\[8\] Michael F. Gensheimer and Balasubramanian Narasimhan. A scalable discrete-time survival model for neural networks. *PeerJ*, 7:e6257, 2019.
\[[paper](https://peerj.com/articles/6257/)\]
\[9\] Chun-Nam Yu, Russell Greiner, Hsiu-Chin Lin, and Vickie Baracos. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In *Advances in Neural Information Processing Systems 24*, pages 1845–1853. Curran Associates, Inc., 2011.
\[[paper](https://papers.nips.cc/paper/4210-learning-patient-specific-cancer-survival-distributions-as-a-sequence-of-dependent-regressors)\]
\[10\] Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. *arXiv preprint arXiv:1801.05512*, 2018.
\[[paper](https://arxiv.org/pdf/1801.05512.pdf)\]
\[11\] Michael Friedman. Piecewise exponential models for survival data with covariates. *The Annals of Statistics*, 10(1):101–113, 1982.
\[[paper](https://projecteuclid.org/euclid.aos/1176345693)\]
\[12\] Håvard Kvamme and Ørnulf Borgan. Continuous and discrete-time survival prediction with neural networks. *arXiv preprint arXiv:1910.06724*, 2019.
\[[paper](https://arxiv.org/pdf/1910.06724.pdf)\]
\[13\] Elia Biganzoli, Patrizia Boracchi, Luigi Mariani, and Ettore Marubini. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. *Statistics in Medicine*, 17(10):1169–1186, 1998.
\[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0258(19980530)17:10%3C1169::AID-SIM796%3E3.0.CO;2-D)\]
\[14\] Marco Fornili, Federico Ambrogi, Patrizia Boracchi, and Elia Biganzoli. Piecewise exponential artificial neural networks (PEANN) for modeling hazard function with right censored data. *Computational Intelligence Methods for Bioinformatics and Biostatistics*, pages 125–136, 2014.
\[[paper](https://link.springer.com/chapter/10.1007%2F978-3-319-09042-9_9)\]
\[15\] Håvard Kvamme and Ørnulf Borgan. The Brier Score under Administrative Censoring: Problems and Solutions. *arXiv preprint arXiv:1912.08581*, 2019.
\[[paper](https://arxiv.org/pdf/1912.08581.pdf)\]