# pycox

**Repository Path**: houwei2022/pycox

## Basic Information

- **Project Name**: pycox
- **Description**: https://github.com/havakv/pycox.git
- **Primary Language**: Python
- **License**: BSD-2-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-06-09
- **Last Updated**: 2022-06-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Time-to-event prediction with PyTorch

Get Started • Methods • Evaluation Criteria • Datasets • Installation • References

**pycox** is a Python package for survival analysis and time-to-event prediction with [PyTorch](https://pytorch.org), built on the [torchtuples](https://github.com/havakv/torchtuples) package for training PyTorch models. An R version of this package is available at [survivalmodels](https://github.com/RaphaelS1/survivalmodels).

The package contains implementations of various [survival models](#methods), some useful [evaluation metrics](#evaluation-criteria), and a collection of [event-time datasets](#datasets). In addition, some useful preprocessing tools are available in the `pycox.preprocessing` module.

# Get Started

To get started, you first need to install [PyTorch](https://pytorch.org/get-started/locally/). You can then install **pycox** via pip:

```sh
pip install pycox
```

or via conda:

```sh
conda install -c conda-forge pycox
```

We recommend starting with [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb), which explains the general usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure. The notebook uses the `LogisticHazard` method for illustration, but most of the principles generalize to the other methods.

Alternatively, there are many examples listed in the [examples folder](https://nbviewer.jupyter.org/github/havakv/pycox/tree/master/examples), or you can follow the tutorial based on the `LogisticHazard`:

- [01_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/01_introduction.ipynb): General usage of the package in terms of preprocessing, creation of neural networks, model training, and evaluation procedure.
- [02_introduction.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/02_introduction.ipynb): Quantile-based discretization scheme, nested tuples with `tt.tuplefy`, entity embedding of categorical variables, and cyclical learning rates.
- [03_network_architectures.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/03_network_architectures.ipynb): Extending the framework with custom networks and custom loss functions. The example combines an autoencoder with a survival network, and considers a loss that combines the autoencoder loss with the loss of the `LogisticHazard`.
- [04_mnist_dataloaders_cnn.ipynb](https://nbviewer.jupyter.org/github/havakv/pycox/blob/master/examples/04_mnist_dataloaders_cnn.ipynb): Using dataloaders and convolutional networks for the MNIST data set. We repeat the [simulations](https://peerj.com/articles/6257/#p-41) of [\[8\]](#references) where each digit defines the scale parameter of an exponential distribution.

# Methods

The following methods are available in the `pycox.methods` module.

## Continuous-Time Models:

| Method | Description | Example |
| --- | --- | --- |
| CoxTime | Cox-Time is a relative risk model that extends Cox regression beyond the proportional hazards [1]. | notebook |
| CoxCC | Cox-CC is a proportional version of the Cox-Time model [1]. | notebook |
| CoxPH (DeepSurv) | CoxPH is a Cox proportional hazards model also referred to as DeepSurv [2]. | notebook |
| PCHazard | The Piecewise Constant Hazard (PC-Hazard) model [12] assumes that the continuous-time hazard function is constant in predefined intervals. It is similar to the Piecewise Exponential Models [11] and PEANN [14], but with a softplus activation instead of the exponential function. | notebook |
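
The PC-Hazard description above can be illustrated with a small standalone sketch. It is not pycox's internal code: the function names `softplus` and `pc_hazard_survival`, the interval grid `cuts`, and the raw outputs are hypothetical, and the sketch only assumes that one raw network output per interval is passed through a softplus to obtain a positive, piecewise-constant hazard.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pc_hazard_survival(raw_outputs, cuts, t):
    """Survival S(t) = exp(-H(t)) for a hazard that is constant on each
    interval [cuts[j], cuts[j + 1]). Beyond the last cut the cumulative
    hazard is not extrapolated."""
    cum_hazard = 0.0
    for j, z in enumerate(raw_outputs):
        start, end = cuts[j], cuts[j + 1]
        if t <= start:
            break
        # Hazard in interval j times the time spent in that interval.
        cum_hazard += softplus(z) * (min(t, end) - start)
    return math.exp(-cum_hazard)


# Hypothetical example: two unit-length intervals with raw outputs of 0,
# so the hazard is softplus(0) = log(2) in both intervals.
cuts = [0.0, 1.0, 2.0]
raw = [0.0, 0.0]
print([pc_hazard_survival(raw, cuts, t) for t in (0.0, 0.5, 1.0, 2.0)])
```

Replacing `softplus` with `math.exp` would give the exponential parameterization used by the piecewise exponential models mentioned in the table.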

## Discrete-Time Models:

| Method | Description | Example |
| --- | --- | --- |
| LogisticHazard (Nnet-survival) | The Logistic-Hazard method parametrizes the discrete hazards and optimizes the survival likelihood [12] [7]. It is also called Partial Logistic Regression [13] and Nnet-survival [8]. | notebook |
| PMF | The PMF method parametrizes the probability mass function (PMF) and optimizes the survival likelihood [12]. It is the foundation of methods such as DeepHit and MTLR. | notebook |
| DeepHit, DeepHitSingle | DeepHit is a PMF method with a loss for improved ranking that can handle competing risks [3]. | single, competing |
| MTLR (N-MTLR) | The (Neural) Multi-Task Logistic Regression is a PMF method proposed by [9] and [10]. | notebook |
| BCESurv | A method representing a set of binary classifiers that remove individuals as they are censored [15]. The loss is the binary cross entropy of the survival estimates at a set of discrete times, with targets that are indicators of surviving each time. | bs_example |
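
The hazard and PMF parameterizations in the table describe the same discrete-time distribution. The sketch below is a standalone illustration, not pycox code: the helper names and the example logits are hypothetical. It shows how discrete hazards obtained from a sigmoid map to survival probabilities and to a probability mass function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hazards_to_survival(hazards):
    # S(tau_j) = prod_{k <= j} (1 - h_k): survive every interval up to j.
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h
        surv.append(s)
    return surv


def hazards_to_pmf(hazards):
    # f(tau_j) = h_j * S(tau_{j-1}): survive to j - 1, then fail in j.
    pmf, s_prev = [], 1.0
    for h in hazards:
        pmf.append(h * s_prev)
        s_prev *= 1.0 - h
    return pmf


# Hypothetical network outputs (logits) for three discrete time points.
logits = [0.0, 0.0, 0.0]
hazards = [sigmoid(z) for z in logits]
surv = hazards_to_survival(hazards)
pmf = hazards_to_pmf(hazards)

# The PMF plus the probability of surviving past the last time sums to 1.
print(sum(pmf) + surv[-1])
```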

# Evaluation Criteria

The following evaluation metrics are available with `pycox.evaluation.EvalSurv`.

| Metric | Description |
| --- | --- |
| `concordance_td` | The time-dependent concordance index evaluated at the event times [4]. |
| `brier_score` | The IPCW Brier score (inverse probability of censoring weighted Brier score) [5][6][15]. See Section 3.1.2 of [15] for details. |
| `nbll` | The IPCW (negative) binomial log-likelihood [5][1]. That is, minus the binomial log-likelihood, which should not be confused with the negative binomial distribution. The weighting is performed as described in Section 3.1.2 of [15]. |
| `integrated_brier_score` | The integrated IPCW Brier score. Numerical integration of the `brier_score` [5][6]. |
| `integrated_nbll` | The integrated IPCW (negative) binomial log-likelihood. Numerical integration of the `nbll` [5][1]. |
| `brier_score_admin`, `integrated_brier_score_admin` | The administrative Brier score [15]. Works well for data with administrative censoring, meaning all censoring times are observed. See this example notebook. |
| `nbll_admin`, `integrated_nbll_admin` | The administrative (negative) binomial log-likelihood [15]. Works well for data with administrative censoring, meaning all censoring times are observed. See this example notebook. |
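
To make the idea behind the time-dependent concordance index [4] concrete, here is a simplified standalone sketch. It is not the `EvalSurv` implementation: the function signature, the `surv_at` callback, and the exponential example are assumptions made for illustration, and tied durations are ignored for brevity.

```python
import math


def concordance_td(durations, events, surv_at):
    """Simplified time-dependent concordance.

    A pair (i, j) is comparable when individual i has an observed event
    before the duration of individual j. It is concordant when the model
    predicts lower survival for i than for j at i's event time.
    `surv_at(i, t)` returns the predicted survival S(t | x_i).
    """
    concordant = comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # censored individuals cannot be the earlier event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                t = durations[i]
                if surv_at(i, t) < surv_at(j, t):
                    concordant += 1
    return concordant / comparable


# Hypothetical data: exponential survival curves whose rates are ordered
# so that higher risk corresponds to earlier events.
durations = [1.0, 2.0, 3.0]
events = [1, 1, 0]
rates = [3.0, 2.0, 1.0]


def exp_surv(i, t):
    return math.exp(-rates[i] * t)


c_index = concordance_td(durations, events, exp_surv)
print(c_index)
```

Because the comparison uses the survival estimates at each event time rather than a single risk score, the measure remains meaningful for models whose survival curves cross.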

# Datasets

A collection of datasets is available through the `pycox.datasets` module. For example, the following code will download the `metabric` dataset and load it as a pandas dataframe:

```python
from pycox import datasets
df = datasets.metabric.read_df()
```

The `datasets` module will store datasets under the installation directory by default. You can specify a different directory by setting the `PYCOX_DATA_DIR` environment variable.

## Real Datasets:

| Dataset | Size | Description | Data source |
| --- | --- | --- | --- |
| flchain | 6,524 | The Assay of Serum Free Light Chain (FLCHAIN) dataset. See [1] for preprocessing. | source |
| gbsg | 2,232 | The Rotterdam & German Breast Cancer Study Group. See [2] for details. | source |
| kkbox | 2,814,735 | A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017 with administrative censoring. See [1] and [15] for details. Compared to kkbox_v1, this data set has more covariates and censoring times. Note: You need Kaggle credentials to access the dataset. | source |
| kkbox_v1 | 2,646,746 | A survival dataset created from the WSDM - KKBox's Churn Prediction Challenge 2017. See [1] for details. This is not the preferred version of this data set. Use kkbox instead. Note: You need Kaggle credentials to access the dataset. | source |
| metabric | 1,904 | The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). See [2] for details. | source |
| nwtco | 4,028 | Data from the National Wilms' Tumor Study (NWTCO). | source |
| support | 8,873 | Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT). See [2] for details. | source |
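
To keep downloaded datasets such as the ones listed above outside the installation directory, the `PYCOX_DATA_DIR` environment variable can be set from Python. This is a minimal sketch: the directory location and the helper `load_metabric` are hypothetical, and it assumes that pycox reads the variable when the package is first imported, so the variable is set before the import.

```python
import os
import tempfile

# Hypothetical location for the downloaded data; any writable path works.
data_dir = os.path.join(tempfile.gettempdir(), "pycox_data")
os.makedirs(data_dir, exist_ok=True)
os.environ["PYCOX_DATA_DIR"] = data_dir


def load_metabric():
    # Imported inside the function so that the environment variable is
    # already set when pycox is imported for the first time.
    from pycox import datasets
    return datasets.metabric.read_df()
```

Calling `load_metabric()` then downloads the data on first use. Setting the variable in the shell before starting Python achieves the same result without code changes.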

## Simulated Datasets:

| Dataset | Size | Description | Data source |
| --- | --- | --- | --- |
| rr_nl_nph | 25,000 | Dataset from simulation study in [1]. This is a continuous-time simulation study with event times drawn from a relative risk non-linear non-proportional hazards model (RRNLNPH). | SimStudyNonLinearNonPH |
| sac3 | 100,000 | Dataset from simulation study in [12]. This is a discrete-time dataset with 1000 possible event-times. | SimStudySACCensorConst |
| sac_admin5 | 50,000 | Dataset from simulation study in [15]. This is a discrete-time dataset with 1000 possible event-times. Very similar to `sac3`, but with fewer survival covariates and administrative censoring determined by 5 covariates. | SimStudySACAdmin |

# Installation

**Note:** *This package is still in its early stages of development, so please don't hesitate to report any problems you may experience.*

The package only works with Python 3.6+.

Before installing **pycox**, please install [PyTorch](https://pytorch.org/get-started/locally/) (version >= 1.1). You can then install the package with

```sh
pip install pycox
```

For the bleeding edge version, you can instead install directly from GitHub (consider adding `--force-reinstall`):

```sh
pip install git+https://github.com/havakv/pycox.git
```

## Install from Source

Installation from source depends on [PyTorch](https://pytorch.org/get-started/locally/), so make sure it is installed. Next, clone and install with

```sh
git clone https://github.com/havakv/pycox.git
cd pycox
pip install .
```

# References

\[1\] Håvard Kvamme, Ørnulf Borgan, and Ida Scheel. Time-to-event prediction with neural networks and Cox regression. *Journal of Machine Learning Research*, 20(129):1–30, 2019. \[[paper](http://jmlr.org/papers/v20/18-424.html)\]

\[2\] Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. Deepsurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. *BMC Medical Research Methodology*, 18(1), 2018. \[[paper](https://doi.org/10.1186/s12874-018-0482-1)\]

\[3\] Changhee Lee, William R Zame, Jinsung Yoon, and Mihaela van der Schaar. Deephit: A deep learning approach to survival analysis with competing risks. In *Thirty-Second AAAI Conference on Artificial Intelligence*, 2018. \[[paper](http://medianetlab.ee.ucla.edu/papers/AAAI_2018_DeepHit)\]

\[4\] Laura Antolini, Patrizia Boracchi, and Elia Biganzoli. A time-dependent discrimination index for survival data. *Statistics in Medicine*, 24(24):3927–3944, 2005. \[[paper](https://doi.org/10.1002/sim.2427)\]

\[5\] Erika Graf, Claudia Schmoor, Willi Sauerbrei, and Martin Schumacher. Assessment and comparison of prognostic classification schemes for survival data. *Statistics in Medicine*, 18(17-18):2529–2545, 1999. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-0258%2819990915/30%2918%3A17/18%3C2529%3A%3AAID-SIM274%3E3.0.CO%3B2-5)\]

\[6\] Thomas A. Gerds and Martin Schumacher. Consistent estimation of the expected Brier score in general survival models with right-censored event times. *Biometrical Journal*, 48(6):1029–1040, 2006. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.200610301?sid=nlm%3Apubmed)\]

\[7\] Charles C. Brown. On the use of indicator variables for studying the time-dependence of parameters in a response-time model. *Biometrics*, 31(4):863–872, 1975. \[[paper](https://www.jstor.org/stable/2529811?seq=1#metadata_info_tab_contents)\]

\[8\] Michael F. Gensheimer and Balasubramanian Narasimhan. A scalable discrete-time survival model for neural networks. *PeerJ*, 7:e6257, 2019. \[[paper](https://peerj.com/articles/6257/)\]

\[9\] Chun-Nam Yu, Russell Greiner, Hsiu-Chin Lin, and Vickie Baracos. Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In *Advances in Neural Information Processing Systems 24*, pages 1845–1853. Curran Associates, Inc., 2011. \[[paper](https://papers.nips.cc/paper/4210-learning-patient-specific-cancer-survival-distributions-as-a-sequence-of-dependent-regressors)\]

\[10\] Stephane Fotso. Deep neural networks for survival analysis based on a multi-task framework. *arXiv preprint arXiv:1801.05512*, 2018. \[[paper](https://arxiv.org/pdf/1801.05512.pdf)\]

\[11\] Michael Friedman. Piecewise exponential models for survival data with covariates. *The Annals of Statistics*, 10(1):101–113, 1982. \[[paper](https://projecteuclid.org/euclid.aos/1176345693)\]

\[12\] Håvard Kvamme and Ørnulf Borgan. Continuous and discrete-time survival prediction with neural networks. *arXiv preprint arXiv:1910.06724*, 2019. \[[paper](https://arxiv.org/pdf/1910.06724.pdf)\]

\[13\] Elia Biganzoli, Patrizia Boracchi, Luigi Mariani, and Ettore Marubini. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. *Statistics in Medicine*, 17(10):1169–1186, 1998. \[[paper](https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1097-0258(19980530)17:10%3C1169::AID-SIM796%3E3.0.CO;2-D)\]

\[14\] Marco Fornili, Federico Ambrogi, Patrizia Boracchi, and Elia Biganzoli. Piecewise exponential artificial neural networks (PEANN) for modeling hazard function with right censored data. *Computational Intelligence Methods for Bioinformatics and Biostatistics*, pages 125–136, 2014. \[[paper](https://link.springer.com/chapter/10.1007%2F978-3-319-09042-9_9)\]

\[15\] Håvard Kvamme and Ørnulf Borgan. The Brier Score under Administrative Censoring: Problems and Solutions. *arXiv preprint arXiv:1912.08581*, 2019. \[[paper](https://arxiv.org/pdf/1912.08581.pdf)\]