# CCMI **Repository Path**: zhoub86/CCMI ## Basic Information - **Project Name**: CCMI - **Description**: Classifier based mutual information, conditional mutual information estimation; conditional independence testing - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-03-18 - **Last Updated**: 2021-03-18 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # CCMI Code for reproducing key results in the paper [CCMI : Classifier based Conditional Mutual Information Estimation](https://arxiv.org/abs/1906.01824) by Sudipto Mukherjee, Himanshu Asnani and Sreeram Kannan. If you use the code, please cite our paper. The code can be used for mutual information and conditional mutual information estimation; conditional independence testing. ## Dependencies The code has been tested with the following versions of packages. - Python 3.6.5 - Tensorflow 1.11.0 - xgboost 0.80 (Optional : To run CCIT baseline for conditional independence testing) ## Running CCMI on your own data-sets First cd to the folder containing CCMI code, ```bash $ cd CIT $ python ``` and then can run CCMI as shown in the example below (you could have this code snippet as a Python script) : ```bash >> from CCMI import CCMI >> import numpy as np >> X = np.random.randn(5000, 1) >> Y = np.random.randn(5000, 1) >> Z = np.random.randn(5000, 1) >> model_indp = CCMI(X, Y, Z, tester = 'Classifier', metric = 'donsker_varadhan', num_boot_iter = 10, h_dim = 64, max_ep = 20) >> cmi_est_indp = model_indp.get_cmi_est() >> print(cmi_est_indp) -0.0003 >> Y = 0.5*X + np.random.normal(loc = 0, scale = 0.2, size = (X.shape)) >> model_dep = CCMI(X, Y, Z, tester = 'Classifier', metric = 'donsker_varadhan', num_boot_iter = 10, h_dim = 64, max_ep = 20) >> cmi_est_dep = model_dep.get_cmi_est() >> print(cmi_est_dep) 0.9707 ``` ## Reproducing Results of the paper ### CMI Estimation : Synthetic data generation ./data/gen_cmi_data.py - Contains several categories of synthetic data generators that have ground truth CMI values known. The models have X and Y as 1-dimensional variables, while dimension of Z can scale. Model-I in the paper corresponds to 'Category F' and Model-II to 'Category G'. To generate data from a particular category (say category F) with given dimension (say dz = 20) and number of samples (say N = 5000), run the following from inside 'data' folder: ```bash $ PYTHONPATH='..' python gen_cmi_data.py --cat F --num_th 5 --dz 20 ``` (Note: PYTHONPATH='..' is required because NPEET code is in the parent folder, but gen_cmi_data.py is run from ./data/) For ease of use, we have provided a bash script './data/gen_synthetic_data_bash.sh' which will generate all the data-sets used for linear and non-linear CMI estimation experiments in the paper. So, alternatively, to generate all data-sets used in the paper, just run ```bash $ ./gen_synthetic_data_bash.sh ``` (Note: Due to random functions used to simulate data and different random seeds, the exact values of true and estimated CMI for generated data-sets will be different from those in the paper.) ### CMI Estimation : Running the Estimators To run Generator+Classifier estimators, first cd to CMI_Est and then run : ```bash $ cd CMI_Est $ python main_CMI_Est.py --mimic cgan --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20 ``` Similarly for other Generators, ```bash $ python main_CMI_Est.py --mimic cvae --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20 $ python main_CMI_Est.py --mimic knn --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20 ``` For difference-based CMI estimates, run the following (Classifier-MI and f-MINE respectively) : ```bash $ python main_CMI_Est.py --mimic mi_diff --tester Classifier --metric donsker_varadhan --cat F --num_th 5 --dz 20 $ python main_CMI_Est.py --mimic mi_diff --tester Neural --metric f_divergence --cat F --num_th 5 --dz 20 ``` For ease of use, we have provided bash scripts './CMI_Est/run_\_mimic.sh' which will run the corresponding estimator on all linear and non-linear CMI estimation experiments in the paper. For example, to obtain estimates from CGAN+Classifier, run the following ```bash $ ./run_cgan_mimic.sh ``` Similary, run_cvae_mimic.sh, run_knn_mimic.sh, run_mi_diff_mimic.sh, run_mi_diff_mimic_neural.sh, run_ksg_baseline.sh . (Note : Make sure to first create the data-sets using './data/gen_synthetic_data_bash.sh' before running the estimation scripts.) ### Conditional Independence Testing : Synthetic data generation To generate post-Non-Linear cosine data-sets, do the following : ```bash $ cd data $ python gen_cit_postNonLin_data.py ``` ### Conditional Independence Testing : Running the Testers To run CCMI for conditional independence testing on synthetic data, run the following : ```bash $ cd CIT $ python main_CCMI_postNonLin.py ``` And for flow-cytometry real data-sets : ```bash $ python main_CCMI_flowCyto.py ``` For comparison with state-of-the-art CI-Tester (CCIT), we have also provided code to run it for synthetic and real data-sets. ```bash $ python main_CCIT_postNonLin.py --dz 1 ``` Similarly, run for the other dimensions {5, 10, 20, 50, 100}. And for flow-cytometry real data-sets : ```bash $ python main_CCIT_flowCyto.py ```