# snape

**Repository Path**: pingfanrenbiji/snape

## Basic Information

- **Project Name**: snape
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-21
- **Last Updated**: 2021-04-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[![Build status](https://travis-ci.org/mbernico/snape.svg?branch=master)](https://travis-ci.org/mbernico/snape) [![Coverage Status](https://coveralls.io/repos/github/mbernico/snape/badge.svg?branch=master)](https://coveralls.io/github/mbernico/snape?branch=master)

# Snape

Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.

## Motivation

Snape was primarily created for academic and educational settings. It has been used to create datasets that are unique per student and per assignment for various homework assignments. It has also been used to create class-wide assessments in conjunction with 'Kaggle In the Classroom.'

Other users have suggested non-academic use cases as well, including 'interview screening problems,' model comparison, etc.

## Installation

### Via GitHub

```bash
git clone https://github.com/mbernico/snape.git
cd snape
python setup.py install
```

### Via pip

*Coming Soon...*

## Quick Start

Snape can run either as a Python module or as a command line application.

### Command Line Usage

#### Creating a Dataset

From the main directory in the git repo:

```bash
python snape/make_dataset.py -c example/config_classification.json
```

This uses the configuration file example/config_classification.json to create an artificial dataset called 'my_dataset' (the name is specified in the JSON config, more on this later...).

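As a rough illustration of the JSON config mentioned above, the sketch below writes a config file with Python's standard `json` module. The keys are copied from the dictionary shown under Python Module Usage in this README; that the file passed to `-c` accepts exactly the same keys is an assumption, and the file name `my_config.json` and the helper `write_config` are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Keys copied from the "Python Module Usage" dictionary in this README.
# Assumption: the JSON file passed to -c accepts the same keys.
config = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": [],
}


def write_config(config: dict, path: Path) -> Path:
    """Serialize a config dictionary to a JSON file and return its path."""
    path.write_text(json.dumps(config, indent=2))
    return path


# Hypothetical file name, written to a temporary directory so the
# sketch leaves the working directory untouched.
config_path = write_config(config, Path(tempfile.mkdtemp()) / "my_config.json")
print(config_path.read_text())
```

A file written this way could then be passed to `make_dataset.py` with `-c` in place of the example config, provided the assumption about the keys holds.
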
The dataset will consist of three files:

* my_dataset_train.csv (80% of the artificial dataset with all dependent and independent variables)
* my_dataset_test.csv (20% of the artificial dataset with only the independent variables present)
* my_dataset_testkey.csv (the same 20% as _test, including the dependent variables)

Note that if a star schema is generated, additional csv files will be generated. There will be one extra csv file per dimension, but only the main 'fact table' dataset will be split into test and train files.

The train and test files can be given to a student. The student can respond with a file of predictions, which can be scored against the testkey as follows:

#### Scoring a Dataset

```bash
python snape/score_dataset.py -p example/student_predictions.csv -k example/student_testkey.csv
```

Snape's score_dataset.py will attempt to detect the problem type and then score it, printing some metrics:

```
Problem Type Detection: binary
---Binary Classification Score---
             precision    recall  f1-score   support

          0       0.81      0.99      0.89      1601
          1       0.50      0.06      0.11       399

avg / total       0.75      0.80      0.73      2000
```

### Python Module Usage

#### Creating a Dataset

```python
from snape.make_dataset import make_dataset

# configuration json examples can be found in doc
conf = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": []
}

make_dataset(config=conf)
```

#### Scoring a Dataset

```python
from snape.score_dataset import score_dataset

# a dataset's testkey can be compared to a prediction file using score_dataset()
results = score_dataset(y_file="student_testkey.csv", y_hat_file="student_predictions.csv")

# results is a tuple of (a_primary_metric, classification_report)
print("AUC = " + str(results[0]))
print(results[1])
```

## Dataset Generation Config

1. [Classification JSON](doc/config_classification.json.md)
2. [Regression JSON](doc/config_regression.json.md)

## Why Snape?

Snape is primarily used for creating complex datasets that *challenge* students and teach defense against the dark arts of machine learning. :)
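
### Appendix: Reading the Scoring Report

For readers interpreting the report printed by score_dataset.py, the following is a minimal sketch of how the precision, recall, f1-score, and support columns are conventionally defined for a single class. It is not Snape's implementation; the function `binary_report` and the toy labels are hypothetical and exist only for illustration.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and support for one class.

    Hypothetical helper for illustration; not part of Snape.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: the class was present and was predicted.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: the class was predicted but not present.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: the class was present but not predicted.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Support is the number of true instances of the class.
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Toy labels: four negatives followed by four positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0]

report = binary_report(y_true, y_pred)
print(report)
```

Read this way, the sample report in the Scoring a Dataset section shows a recall of 0.06 for class 1, meaning that only about 6% of the 399 true class-1 rows were predicted as class 1.
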