# snape

**Repository Path**: pingfanrenbiji/snape

## Basic Information

- **Project Name**: snape
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-21
- **Last Updated**: 2021-04-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

[![Build status](https://travis-ci.org/mbernico/snape.svg?branch=master)](https://travis-ci.org/mbernico/snape)
[![Coverage Status](https://coveralls.io/repos/github/mbernico/snape/badge.svg?branch=master)](https://coveralls.io/github/mbernico/snape?branch=master)

# Snape

Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression
and then adds in 'realism' features such as complex formatting, varying scales, categorical variables,
and missing values.
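
To give a feel for what these 'realism' features mean for whoever receives the data, here is a minimal cleaning sketch. It is not part of Snape: the helper name `parse_messy_number` is hypothetical, and the exact formats it handles (a leading `$`, comma thousands separators, a trailing `%`, empty strings for missing values) are assumptions about what formatted values might look like, not a description of Snape's actual output.

```python
import math


def parse_messy_number(raw):
    """Convert strings such as "$1,234.50" or "12.5%" to a float.

    None and empty strings are returned as NaN so that missing values
    survive the conversion.
    """
    if raw is None:
        return math.nan
    text = str(raw).strip()
    if text == "":
        return math.nan
    is_percent = text.endswith("%")
    # Strip the decorations assumed above: dollar sign, thousands
    # separator, and percent sign.
    text = text.replace("$", "").replace(",", "").rstrip("%")
    value = float(text)
    # Assumption: percentages are stored as fractions after cleaning.
    return value / 100.0 if is_percent else value


print(parse_messy_number("$1,234.50"))
print(parse_messy_number("12.5%"))
```

Whether a percentage should become a fraction or stay on a 0 to 100 scale is a modelling choice; the sketch picks fractions only for illustration.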

## Motivation

Snape was primarily created for academic and educational settings.  It has been used to create homework datasets that
are unique per student and per assignment.  It has also been used to create class-wide assessments in
conjunction with 'Kaggle In the Classroom.'

Other users have suggested non-academic use cases as well, such as 'interview screening problems' and model
comparison.

## Installation


### Via GitHub
```bash
git clone https://github.com/mbernico/snape.git
cd snape
python setup.py install
```
### Via pip
*Coming Soon...*

## Quick Start

Snape can run either as a Python module or as a command line application.

### Command Line Usage

#### Creating a Dataset

From the main directory in the git repo:
```bash
python snape/make_dataset.py -c example/config_classification.json
```
This uses the configuration file example/config_classification.json to create an artificial dataset called 'my_dataset'
(the name is specified in the JSON config; more on this later).

The dataset will consist of three files:
*  my_dataset_train.csv   (80% of the artificial dataset with all dependent and independent variables)
*  my_dataset_test.csv    (20% of the artificial dataset with only the independent variables present)
*  my_dataset_testkey.csv (the same 20% as _test, including the dependent variables)

Note that if a star schema is generated, additional CSV files will be written, one per dimension. Only the main 'fact table' dataset is split into train and test files.
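
The relationship between the three files can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Snape's implementation: the function `split_dataset`, the target column name `y`, and the fixed seed are all assumptions made for the example.

```python
import random


def split_dataset(rows, target="y", test_fraction=0.2, seed=0):
    """Split rows (a list of dicts) into train, test, and testkey lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    testkey = shuffled[:n_test]   # held-out rows, target included
    train = shuffled[n_test:]     # remaining rows, target included
    # The test rows are the testkey rows with the target column removed.
    test = [{k: v for k, v in row.items() if k != target} for row in testkey]
    return train, test, testkey


# Toy data: two features and a hypothetical target column "y".
rows = [{"x1": i, "x2": i * 2, "y": i % 2} for i in range(10)]
train, test, testkey = split_dataset(rows)
print(len(train), len(test), len(testkey))
```

The test and testkey lists hold the same rows in the same order, which is what lets a file of predictions made from the test file be scored against the testkey.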

The train and test files can be given to a student.  The student can respond with a file of predictions, which can be
scored against the testkey as follows:

#### Scoring a Dataset

```bash
python snape/score_dataset.py  -p example/student_predictions.csv  -k example/student_testkey.csv
```
Snape's score_dataset.py will attempt to detect the problem type and then score it, printing some metrics:


```
Problem Type Detection: binary
---Binary Classification Score---
             precision    recall  f1-score   support

          0       0.81      0.99      0.89      1601
          1       0.50      0.06      0.11       399

avg / total       0.75      0.80      0.73      2000
```
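
For readers unfamiliar with the columns in the report above, the following sketch computes per-class precision, recall, F1, and support from two lists of labels. It is a hand-rolled illustration using the standard definitions of these metrics; the function name `per_class_scores` and the toy labels are hypothetical, and this is not how score_dataset.py is implemented.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Support is the number of true instances of this label.
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


# Toy example: five true labels and five predictions.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
scores = per_class_scores(y_true, y_pred)
for label, (precision, recall, f1, support) in scores.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

In the sample report, the low recall for class 1 (0.06) means that very few of the 399 true positives were predicted as such, even though overall accuracy looks reasonable on the imbalanced data.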


### Python Module Usage


#### Creating a Dataset
```python
from snape.make_dataset import make_dataset

# configuration json examples can be found in doc
conf = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": []
}

make_dataset(config=conf)
```
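
Because the command line tool reads its configuration from a JSON file, the same dictionary can be saved to disk with the standard `json` module and reused there. The sketch below writes to a temporary directory; the file name `my_config.json` is a placeholder chosen for the example.

```python
import json
import os
import tempfile

# Same configuration as in the example above.
conf = {
    "type": "classification",
    "n_classes": 2,
    "n_samples": 1000,
    "n_features": 10,
    "out_path": "./",
    "output": "my_dataset",
    "n_informative": 3,
    "n_duplicate": 0,
    "n_redundant": 0,
    "n_clusters": 2,
    "weights": [0.8, 0.2],
    "pct_missing": 0.00,
    "insert_dollar": "Yes",
    "insert_percent": "Yes",
    "n_categorical": 0,
    "star_schema": "No",
    "label_list": []
}

# Write the configuration to a JSON file (placeholder name).
config_path = os.path.join(tempfile.mkdtemp(), "my_config.json")
with open(config_path, "w") as handle:
    json.dump(conf, handle, indent=4)

# Read it back to confirm that it round-trips unchanged.
with open(config_path) as handle:
    loaded = json.load(handle)
print(loaded == conf)
```

A file written this way could then be passed to make_dataset.py with the `-c` option, as in the Command Line Usage section.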


#### Scoring a Dataset

```python
from snape.score_dataset import score_dataset

# a dataset's testkey can be compared to a prediction file using score_dataset()
results = score_dataset(y_file="student_testkey.csv", y_hat_file="student_predictions.csv")
# results is a tuple of (a_primary_metric, classification_report)
print("AUC = " + str(results[0]))
print(results[1])
```


## Dataset Generation Config

1.  [Classification JSON](doc/config_classification.json.md)
2.  [Regression JSON](doc/config_regression.json.md)


## Why Snape?
Snape is primarily used for creating complex datasets that *challenge* students and teach defense against the dark
arts of machine learning.  :)