# class-balancer

**Repository Path**: initialdream1659/class-balancer

## Basic Information

- **Project Name**: class-balancer
- **Description**: Simple package for dealing with unbalanced data sets.
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-08-21
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# class-balancer

Simple package for dealing with unbalanced data sets in Python.

# Overview

I modified and simplified work from `https://github.com/fmfn/UnbalancedDataset`, which has since become `https://github.com/scikit-learn-contrib/imbalanced-learn` -- and is probably much better than this code. However, that code seems to be more of a repository of techniques, while this gets the job done and does a few things they do not do:

1) This package is designed to balance classes by undersampling the over-represented class *and* oversampling the under-represented class -- meeting in the middle, if you will.
   - Setting `frac = 1` will exactly balance the classes.
   - `frac < 1` will bring the classes closer to balanced, but not all the way. This is a heuristic that I found works well, and it modifies the data less.
   - With **undersampling**, the number of data points to remove is calculated first. Tomek links are then found: if enough exist, a random subset of them is removed; otherwise all of them are removed and random undersampling makes up the remainder.
   - With **oversampling**, we use the SMOTE method and then remove any Tomek links we may have inadvertently created.
2) The `fit` method generates a dictionary of weights that each class should be scaled by (sketched below). This allows for chaining oversampling and undersampling to achieve the correct class balance.
3) You can ignore all of this and just use it as a simple black box that solves your issues.
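To make the "meet in the middle" idea concrete, here is a minimal sketch of the weight calculation. The helper name `class_weights` and the linear interpolation toward the mean class count are my own illustration, inferred from the weights printed in the example output below; the package's actual internals may differ.

```python
import numpy as np
from collections import Counter

def class_weights(y, frac=1.0):
    """Hypothetical sketch of the per-class weight dictionary.

    Each class count is moved `frac` of the way toward the mean count:
    frac=1.0 gives exact balance, frac<1.0 goes only part of the way.
    A weight > 1 means "oversample this class"; < 1 means "undersample".
    """
    counts = Counter(y)
    mean_count = np.mean(list(counts.values()))
    weights = {}
    for cls, n in counts.items():
        target = n + frac * (mean_count - n)  # meet in the middle
        weights[cls] = target / n
    return weights

y = np.array([0] * 52 + [1] * 200)
print(class_weights(y, frac=1.0))
# -> approximately {0: 2.423, 1: 0.63}, the weights in the log below
```

With `frac=1.0` both classes land on the mean count (126 here), which matches the totals in the example output below: class 1 is undersampled from 200 to 126 and class 0 is oversampled from 52 to roughly 126.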
# Installation

```
git clone https://github.com/cthacker/class-balancer.git
cd class-balancer
python setup.py install
```

# Example Use

```python
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets
import seaborn  # not needed, makes more aesthetic plots

# import class balancer
from balancer import ClassBalancer

np.random.seed(0)
X, y = sklearn.datasets.make_moons(400, noise=0.15)

ax = plt.subplot(3, 1, 1)
ax.set_title("Original Data")
plt.scatter(X[:, 0], X[:, 1], s=40, c=y, cmap=plt.cm.Spectral)
plt.xlim((-1.5, 2.5))
plt.ylim((-1.5, 1.5))

# remove roughly 70% of class 0
delmask = []
for i, cl in enumerate(y):
    if cl == 0:
        # randint(1, 11) draws 1..10; random_integers was removed from NumPy
        if np.random.randint(1, 11) > 3:
            delmask.append(i)
X = np.delete(X, delmask, axis=0)
y = np.delete(y, delmask, axis=0)

ax = plt.subplot(3, 1, 2)
ax.set_title("Red data has been artificially reduced")
plt.scatter(X[:, 0], X[:, 1], s=40, c=y, cmap=plt.cm.Spectral)
plt.xlim((-1.5, 2.5))
plt.ylim((-1.5, 1.5))

newclass = ClassBalancer(random_state=0, verbose=True, frac=1.0)
newx, newy = newclass.fit_transform(X, y)

ax = plt.subplot(3, 1, 3)
ax.set_title("Classes are now balanced")
plt.scatter(newx[:, 0], newx[:, 1], s=40, c=newy, cmap=plt.cm.Spectral)
plt.xlim((-1.5, 2.5))
plt.ylim((-1.5, 1.5))

plt.show()
```

**Output**

```
Determining class statistics...
2 classes detected: {0: 52, 1: 200} with weights: {0: 2.4230769230769229, 1: 0.63}
Start resampling ...
3 Tomek links found.
Under-sampling performed, removed tomek links: Counter({1: 197, 0: 52})
Under-sampling performed, new total: Counter({1: 126, 0: 52})
Determining class statistics...
2 classes detected: {0: 52, 1: 126} with weights: {0: 2.4230769230769229, 1: 0.63}
Generated 73 new samples ...
Over-sampling performed: Counter({1.0: 126, 0.0: 125})
```

![Class Balancer on Moon data set](/example/balanced_data.png?raw=true "Balanced Data")
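The log above reports `3 Tomek links found.` A Tomek link is a pair of points from different classes that are each other's nearest neighbor. Here is a minimal, self-contained sketch of how such links can be detected with scikit-learn's `NearestNeighbors`; it is an illustration, not the package's internal implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_links(X, y):
    """Return index pairs (i, j) where i and j are each other's
    nearest neighbor but carry different labels -- a Tomek link."""
    nn = NearestNeighbors(n_neighbors=1).fit(X)
    # With no query set passed, each point's own index is excluded,
    # so column 0 is the true nearest neighbor.
    nearest = nn.kneighbors(return_distance=False)[:, 0]
    links = []
    for i, j in enumerate(nearest):
        if y[i] != y[j] and nearest[j] == i and i < j:
            links.append((i, j))
    return links

X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 1, 1, 1])
print(tomek_links(X, y))  # [(0, 1)]: mutual nearest neighbors, different labels
```

Because Tomek links sit on the boundary between classes, deleting them removes boundary noise rather than arbitrary majority points, which is why the balancer prefers them over plain random undersampling.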