# A Fast XGBoost Feature Selection Algorithm (plus other sklearn tree-based classifiers)
Automated processes like Boruta showed early promise, as they were able to provide superior performance with Random Forests, but they have some deficiencies, including slow computation time, especially on high-dimensional data. Run time aside, Boruta performs well with Random Forests but poorly with other algorithms such as boosting or neural networks. Regularization approaches such as LASSO, elastic net, and ridge regression have similar deficiencies: they perform well for linear regressions but poorly with other modern algorithms.
I am proposing and demonstrating a feature selection algorithm (called BoostARoota) in a similar spirit to Boruta, utilizing XGBoost as the base model rather than a Random Forest. The algorithm runs in a fraction of the time Boruta takes and has superior performance on a variety of datasets. While the spirit is similar to Boruta, BoostARoota takes a slightly different approach to the removal of attributes that executes much faster.
The easiest way to install is with pip:

```
$ pip install boostaroota
```
This module is built for use in a similar manner to sklearn, with `fit()`, `transform()`, etc. The package requires X to be one-hot encoded (OHE), so the pandas function `pd.get_dummies(X)` may be helpful, as it determines which variables are categorical and converts them into dummy variables. This package relies on pandas under the hood, so data must be passed in as a pandas DataFrame.
Assuming you have X and Y split, you can run the following:
```python
from boostaroota import BoostARoota
import pandas as pd

# OHE the variables - BoostARoota may break if not done
x = pd.get_dummies(x)

# Specify the evaluation metric: can use whichever you like as long as it is recognized by XGBoost
# EXCEPTION: multi-class currently only supports "mlogloss", so it must be passed in as eval_metric
br = BoostARoota(metric='logloss')

# Fit the model for the subset of variables
br.fit(x, y)

# Can look at the important variables - will return a pandas series
br.keep_vars_

# Then modify the dataframe to only include the important variables
br.transform(x)
```
It's really that simple! Of course, as we build more functionality there may be a few more steps. Keep in mind that since you are one-hot encoding, if you have a numeric variable that is imported by Python as a character, `pd.get_dummies()` will convert each distinct value into its own column. This can cause your DataFrame to explode in size, giving unexpected results and long run times.
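The column explosion described above is easy to reproduce and to guard against; a minimal sketch (the `age`/`color` frame below is a made-up example, not from the package):

```python
import pandas as pd

# Hypothetical frame where a numeric column ("age") was read in as strings
df = pd.DataFrame({"age": ["21", "35", "44"], "color": ["red", "blue", "red"]})

# get_dummies treats the string-typed "age" as categorical:
# every distinct value becomes its own dummy column
exploded = pd.get_dummies(df)

# Coercing to numeric first keeps "age" as a single column
df["age"] = pd.to_numeric(df["age"])
ohe = pd.get_dummies(df)
```

On a real dataset with thousands of distinct "numeric strings", the first form can multiply the column count dramatically, so checking `dtypes` before calling `pd.get_dummies()` is worthwhile.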
### New as of 1/22/2018: you can insert any sklearn tree-based learner into BoostARoota

Please be aware that this hasn't been fully tested to determine which parameters (cutoff, iterations, etc.) are optimal. Currently, that will require some trial and error on the user's part.
For example, to use another classifier, you initialize the object and then pass it into the BoostARoota object like so:
```python
from sklearn.ensemble import ExtraTreesClassifier
from boostaroota import BoostARoota

clf = ExtraTreesClassifier()
br = BoostARoota(clf=clf)
new_train = br.fit_transform(x, y)
```
You can also view a complete demo here.
The default parameters were chosen to work well across the widest range of input dataframes. However, there are cases where other values may be more optimal.
Similar in spirit to Boruta, BoostARoota creates shadow features, but modifies the removal step.
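The shadow-feature idea can be sketched roughly as follows. This is an illustration of the general technique, not BoostARoota's exact implementation: the `cutoff` value, the use of `ExtraTreesClassifier`, and the `shadow_select` helper are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

def shadow_select(X, y, cutoff=4, seed=0):
    """Keep features whose importance beats the mean shadow importance / cutoff.
    Illustration only -- the cutoff and base model here are assumptions."""
    rng = np.random.default_rng(seed)
    # Create a shuffled "shadow" copy of every column: shadows carry the same
    # marginal distribution as the real features but no relationship to y
    shadows = X.apply(lambda col: rng.permutation(col.values))
    shadows.columns = ["shadow_" + c for c in X.columns]
    both = pd.concat([X, shadows], axis=1)

    # Fit a tree-based model on real + shadow features and read importances
    clf = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    clf.fit(both, y)
    imp = pd.Series(clf.feature_importances_, index=both.columns)

    # Drop features that cannot clearly beat the noise baseline
    threshold = imp[shadows.columns].mean() / cutoff
    return [c for c in X.columns if imp[c] > threshold]

# Synthetic demo data
Xa, ya = make_classification(n_samples=300, n_features=8, n_informative=3,
                             n_redundant=0, random_state=1)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(8)])
keep = shadow_select(X, ya)
```

Because the shadows are pure noise by construction, their importances give a data-driven baseline: any real feature that cannot outscore that baseline is a candidate for removal.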
BoostARoota is shortened to BAR, and the table below uses the LSVT dataset from the UCI repository. The algorithm has been tested on other datasets. If you are interested in the specifics of the testing, please take a look at the testBAR.py script. The basics are that it is run through 5-fold CV, with the model selection performed on the training set and then predicting on the held-out test set. It is done this way to avoid overfitting the feature selection process.
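The evaluation protocol described above, with selection confined to the training fold, can be sketched like this (a generic illustration, not the actual testBAR.py script; a simple importance cutoff stands in for BoostARoota here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

losses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Feature selection fitted on the training fold ONLY -- the held-out fold
    # never influences which features are kept, which avoids overfitting
    # the selection step (a mean-importance cutoff stands in for BoostARoota)
    sel = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    keep = sel.feature_importances_ > sel.feature_importances_.mean()

    # Refit on the reduced training fold, score on the untouched test fold
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr[:, keep], y_tr)
    proba = clf.predict_proba(X[test_idx][:, keep])
    losses.append(log_loss(y[test_idx], proba))

mean_loss = float(np.mean(losses))
```

Running selection inside each fold is what makes the reported log losses an honest estimate; selecting features on the full dataset first would leak test information into the model.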
All tests are run on a 12-core (hyperthreaded) Intel i7. Future iterations will compare run times on a 28-core Xeon, 120 cores on Spark, and running XGBoost on a GPU.
Data Set | Target | Boruta Time | BoostARoota Time | BoostARoota LogLoss | Boruta LogLoss | All Features LogLoss | BAR >= All |
---|---|---|---|---|---|---|---|
LSVT | 0/1 | 50.289s | 0.487s | 0.5617 | 0.6950 | 0.7311 | Yes |
HR | 0/1 | 33.704s | 0.485s | 0.1046 | 0.1003 | 0.1047 | Yes |
Fraud | 0/1 | 38.619s | 1.790s | 0.4333 | 0.4353 | 0.4333 | Yes |
As can be seen, the speed-up from BoostARoota is around 100x, with equal or better log loss across these datasets. Part of this speed-up is that Boruta runs single-threaded, while BoostARoota (on XGBoost) runs on all 12 cores. How this speed-up scales to larger datasets is not yet known.
This has also been tested on Kaggle's House Prices competition. With nothing done except running BoostARoota and evaluating on RMSE, all features scored 0.15669, while BoostARoota scored 0.1560.
The text file `FS_algo_basics.txt` details how I was thinking through the algorithm and what additional functionality was considered during its creation.
This project has found some initial successes, and there are a number of directions it can head. It would be great to have some additional help if you are willing and able. Whether it is directly contributing to the codebase or just giving some ideas, any help is appreciated. The goal is to make the algorithm as robust as possible. The primary focus right now is on the components under Future Implementations, which are in active development. Please reach out to see if there is anything you would like to contribute in that area, to make sure we aren't duplicating work.
A special thanks to Progressive Leasing for sponsoring this research.