# Feature-Selection

**Repository Path**: initialdream1659/Feature-Selection

## Basic Information

- **Project Name**: Feature-Selection
- **Description**: Feature selection algorithm based on a self-selected algorithm, loss function and validation method
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2018-09-13
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# MLFeatureSelection

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![PyPI version](https://badge.fury.io/py/MLFeatureSelection.svg)](https://pypi.org/project/MLFeatureSelection/)

General feature selection based on a chosen machine learning algorithm and evaluation method

**Diverse, flexible and easy to use**

More feature selection methods will be included in the future!

## Quick Installation

```bash
pip3 install MLFeatureSelection
```

## Modules in version 0.0.7

- Module for selecting features based on a greedy algorithm (`from MLFeatureSelection import sequence_selection`)
- Module for removing features based on feature importance (`from MLFeatureSelection import importance_selection`)
- Module for removing features based on the correlation coefficient (`from MLFeatureSelection import coherence_selection`)
- Module for reading feature combinations from a log file (`from MLFeatureSelection.tools import readlog`)

## Module Usage

- sequence_selection

```python
from MLFeatureSelection import sequence_selection
from sklearn.linear_model import LogisticRegression

sf = sequence_selection.Select(Sequence=True, Random=True, Cross=False)
sf.ImportDF(df, label='Label') # import dataframe and label name
sf.ImportLossFunction(lossfunction, direction='ascend') # import loss function handle and optimization direction: 'ascend' for AUC, ACC; 'descend' for logloss, etc.
sf.InitialNonTrainableFeatures(notusable) # features in the dataframe that are not trainable: user_id, strings, etc.
sf.InitialFeatures(initialfeatures) # initial feature combination as a list
sf.GenerateCol() # generate candidate features for selection
sf.clf = LogisticRegression() # set the selected algorithm; can be any estimator
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function handle; returns the best feature combination
```

- importance_selection

```python
from MLFeatureSelection import importance_selection
import xgboost as xgb

sf = importance_selection.Select()
sf.ImportDF(df, label='Label') # import dataframe and label name
sf.ImportLossFunction(lossfunction, direction='ascend') # import loss function and optimization direction
sf.InitialFeatures() # initial feature combination
sf.SelectRemoveMode(batch=2) # remove 2 features per iteration
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function; returns the best feature combination
```

- coherence_selection

```python
from MLFeatureSelection import coherence_selection
import xgboost as xgb

sf = coherence_selection.Select()
sf.ImportDF(df, label='Label') # import dataframe and label name
sf.ImportLossFunction(lossfunction, direction='ascend') # import loss function and optimization direction
sf.InitialFeatures() # initial feature combination
sf.SelectRemoveMode(batch=2) # remove 2 features per iteration
sf.clf = xgb.XGBClassifier()
sf.SetLogFile('record.log') # log file
sf.run(validate) # run with the validation function; returns the best feature combination
```

- log reader

```python
from MLFeatureSelection.tools import readlog

logfile = 'record.log'
logscore = 0.5 # any score that appears in the log file
features_combination = readlog(logfile, logscore)
```

- format of validate and lossfunction

Define your own:

**validate**: validation method as a function, e.g. k-fold validation, last-time-section validation, random-sampling validation, etc.

**lossfunction**: model performance evaluation method, e.g. logloss, AUC, accuracy, etc.

```python
import numpy as np

def validate(X, y, features, clf, lossfunction):
    """Define your own validation function with the five inputs
    X, y, features, clf, lossfunction.
    clf is set by SetClassifier(), lossfunction is imported earlier,
    and features are generated automatically.
    The function returns the score and the trained classifier.
    """
    clf.fit(X[features], y)
    y_pred = clf.predict(X[features])
    score = lossfunction(y_pred, y)
    return score, clf

def lossfunction(y_pred, y_test):
    """Define your own loss function with y_pred and y_test.
    Returns a score.
    """
    return np.mean(y_pred == y_test)
```
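The contract above trains and evaluates on the same data for brevity. As a more realistic variant, here is a minimal sketch of a 5-fold `validate` paired with an AUC `lossfunction`. It follows the same five-argument contract, but the choice of sklearn's `KFold` and `roc_auc_score` (and `predict_proba` as the score source) is an illustrative assumption, not something the package mandates.

```python
# A 5-fold validate sketch. KFold / roc_auc_score / predict_proba are
# illustrative choices, not requirements of MLFeatureSelection.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def lossfunction(y_pred, y_test):
    # AUC improves as it grows, so use direction='ascend'
    return roc_auc_score(y_test, y_pred)

def validate(X, y, features, clf, lossfunction):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        clf.fit(X_train[features], y_train)
        y_score = clf.predict_proba(X_test[features])[:, 1]
        scores.append(lossfunction(y_score, y_test))
    return np.mean(scores), clf  # mean fold score and the last fitted clf
```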
## DEMO

More examples are included in the example folder:

- Demo containing all modules ([demo](https://github.com/duxuhao/Feature-Selection/blob/master/Demo.py))
- Simple Titanic with 5-fold validation, evaluated by accuracy ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/titanic))
- Demo for the S1 and S2 score improvement in the JData 2018 purchase-time prediction competition ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/JData2018))
- Demo for IJCAI 2018 CTR prediction ([demo](https://github.com/duxuhao/Feature-Selection/tree/master/example/IJCAI-2018))
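For orientation before the parameter reference below, here is a minimal self-contained sketch of a full `sequence_selection` run in the spirit of the demos above. The synthetic dataframe, the column names and the accuracy-based `lossfunction` are made up for illustration; only the `sf.*` calls come from the API documented in this README.

```python
# End-to-end sketch on synthetic data; columns and data are made up.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from MLFeatureSelection import sequence_selection

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4), columns=['f1', 'f2', 'f3', 'f4'])
df['user_id'] = np.arange(200)                       # non-trainable column
df['Label'] = (df['f1'] + df['f2'] > 1).astype(int)  # toy binary label

def lossfunction(y_pred, y_test):
    return np.mean(y_pred == y_test)  # accuracy, hence direction='ascend'

def validate(X, y, features, clf, lossfunction):
    clf.fit(X[features], y)
    return lossfunction(clf.predict(X[features]), y), clf

sf = sequence_selection.Select(Sequence=True, Random=False, Cross=False)
sf.ImportDF(df, label='Label')
sf.ImportLossFunction(lossfunction, direction='ascend')
sf.InitialNonTrainableFeatures(['user_id', 'Label'])
sf.InitialFeatures([])  # empty list: start forward search from nothing
sf.GenerateCol()
sf.clf = LogisticRegression()
sf.SetLogFile('record.log')
best_features = sf.run(validate)
```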
## PLAN

- A better API introduction will be completed before the end of 06/2018

## This feature selection method achieved

- **1st** in Rong360 -- https://github.com/duxuhao/rong360-season2
- **6th** in JData-2018 (Peter Du) -- https://github.com/duxuhao/JData-2018
- **12th** in IJCAI-2018 1st round -- https://github.com/duxuhao/IJCAI-2018-2

## Function Parameters

### sf = sequence_selection.Select(Sequence=True, Random=True, Cross=True)

#### Parameters:

**Sequence** (_bool, optional, default=True_) - switch for sequence selection, which includes forward, backward and simulated-annealing selection

**Random** (_bool, optional, default=True_) - switch for random selection of feature combinations

**Cross** (_bool, optional, default=True_) - switch for cross-term generation; requires sf.ImportCrossMethod() to be set afterwards

### sf.ImportDF(df, label)

#### Parameters:

**df** (_pandas.DataFrame_) - dataframe that includes all features

**label** (_str_) - name of the label column

### sf.ImportLossFunction(lossfunction, direction)

#### Parameters:

**lossfunction** (_function handle_) - handle of the loss function; the function should return the score as a float (logloss, AUC, etc.)

**direction** (_str, 'ascend'/'descend'_) - direction of improvement: 'descend' for logloss, 'ascend' for AUC, etc.

### sf.InitialFeatures(features)

#### Parameters:

**features** (_list, optional, default=[]_) - list of the initial feature combination; an empty list makes the code start from nothing, while a list of all trainable features makes the code start with backward searching

### sf.InitialNonTrainableFeatures(features) # only for sequence selection

#### Parameters:

**features** (_list_) - list of features that are not trainable (label name, strings, datetimes, etc.)

### sf.GenerateCol(key=None, selectstep=1) # only for sequence selection

#### Parameters:

**key** (_str, optional, default=None_) - only features containing this keyword will be selected; defaults to None (no filtering)

**selectstep** (_int, optional, default=1_) - step size for feature selection

### sf.SelectRemoveMode(frac=1, batch=1, key='')

#### Parameters:

**frac** (_float, optional, default=1_) - fraction of all features to delete; by default the batch setting is used instead

**batch** (_int, optional, default=1_) - number of features to delete in each iteration

**key** (_str, optional, default=''_) - only delete features containing this keyword

### sf.ImportCrossMethod(CrossMethod)

#### Parameters:

**CrossMethod** (_dict_) - different cross methods, such as addition, division, multiplication and subtraction (see the sketch at the end of this README)

### sf.AddPotentialFeatures(features)

#### Parameters:

**features** (_list, optional, default=[]_) - list of strong features; acts as the switch for simulated annealing

### sf.SetTimeLimit(TimeLimit=inf)

#### Parameters:

**TimeLimit** (_float, optional, default=inf_) - maximum running time, in minutes

### sf.SetFeaturesLimit(FeaturesLimit=inf)

#### Parameters:

**FeaturesLimit** (_int, optional, default=inf_) - maximum number of features

### sf.SetClassifier(clf)

#### Parameters:

**clf** (_predictor_) - classifier or estimator from sklearn, xgboost, lightgbm, etc.; needs to match the validate function

### sf.SetLogFile(logfile)

#### Parameters:

**logfile** (_str_) - log file name

### sf.run(validate)

#### Parameters:

**validate** (_function handle_) - function that returns the evaluation score and the predictor; it takes the feature dataset X, the label series y, the features in use, the predictor and the lossfunction handle as inputs

## Algorithm details (selecting features based on greedy algorithm)

![Procedure](https://github.com/duxuhao/Feature-Selection/blob/master/Procedure0.png)
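The parameter reference above says `sf.ImportCrossMethod()` expects a dict of cross methods, but this README does not spell out its format. The sketch below assumes a mapping from an operator symbol (used to name the generated cross term) to a two-argument function; treat it as a hedged guess and check the repository demos for the authoritative layout.

```python
# Assumed CrossMethod layout: symbol -> binary function on feature columns.
# This format is an assumption; verify against the repository's Demo.py.
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

def times(x, y):
    return x * y

def divide(x, y):
    return (x + 0.001) / (y + 0.001)  # small offset avoids division by zero

CrossMethod = {'+': add,
               '-': subtract,
               '*': times,
               '/': divide}

sf.ImportCrossMethod(CrossMethod)  # requires Cross=True in Select()
```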