Nearest Shrunken Centroids With Python

Original: https://machinelearningmastery.com/nearest-shrunken-centroids-with-python/

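Nearest Centroids is a linear classification machine learning algorithm.

It predicts the class label for a new example by comparing it with the class-based centroids computed from the training dataset and choosing the class of the nearest one.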
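The idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than the scikit-learn implementation; the helper names fit_centroids() and predict_nearest() and the toy data are assumptions made up for this example.

```python
# minimal nearest centroid classifier written with NumPy (illustrative sketch)
import numpy as np

def fit_centroids(X, y):
    """Return the class labels and one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest(X, classes, centroids):
    """Assign each row to the class with the closest centroid (Euclidean)."""
    # distances has shape (n_rows, n_classes)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

# tiny hand-made dataset: two well separated groups
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
classes, centroids = fit_centroids(X_toy, y_toy)
# one new row near each group
new_rows = np.array([[0.5, 0.5], [5.5, 5.5]])
print(predict_nearest(new_rows, classes, centroids))
```

Each new row is assigned to the group whose mean vector it sits closest to, which is all the basic algorithm does.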

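The Nearest Shrunken Centroids algorithm is an extension that shrinks the class-based centroids toward the centroid of the entire training dataset and removes those input variables that are less useful for discriminating between the classes.

As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables.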
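The shrinkage step can be illustrated with a simplified soft-thresholding sketch. It is an assumption-laden toy: the full method also standardizes each offset by a within-class standard deviation before thresholding, which is omitted here, and the function name shrink_centroids() and the example numbers are hypothetical.

```python
# simplified sketch of centroid shrinkage via soft thresholding
import numpy as np

def shrink_centroids(centroids, overall, threshold):
    """Move each class centroid toward the overall centroid by 'threshold',
    setting offsets smaller than the threshold to zero."""
    offsets = centroids - overall
    shrunk = np.sign(offsets) * np.maximum(np.abs(offsets) - threshold, 0.0)
    return overall + shrunk

# two class centroids over three features; the overall centroid is the origin
centroids = np.array([[1.0, 0.1, -2.0],
                      [-1.0, -0.1, 2.0]])
overall = np.zeros(3)
shrunk = shrink_centroids(centroids, overall, threshold=0.5)
# a feature stays useful only if some class centroid still differs from the overall centroid
kept = np.any(shrunk != overall, axis=0)
print(shrunk)
print(kept)
```

In this toy case the second feature has small offsets in both classes, so both are shrunk to zero and the feature no longer contributes to the distance between a row and any class centroid.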

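In this tutorial, you will discover the Nearest Shrunken Centroids classification machine learning algorithm.

After completing this tutorial, you will know:

The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Let's get started.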

Nearest Shrunken Centroids With Python. Photo by Giuseppe Milo, some rights reserved.

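Tutorial Overview

This tutorial is divided into three parts; they are:

Nearest Centroids Algorithm
Nearest Centroids With Scikit-Learn
Tuning Nearest Centroid Hyperparameters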

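Nearest Centroids With Scikit-Learn

The Nearest Shrunken Centroids algorithm is available in the scikit-learn Python machine learning library via the NearestCentroid class.

The class allows the distance metric used in the algorithm to be configured via the "metric" argument, which defaults to 'euclidean' for the Euclidean distance metric.

This can be changed to other built-in metrics, such as 'manhattan'.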

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean')

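By default, no shrinkage is used, but shrinkage can be specified via the "shrink_threshold" argument, which in this tutorial takes a floating point value between 0 and 1.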

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean', shrink_threshold=0.5)

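We can demonstrate the Nearest Shrunken Centroids with a worked example.

First, let's define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example below creates and summarizes the dataset.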

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns in it.

(1000, 20) (1000,)

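We can fit and evaluate a Nearest Shrunken Centroids model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.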
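To make the test harness concrete, the short sketch below asks the cross-validation object how many train/test evaluations it will produce and inspects the size of the first split. It uses the same dataset settings as the rest of the tutorial; the variable names n_evaluations, train_ix, and test_ix are chosen only for this illustration.

```python
# inspect the repeated stratified k-fold test harness
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# 10 folds repeated three times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# total number of train/test evaluations (folds multiplied by repeats)
n_evaluations = cv.get_n_splits(X, y)
# row indices for the first train/test split
train_ix, test_ix = next(cv.split(X, y))
print('Evaluations: %d' % n_evaluations)
print('Train rows: %d, Test rows: %d' % (len(train_ix), len(test_ix)))
```

Each reported accuracy later in the tutorial is therefore an average over all of these train/test evaluations rather than a single split.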

We will use the default configuration of Euclidean distance and no shrinkage.

...
# create the nearest centroid model
model = NearestCentroid()

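The complete example of evaluating the Nearest Shrunken Centroids model on the synthetic binary classification task is listed below.

# evaluate a nearest centroid model on the dataset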
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

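Running the example evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the mean accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary, for example because of differences in library versions or numerical precision. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 71 percent.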

Mean Accuracy: 0.711 (0.055)

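We may decide to use the Nearest Shrunken Centroids as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.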

# make a prediction with a nearest centroid model on the dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# fit model
model.fit(X, y)
# define new data
row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0
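To see where such a prediction comes from, the sketch below inspects the fitted centroids_ and classes_ attributes and repeats the nearest-centroid lookup by hand. It assumes the default configuration used above (Euclidean distance, no shrinkage); the variable names rows, distances, and manual are chosen only for this illustration.

```python
# inspect fitted centroids and reproduce predictions by hand
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# fit model with default settings
model = NearestCentroid()
model.fit(X, y)
# one centroid per class, one value per input variable
print(model.centroids_.shape)
# Euclidean distance from a few rows to each class centroid
rows = X[:5]
distances = np.linalg.norm(rows[:, None, :] - model.centroids_[None, :, :], axis=2)
# pick the class whose centroid is closest
manual = model.classes_[distances.argmin(axis=1)]
print(manual)
print(model.predict(rows))
```

The hand-computed labels should match the output of predict(), which shows that the fitted model is fully described by its class centroids.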

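Next, we can look at configuring the model hyperparameters.

Tuning Nearest Centroid Hyperparameters

The hyperparameters for the Nearest Shrunken Centroids method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the shrinkage, controlled via the "shrink_threshold" argument. It is a good idea to test values between 0 and 1 on a grid with a step size such as 0.1 or 0.01.

The example below demonstrates this using the GridSearchCV class with the grid of values we have defined.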

# grid search shrinkage for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

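Running the example evaluates each configuration in the grid using repeated cross-validation.

Your specific results may vary, for example because of differences in library versions or numerical precision. Try running the example a few times.

In this case, we can see that we achieved slightly better results than the default, with 71.4 percent vs. 71.1 percent. We can see that the search selected a shrink_threshold value of 0.53.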

Mean Accuracy: 0.714
Config: {'shrink_threshold': 0.53}

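Another key configuration is the distance measure used, which can be chosen based on the distribution of the input variables.

Any of the built-in distance measures can be used, as listed here:

sklearn.metrics.pairwise.pairwise_distances API.

Common distance measures include:

'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'

For more on how these distance measures are calculated, see the tutorial:

4 Distance Measures for Machine Learning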
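As a quick illustration of how two of these measures differ, the sketch below computes the Euclidean and Manhattan distances between the same pair of points using the pairwise_distances() function. The two points are made-up values chosen so the arithmetic is easy to check by hand.

```python
# compare euclidean and manhattan distance on a simple pair of points
import numpy as np
from sklearn.metrics import pairwise_distances
# two points in two dimensions
a = np.array([[0.0, 0.0]])
b = np.array([[3.0, 4.0]])
# straight-line distance: sqrt(3^2 + 4^2)
euclidean = pairwise_distances(a, b, metric='euclidean')[0, 0]
# sum of absolute differences: |3| + |4|
manhattan = pairwise_distances(a, b, metric='manhattan')[0, 0]
print('Euclidean: %.1f' % euclidean)
print('Manhattan: %.1f' % manhattan)
```

The Euclidean distance is 5.0 and the Manhattan distance is 7.0, so the choice of metric can change which centroid counts as the nearest.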

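Given that our input variables are numeric, our dataset only supports 'euclidean' and 'manhattan'.

We can include these metrics in our grid search; the complete example is listed below.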

# grid search shrinkage and distance metric for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
grid['metric'] = ['euclidean', 'manhattan']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

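Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

Your specific results may vary, for example because of differences in library versions or numerical precision. Try running the example a few times.

In this case, we can see that we achieved a better accuracy of about 75 percent by using no shrinkage and the manhattan distance instead of the euclidean distance.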

Mean Accuracy: 0.750
Config: {'metric': 'manhattan', 'shrink_threshold': 0.0}

A good extension to these experiments would be to add data normalization or standardization to the data as part of a modeling Pipeline.
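One way this extension might look is sketched below, using StandardScaler inside a scikit-learn Pipeline so that the scaling is fit on the training folds only. The step names 'scaler' and 'model' are arbitrary labels chosen for this example, and no result is claimed here; whether scaling helps on this dataset is left for you to check.

```python
# evaluate a nearest centroid model with standardized inputs
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define pipeline: standardize the inputs, then fit the model
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('model', NearestCentroid())])
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the whole pipeline so scaling is learned inside each training fold
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

The same pipeline object could be passed to GridSearchCV, in which case the hyperparameter names in the grid would need the step prefix, such as 'model__shrink_threshold'.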


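Summary

In this tutorial, you discovered the Nearest Shrunken Centroids classification machine learning algorithm.

Specifically, you learned:

The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.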
