import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
import pydotplus as pyd
from IPython.display import Image
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from mlxtend.classifier import StackingClassifier
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN
from sklearn.decomposition import PCA
df = pd.read_csv('data.csv',header=None)
df.head()
X = df.iloc[:,0].values.reshape(-1,1)
y = df.iloc[:,1].values.reshape(-1,1)
linear = LinearRegression().fit(X, y)
print('Coefficient:', linear.coef_)
print('Intercept', linear.intercept_)
plt.plot(X, y, '.')
plt.plot(X, linear.predict(X))
plt.show()
1.1.2 Multivariable linear regression: multiple independent variables, one dependent variable
df = pd.read_csv('Delivery.csv',header=None)
df.head()
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values.reshape(-1,1)
linear = LinearRegression().fit(X, y)
print('Coefficient:', linear.coef_)
print('Intercept', linear.intercept_)
x0 = df.iloc[:,0]
x1 = df.iloc[:,1]
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(x0,x1,y,c='r',s=100)
x0, x1 = np.meshgrid(x0, x1)
z = x0*linear.coef_[0][0] + x1*linear.coef_[0][1] + linear.intercept_[0]
ax.plot_surface(x0, x1, z)
plt.show()
1.1.3 Polynomial regression: fits a curve by extracting polynomial features and then applying linear regression
df = pd.read_csv('job.csv')
df.head()
X = df.iloc[:,1].values.reshape(-1,1)
y = df.iloc[:,2].values.reshape(-1,1)
X_poly = PolynomialFeatures(5).fit_transform(X)
linear = LinearRegression().fit(X_poly, y)
plt.plot(X, y, '.')
plt.plot(X, linear.predict(X_poly))
plt.show()
Normal equation method: directly finds the global minimum of the cost function in closed form; time complexity O(n^3), where n is the number of features (see the sketch after these notes)
Feature scaling: brings feature values to a comparable scale
Cross-validation: with a small sample, rotate the folds so every part of the data serves as both training set and test set
Overfitting: the model is too complex; it fits the training set well but the test set poorly (contrast underfitting and a proper fit)
Preventing overfitting: drop noisy features, add more data, regularize
Regularization: L2 regularization (adds the sum of squared coefficients), L1 regularization (adds the sum of absolute coefficients)
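A minimal numpy sketch of the normal equation theta = (X^T X)^(-1) X^T y (the toy arrays below are made up for illustration):
# Normal equation: closed-form global minimizer of the least-squares cost.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])        # made-up feature values
y_toy = np.array([3.1, 5.0, 7.2, 8.9])                # made-up targets, roughly y = 2x + 1
X_b = np.c_[np.ones(len(X_toy)), X_toy]               # prepend a bias column
theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y_toy   # pinv is safer than inv if X^T X is singular
print('Intercept and slope:', theta)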
1.2.1 Ridge regression: originally introduced for the case of more features than samples, where X^T X is not full rank and cannot be inverted; it adds L2 regularization to the cost function. Today it is also used to introduce bias into the estimate to improve accuracy and to handle multicollinearity; it is a biased estimator (see the closed-form sketch after these notes)
Choose the ridge coefficient so that the ridge estimates of the regression coefficients are basically stable and the residual sum of squares does not grow much
1.2.2 LASSO: good at handling multicollinear data; also a biased estimator; adds L1 regularization to the cost function, which drives the coefficients of some (noise) features exactly to zero, whereas ridge only shrinks them toward zero
1.2.3 Elastic Net: combines the regularization characteristics of ridge and LASSO
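Ridge also has a closed form, theta = (X^T X + alpha*I)^(-1) X^T y; the alpha*I term is exactly what restores invertibility when features outnumber samples. A hedged sketch on made-up data:
# Ridge closed form: the alpha*I term makes X^T X + alpha*I invertible.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 8))     # 8 features but only 5 samples: X^T X alone is singular
y_toy = rng.normal(size=5)
alpha = 1.0
theta = np.linalg.solve(X_toy.T @ X_toy + alpha * np.eye(8), X_toy.T @ y_toy)
print('Ridge coefficients:', theta)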
df = pd.read_csv('longley.csv',index_col=0)
df.head()
X = df.iloc[:,1:]
y = df.iloc[:,0]
alphas_test = np.linspace(0.001, 1)
ridge = RidgeCV(alphas=alphas_test, store_cv_values=True).fit(X, y)  # in newer scikit-learn this argument is store_cv_results and cv_values_ becomes cv_results_
lasso = LassoCV().fit(X, y)
elastic = ElasticNetCV().fit(X, y)
print('Alpha of Ridge:', ridge.alpha_)
print('Coefficient of Ridge:', ridge.coef_)
print('\nAlpha of LASSO:', lasso.alpha_)
print('Coefficient of LASSO:', lasso.coef_)
print('\nAlpha of Elastic Net:', elastic.alpha_)
print('Coefficient of Elastic Net:', elastic.coef_)
plt.figure(figsize=(14,4))
plt.subplot(121)
plt.plot(alphas_test, ridge.cv_values_.mean(axis=0))
plt.plot(ridge.alpha_, min(ridge.cv_values_.mean(axis=0)), 'ro')
plt.title('Alpha values and loss function')
plt.subplot(122)
x = np.arange(len(y))
plt.plot(x, y, label='real')
plt.plot(x, ridge.predict(X), 'y', label='Ridge')
plt.plot(x, lasso.predict(X), 'g', label='LASSO')
plt.plot(x, elastic.predict(X), 'r', label='Elastic Net')
plt.legend()
plt.show()
df = pd.read_csv('LR-testSet.csv',header=None)
df.head()
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
logi = LogisticRegression().fit(X, y)
print('Classification report:')
print(classification_report(y, logi.predict(X)))
print('Score:', logi.score(X, y))
plt.plot(X[y==0][0], X[y==0][1], 'bo', label='label 0')
plt.plot(X[y==1][0], X[y==1][1], 'rx', label='label 1')
plt.legend()
x_boundary = np.array([min(X[0]), max(X[0])])
y_boundary = (-logi.intercept_ - x_boundary*logi.coef_[0][0]) / logi.coef_[0][1]
plt.plot(x_boundary, y_boundary, 'k')
plt.show()
2.1.2 Nonlinear logistic regression: extract polynomial features, then apply linear logistic regression (same idea as polynomial regression)
df = pd.read_csv('LR-testSet2.txt',header=None)
df.head()
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_poly = PolynomialFeatures(5).fit_transform(X)
logi = LogisticRegression().fit(X_poly, y)
print('Classification report:')
print(classification_report(y, logi.predict(X_poly)))
print('Score:', logi.score(X_poly, y))
xx, yy = np.meshgrid(np.arange(min(X[0])-1, max(X[0])+1, 0.02),
                     np.arange(min(X[1])-1, max(X[1])+1, 0.02))
zz = np.c_[xx.ravel(), yy.ravel()]
zz_poly = PolynomialFeatures(5).fit_transform(zz)
zz_predict = logi.predict(zz_poly).reshape(xx.shape)
plt.contourf(xx, yy, zz_predict, alpha=0.8)
plt.scatter(X[0], X[1], c=y)
plt.show()
Nearest-neighbour classification, using Euclidean distance
k is usually chosen odd; the algorithm is relatively expensive, and it can be inaccurate when the class distribution is imbalanced (one class much larger than the others)
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
knn = KNeighborsClassifier().fit(X_train, y_train)
print('Classification report:')
print(classification_report(y_test, knn.predict(X_test)))
print('Score:', knn.score(X_test, y_test))
Suited to analyzing discrete data (convert continuous data to discrete first)
Entropy: a measure of uncertainty; the larger the entropy, the greater the uncertainty
ID3 algorithm: splits decision tree nodes by maximizing information gain, with the highest-gain attribute at the root; continuous variables are discretized by choosing the threshold that maximizes information gain
2.3.1 C4.5 algorithm: introduces the gain ratio (information gain divided by split information) to correct ID3's bias toward attributes with many distinct values
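To make the ID3 criterion concrete, a small sketch computing entropy and the information gain of one candidate split (the label arrays are made up):
# Entropy H = -sum(p*log2(p)); information gain = H(parent) - weighted H(children).
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()
parent = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # made-up class labels
left, right = parent[:4], parent[4:]          # one candidate split
gain = entropy(parent) - (len(left)/len(parent)*entropy(left)
                          + len(right)/len(parent)*entropy(right))
print('Information gain of this split:', gain)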
df = pd.read_csv('AllElectronics.csv', index_col=0)
df.head()
X = pd.get_dummies(df.iloc[:,:-1])
y = df.iloc[:,-1]
dtree = DecisionTreeClassifier(criterion='entropy').fit(X,y)
dot = export_graphviz(dtree, feature_names=X.columns, class_names=dtree.classes_, filled=True, rounded=True)  # dtree.classes_ is sorted to match the tree's internal class order; y.unique() may not be
graph = pyd.graph_from_dot_data(dot)
graph.set_size("5,5!")
Image(graph.create_png())
2.3.2 CART algorithm: selects features by minimizing Gini impurity and recursively builds a binary decision tree; the attribute with the largest Gini gain becomes the root; effective on small datasets
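For comparison with entropy, a hedged sketch of Gini impurity, 1 - sum(p_k^2) (the labels are made up):
# Gini impurity: 0 for a pure node, 0.5 at worst for two balanced classes.
def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()
print('Mixed node:', gini(np.array([1, 1, 0, 0])))   # 0.5
print('Pure node:', gini(np.array([1, 1, 1, 1])))    # 0.0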
df = pd.read_csv('cart.csv', index_col=0)
df.head()
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
dtree = DecisionTreeClassifier().fit(X,y)
dot = export_graphviz(dtree, feature_names=X.columns, class_names=['No','Yes'], filled=True, rounded=True)
graph = pyd.graph_from_dot_data(dot)
graph.set_size("3,5!")
Image(graph.create_png())
2.3.3 Pruning: pre-pruning and post-pruning; fewer nodes, lower algorithmic complexity, less overfitting
def plotLr(model, title, a):
    print(f'Score for {title}: {model.score(X_test, y_test)}')
    xx, yy = np.meshgrid(np.arange(min(X[0])-1, max(X[0])+1, 0.02),
                         np.arange(min(X[1])-1, max(X[1])+1, 0.02))
    zz_predict = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(a)
    plt.title(title)
    plt.contourf(xx, yy, zz_predict, alpha=0.8)
    plt.scatter(X[0], X[1], c=y)
df = pd.read_csv('LR-testSet2.txt',header=None)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y)
dtree = DecisionTreeClassifier().fit(X_train, y_train)
pruning_dtree = DecisionTreeClassifier(max_depth=6,min_samples_split=4).fit(X_train, y_train)
plt.figure(figsize=(14,4))
print('Score of train:', dtree.score(X_train, y_train))
plotLr(dtree,'Non-pruning',121)
print('Score of train:', pruning_dtree.score(X_train, y_train))
plotLr(pruning_dtree,'Pruning',122)
Well suited to text data; uses prior probabilities. With many features, full Bayesian estimation would require an enormous number of statistics, so naive Bayes assumes the features are conditionally independent
2.4.1 The Bernoulli model ignores repeated words, the multinomial model counts them, the mixed model counts them at training time but not at test time, and the Gaussian model suits continuous variables
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
ber_nb = BernoulliNB().fit(X_train, y_train)
mul_nb = MultinomialNB().fit(X_train, y_train)
gau_nb = GaussianNB().fit(X_train, y_train)
print('BernoulliNB score:', ber_nb.score(X_test, y_test))
print('Confusion matrix:\n',confusion_matrix(y_test, ber_nb.predict(X_test)))
print('\nMultinomialNB score:', mul_nb.score(X_test, y_test))
print('Confusion matrix:\n',confusion_matrix(y_test, mul_nb.predict(X_test)))
print('\nGaussianNB score:', gau_nb.score(X_test, y_test))
print('Confusion matrix:\n',confusion_matrix(y_test, gau_nb.predict(X_test)))
2.4.2 Bag of words: represents a text as an unordered collection of words with their occurrence counts; vectorize the text with CountVectorizer before modelling, optionally filtering stop words
def printBow(i):
    word = cv.get_feature_names_out()[order[i]]   # get_feature_names() was removed in scikit-learn 1.2
    freq = count_total[order[i]]
    print(f'No.{i+1}: The word [{word}] occurs [{freq}] times.')
news = datasets.fetch_20newsgroups(subset='all')
X, y = news.data[:3000], news.target[:3000]
cv = CountVectorizer(stop_words='english')
X_cv = cv.fit_transform(X)
count_total = X_cv.toarray().sum(axis=0)
order = np.argsort(-count_total)
print('Top 5 words in 3000 news articles (by frequency):')
for i in range(5): printBow(i)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)
mul_nb = MultinomialNB().fit(X_train_cv, y_train)
print('\nScore for train:',mul_nb.score(X_train_cv, y_train))
print('Score for test:',mul_nb.score(X_test_cv, y_test))
2.4.3 TF-IDF: compute the term frequency (TF), filter stop words, then weight by the inverse document frequency (IDF), which is inversely related to how common a word is; multiplying TF by IDF scores each word's importance. TfidfVectorizer performs the vectorization and TF-IDF weighting in one step
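To show roughly what TfidfVectorizer computes, a by-hand sketch using the plain textbook formula idf = log(N/df); scikit-learn's actual variant adds smoothing and L2 normalization, so its numbers differ:
# Simplified TF-IDF: tf = raw count in a document, idf = log(N / df).
docs = [['apple', 'banana', 'apple'],     # made-up tokenized corpus
        ['banana', 'cherry'],
        ['apple', 'cherry', 'cherry']]
N = len(docs)
for word in sorted({w for d in docs for w in d}):
    df_count = sum(word in d for d in docs)   # number of documents containing the word
    idf = np.log(N / df_count)
    tfidf = [d.count(word) * idf for d in docs]
    print(f'{word}: idf={idf:.3f}, tf-idf per doc={np.round(tfidf, 3)}')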
def printTf(i):
    weight = X_tf.toarray()[i]
    order = np.argsort(-weight)
    word_list = []
    for j in range(5):
        word = tf.get_feature_names_out()[order[j]]   # get_feature_names() was removed in scikit-learn 1.2
        word_list.append(word)
    print(f'In news {i+1}: {word_list}')

def printCv(i):
    count = X_cv.toarray()[i]
    order = np.argsort(-count)
    word_list = []
    for j in range(5):
        word = cv.get_feature_names_out()[order[j]]
        word_list.append(word)
    print(f'In news {i+1}: {word_list}')
news = datasets.fetch_20newsgroups(subset='all')
X, y = news.data[:3000], news.target[:3000]
tf = TfidfVectorizer(stop_words='english')
X_tf = tf.fit_transform(X)
print('Top 5 key words (TFIDF):')
for i in range(5): printTf(i)
print('\nTop 5 key words (frequency):')
for i in range(5): printCv(i)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_tf = tf.transform(X_train)
X_test_tf = tf.transform(X_test)
mul_nb = MultinomialNB().fit(X_train_tf, y_train)
print('\nScore for train:',mul_nb.score(X_train_tf, y_train))
print('Score for test:',mul_nb.score(X_test_tf, y_test))
Well suited to complex classification such as image recognition. An SVM finds the hyperplane separating the two classes that maximizes the margin; it was the strongest general-purpose algorithm before deep learning took off (2012)
2.5.1 Linearly separable case
Training reduces to convex optimization: 1. unconstrained problems, Fermat's theorem; 2. equality-constrained problems, Lagrange multipliers; 3. inequality-constrained problems, the KKT conditions (a generalization of Lagrange multipliers)
The Lagrangian formulation is further converted into its dual problem, which the SMO algorithm optimizes
df = pd.read_csv('LR-testSet.csv',header=None)
X, y = df.iloc[:,:-1], df.iloc[:,-1]
svc = SVC(kernel='linear').fit(X, y)
print('Some support vectors:\n', svc.support_vectors_[:5])
print('Index of the support vectors:', svc.support_[:5])
print('Number of support vectors (on each side):', svc.n_support_)
x = np.array([min(X[0]), max(X[0])])
k = -svc.coef_[0][0]/svc.coef_[0][1]
d = -svc.intercept_/svc.coef_[0][1]
y_boundary = k*x + d
v1 = svc.support_vectors_[5]
v2 = svc.support_vectors_[-1]
y_v1 = k*x + (v1[1] - k*v1[0])
y_v2 = k*x + (v2[1] - k*v2[0])
plt.plot(X[y==0][0], X[y==0][1], 'bo', label='label 0')
plt.plot(X[y==1][0], X[y==1][1], 'rx', label='label 1')
plt.plot(x, y_boundary, 'k', label='Decision boundary')
plt.plot(x, y_v1, 'r--', label='Support vector 1')
plt.plot(x, y_v2, 'b--', label='Support vector 2')
plt.legend()
plt.show()
In the linearly non-separable case: introduce slack variables and a penalty function
The aim is as few misclassified points as possible, each as close to the decision boundary as possible; this adapts the dual problem to the non-separable case
2.5.2 Nonlinear case: map the nonlinear problem from the low-dimensional space into a high-dimensional space, where it becomes a linear problem
The mapping can trigger the curse of dimensionality and greatly increase computation time, so kernel functions are introduced to handle the nonlinear mapping implicitly: the degree-h polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel (compared in a sketch after the demo below)
SVM strengths: 1. the model's complexity depends on the number of support vectors, so it does not overfit easily 2. the model depends entirely on the support vectors, not on the remaining points 3. if training yields few support vectors, the model generalizes well
df = pd.read_csv('LR-testSet2.txt',header=None)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y)
svc = SVC().fit(X_train, y_train)
print('Score of train:', svc.score(X_train, y_train))
plotLr(svc,'SVC',111)
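The demo above used SVC's default RBF kernel; a quick hedged comparison of the kernels named earlier on the same split (default degree and gamma, chosen only for illustration):
# Try each kernel from the notes above on the train/test split already in scope.
for kernel in ['poly', 'rbf', 'sigmoid']:
    svc_k = SVC(kernel=kernel).fit(X_train, y_train)
    print(f'{kernel}: test score = {svc_k.score(X_test, y_test):.3f}')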
That is, deep learning, which has grown rapidly in recent years with big data, compute, and better algorithms
Single-layer perceptron (SLP): cannot solve nonlinear problems such as XOR
Linear neural network: structured like the perceptron, but the activation is changed from the sign function to the linear purelin function (y = x); adding nonlinear input terms lets it solve XOR
Delta learning rule: a continuous-perceptron learning rule based on gradient descent
BP neural network (Back Propagation Neural Network): propagates errors backwards, solving the training problem for multi-layer networks; the architecture has an input layer, hidden layers, and an output layer
Common activation functions: Sigmoid (the logistic function, range (0,1)), Tanh (hyperbolic tangent, range (-1,1)), Softsign (smoother than Tanh, range (-1,1)), ReLU (rectified linear unit, the most common; see the sketch below)
Multi-layer perceptron (MLP): applies the BP algorithm and contains hidden layers
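A small sketch of the four activation functions listed above, written as plain numpy formulas:
# Standard formulas for the activations named in the notes.
z = np.linspace(-5, 5, 200)
activations = {
    'sigmoid': 1 / (1 + np.exp(-z)),    # range (0, 1)
    'tanh': np.tanh(z),                 # range (-1, 1)
    'softsign': z / (1 + np.abs(z)),    # range (-1, 1), flatter tails than tanh
    'relu': np.maximum(0, z),           # max(0, z), the most common choice
}
for name, value in activations.items():
    plt.plot(z, value, label=name)
plt.legend()
plt.show()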
digits = datasets.load_digits()
X, y = digits.data, digits.target
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y)
mlp = MLPClassifier().fit(X_train, y_train)
print('Confusion matrix:')
print(confusion_matrix(y_test, mlp.predict(X_test)))
print('\nClassification report:')
print(classification_report(y_test, mlp.predict(X_test)))
Combine several learners to obtain a better one
2.7.1 Bagging: short for bootstrap aggregating; draw samples with replacement to form several datasets, fit a model on each, and combine them by voting. Suits complex data; the individual learners have no strong dependence on one another (a bootstrap sketch follows)
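Bootstrap resampling itself is just drawing with replacement; a minimal sketch:
# One bootstrap sample: duplicates are expected, and on average only about
# 63% (1 - 1/e) of the distinct original samples appear in each draw.
rng = np.random.default_rng(0)
idx = rng.choice(10, size=10, replace=True)
print('Bootstrap indices:', idx)
print('Distinct fraction:', len(set(idx)) / 10)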
def plotIris(model, title, a):
    print(f'Score for {title}: {model.score(X_test, y_test)}')
    xx, yy = np.meshgrid(np.arange(min(X[:,0])-1, max(X[:,0])+1, 0.02),
                         np.arange(min(X[:,1])-1, max(X[:,1])+1, 0.02))
    zz_predict = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(a)
    plt.title(title)
    plt.contourf(xx, yy, zz_predict)
    plt.scatter(X[:,0], X[:,1], c=y)
iris = datasets.load_iris()
X = iris.data[:,:2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
knn = KNeighborsClassifier().fit(X_train, y_train)
dtree = DecisionTreeClassifier().fit(X_train, y_train)
bagging_knn = BaggingClassifier(knn, n_estimators=100).fit(X_train, y_train)
bagging_dtree = BaggingClassifier(dtree, n_estimators=100).fit(X_train, y_train)
plt.figure(figsize=(14,8))
plotIris(knn,'knn',221)
plotIris(bagging_knn,'bagging knn',223)
plotIris(dtree,'dtree',222)
plotIris(bagging_dtree,'bagging dtree',224)
2.7.2 Random forest: RF = decision trees + bagging + random feature selection. Bagging generates the samples, a random subset of features is used to build each CART tree, and the many trees vote on the result; usually more accurate than a single decision tree
df = pd.read_csv('LR-testSet2.txt',header=None)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y)
dtree = DecisionTreeClassifier().fit(X_train, y_train)
RF = RandomForestClassifier().fit(X_train, y_train)
plt.figure(figsize=(14,4))
plotLr(dtree,'dtree',121)
plotLr(RF,'RF',122)
2.7.3 AdaBoost (a boosting algorithm): when resampling, increase the weights of the samples misclassified by the previous weak classifier, then train the next weak classifier; repeat until a target error rate or the maximum number of iterations is reached, then form the final strong classifier, in which weak classifiers with lower error rates receive larger weights. The individual classifiers depend strongly on one another (one round of the weight update is sketched below)
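One round of the AdaBoost weight update, sketched by hand with labels in {-1, +1} (both arrays are made up):
# alpha = 0.5*ln((1-err)/err); misclassified samples are up-weighted for the next learner.
y_true = np.array([1, 1, -1, -1, 1])      # made-up labels
y_pred = np.array([1, -1, -1, -1, 1])     # made-up weak-learner output (one mistake)
w = np.full(len(y_true), 1/len(y_true))   # start from uniform sample weights
err = w[y_pred != y_true].sum()           # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)     # this learner's weight in the final vote
w = w * np.exp(-alpha * y_true * y_pred)  # wrong samples grow, correct ones shrink
w /= w.sum()                              # renormalize
print('Learner weight alpha:', alpha)
print('Updated sample weights:', w)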
df = pd.read_csv('LR-testSet2.txt',header=None)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y)
dtree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
adaboost = AdaBoostClassifier(dtree).fit(X_train, y_train)
plt.figure(figsize=(14,4))
plotLr(dtree,'dtree',121)
plotLr(adaboost,'adaboost',122)
2.7.4 Stacking: several different classifiers predict on the training set, and their predictions are fed into a second-level classifier that produces the final prediction
Voting: the predictions of the different classifiers are voted on directly, with no second-level classifier
def plotIrisCV(model, title, a):
    score = cross_val_score(model, X, y, cv=3).mean()
    print(f'CV Score for {title}: {score}')
    xx, yy = np.meshgrid(np.arange(min(X[:,0])-1, max(X[:,0])+1, 0.02),
                         np.arange(min(X[:,1])-1, max(X[:,1])+1, 0.02))
    zz_predict = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(a)
    plt.title(title)
    plt.contourf(xx, yy, zz_predict)
    plt.scatter(X[:,0], X[:,1], c=y)
iris = datasets.load_iris()
X = iris.data[:,1:3]
y = iris.target
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
dtree = DecisionTreeClassifier().fit(X, y)
logi = LogisticRegression().fit(X, y)
gau_nb = GaussianNB().fit(X, y)
stacking = StackingClassifier([knn,dtree,logi,gau_nb], logi).fit(X, y)
voting = VotingClassifier([('1',knn),('2',dtree),('3',logi),('4',gau_nb)]).fit(X, y)
plt.figure(figsize=(14,12))
plotIrisCV(knn,'knn',321)
plotIrisCV(dtree,'dtree',322)
plotIrisCV(logi,'logi',323)
plotIrisCV(gau_nb,'gau_nb',324)
plotIrisCV(stacking,'stacking',325)
plotIrisCV(voting,'voting',326)
data = np.genfromtxt('kmeans.txt')
data[:5]
kmeans = KMeans(4).fit(data)
result = kmeans.predict(data)
centers = kmeans.cluster_centers_
print('Cluster centers:\n', centers)
mark = ['or', 'ob', 'og', 'oy']
for i, d in enumerate(data):
    plt.plot(d[0], d[1], mark[result[i]])
mark = ['*r', '*b', '*g', '*y']
for i, center in enumerate(centers):
    plt.plot(center[0], center[1], mark[i], markersize=20)
xx, yy = np.meshgrid(np.arange(min(data[:,0])-1, max(data[:,0])+1, 0.02),
                     np.arange(min(data[:,1])-1, max(data[:,1])+1, 0.02))
zz = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.8)
plt.show()
K-Means weaknesses: 1. sensitive to the initial centroids and can get stuck in a local minimum 2. the user must choose a suitable k 3. clusters only by distance, so it cannot handle density-based structure 4. converges slowly on large datasets
Optimization 1: run several random initializations, compute the cost of each fit, and keep the clustering with the lowest cost
Optimization 2: choose k with the elbow method (plot the cost function against k; see the sketch after this list)
Optimization 3: use DBSCAN
Optimization 4: use Mini Batch K-means
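A sketch of optimization 2, the elbow method, reusing the kmeans.txt data loaded above (KMeans exposes the within-cluster sum of squares as inertia_):
# Plot cost against k; the bend ("elbow") suggests the cluster count (about 4 here).
inertias = [KMeans(k).fit(data).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, 'o-')
plt.xlabel('k')
plt.ylabel('inertia (cost)')
plt.show()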
3.1.2 Mini Batch K-means: a variant of K-means that trains on a small random subset of the data in each step, greatly reducing computation time; the result is usually only slightly worse than the standard algorithm, making it a good fit for large datasets
data = np.genfromtxt('kmeans.txt')
kmeansMini = MiniBatchKMeans(4).fit(data)
result = kmeansMini.predict(data)   # predict once, not once per point inside the loop
mark = ['or', 'ob', 'og', 'oy']
for i, d in enumerate(data):
    plt.plot(d[0], d[1], mark[result[i]])
mark = ['*r', '*b', '*g', '*y']
for i, center in enumerate(kmeansMini.cluster_centers_):
    plt.plot(center[0], center[1], mark[i], markersize=20)
plt.show()
Density-based: groups regions of sufficiently high density into clusters and can discover clusters of any shape
Given suitable Epsilon and MinPoints values: if the ε-neighbourhood of a point p contains more than MinPoints points, create a new cluster with p as a core point; add every point that is directly density-reachable or density-reachable from a core point to the corresponding cluster; merge clusters whose core points are density-connected; stop when no new point can be added to any cluster
Weaknesses: large datasets demand substantial memory and I/O; works poorly when cluster densities are uneven or inter-cluster spacing varies widely
Strengths: no need to specify the number of clusters; no constraint on cluster shape; takes noise-filtering parameters (Epsilon and MinPoints)
x1, y1 = datasets.make_circles(2000, factor=0.2, noise=0.1)
x2, y2 = datasets.make_blobs(1000, centers=[[1.2,1.2]], cluster_std=0.1)
x = np.concatenate((x1, x2))
y_kmeans = KMeans(3).fit_predict(x)
y_dbscan = DBSCAN(eps=0.2, min_samples=50).fit_predict(x)
ratio = np.sum(y_dbscan == -1) / len(y_dbscan)   # DBSCAN labels noise points as -1
clusters = len(set(y_dbscan)) - (1 if -1 in y_dbscan else 0)
print('Noise ratio:', format(ratio,'.2%'))
print('Estimated number of clusters: %d'% clusters)
plt.figure(figsize=(14,4))
plt.subplot(121)
plt.title('K-Means')
plt.scatter(x[:,0], x[:,1], c=y_kmeans)
plt.subplot(122)
plt.title('DBSCAN')
plt.scatter(x[:,0], x[:,1], c=y_dbscan)
plt.show()
A dimensionality-reduction algorithm, also useful for visualizing high-dimensional data: it finds the most important directions in the data and projects onto them
Steps: preprocess the data (center it), compute the sample covariance matrix and its eigendecomposition, take the eigenvectors belonging to the k largest eigenvalues, and project the original data onto those eigenvectors (done by hand in the sketch below)
Variance measures spread; covariance measures correlation: near 1 positively correlated, near -1 negatively correlated, near 0 uncorrelated
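The steps above done by hand on the iris data, as a check against sklearn's PCA (component signs may be flipped relative to PCA's output, which is normal):
# Manual PCA: center, covariance, eigendecomposition, project onto the top-k eigenvectors.
X_raw = datasets.load_iris().data
X_centered = X_raw - X_raw.mean(axis=0)            # step 1: center
cov = np.cov(X_centered, rowvar=False)             # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)             # step 3: eigendecomposition (ascending order)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # eigenvectors of the 2 largest eigenvalues
X_proj = X_centered @ top2                         # step 4: project
print('Top-2 eigenvalues (variance explained):', np.sort(eigvals)[::-1][:2])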
iris = datasets.load_iris()
X, y = iris.data, iris.target
print('Data shape:', X.shape)
X_2d = PCA(2).fit_transform(X)
X_3d = PCA(3).fit_transform(X)
fig = plt.figure(figsize=(14,4))
ax = fig.add_subplot(121)
ax.scatter(X_2d[:,0], X_2d[:,1], c=y)
plt.title('4D - 2D')
ax = fig.add_subplot(122, projection='3d')
ax.scatter(X_3d[:,0], X_3d[:,1], X_3d[:,2], c=y)
plt.title('4D - 3D')
plt.show()