Machine Learning Notes (2): Ensemble Learning, Prerequisites for Random Forest

Random Forest · Published 2018-12-01 00:07:00

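Every machine learning algorithm can be seen as one way of looking at the data.

It is like looking at a problem or an opinion: each perspective has its reasonable side and its one-sided side. The same holds for machine learning. To understand our data better, we should examine it from as many perspectives as possible. Only then can we expect higher prediction accuracy on new test data, whether the task is classification or regression.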

The process described above is what is called an ensemble.

Ensemble Learning

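Ensemble learning in machine learning means choosing several algorithms, training a model with each of them on the same training data, and then combining their results by voting: the minority yields to the majority, and the result given by most of the algorithms becomes the final decision. This is the core idea of ensemble learning.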

1. Voting

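from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Three different classifiers, combined by majority (hard) vote
voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC()),
    ('dt_clf', DecisionTreeClassifier(random_state=666))],
    voting='hard')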

VotingClassifier

class sklearn.ensemble.VotingClassifier(estimators, voting='hard', weights=None, n_jobs=1, flatten_transform=None)

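The voting parameter deserves some explanation.

Take an example: suppose three models each compute class probabilities for the same binary classification problem:

Model 1: A 99%, B 1%
Model 2: A 49%, B 51%
Model 3: A 49%, B 51%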

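With a plain vote, the sample is classified as B, because two of the three models favor B. This is hard voting.

That is clearly problematic: model 1 is very sure the class is A (99%), while models 2 and 3 can barely tell A from B (49% vs. 51%). In this situation, classifying the sample as A is more reasonable.

This is the motivation for soft voting, which votes by probability: p(A) = (0.99 + 0.49 + 0.49)/3 = 0.657 and p(B) = (0.01 + 0.51 + 0.51)/3 = 0.343. Since p(A) > p(B), the sample should be classified as A.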
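The three-model example can be reproduced in a few lines of plain Python. This is only an illustrative sketch: the helper names hard_vote and soft_vote are made up for this note and are not part of sklearn.

```python
from collections import Counter

# Class probabilities from the example: one dict per model.
probas = [
    {"A": 0.99, "B": 0.01},  # model 1
    {"A": 0.49, "B": 0.51},  # model 2
    {"A": 0.49, "B": 0.51},  # model 3
]

def hard_vote(probas):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(p, key=p.get) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the probabilities per class; the highest average wins."""
    classes = probas[0].keys()
    mean = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(mean, key=mean.get), mean

hard_label = hard_vote(probas)               # "B": two of three models prefer B
soft_label, mean_proba = soft_vote(probas)   # "A": mean p(A) is about 0.657
print(hard_label, soft_label, mean_proba)
```

In sklearn, passing voting='soft' to VotingClassifier selects the probability-averaging behavior. It requires every estimator to provide predict_proba, so SVC has to be created with probability=True.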

2. Bagging

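From the voting point of view, even though there are many machine learning algorithms, there are still not enough of them. So we want to build as many sub-models as possible and combine their opinions. At the same time, the sub-models must differ from one another, otherwise voting is pointless.

We want as many sub-models as possible
The sub-models must differ from one another

So how do we make sure the sub-models differ?

A simple approach: let each sub-model train on only part of the training set. This raises a concern: each sub-model learns from only part of the training data, so won't its prediction accuracy be low? Yes, the accuracy of a single sub-model does drop, but that does not matter.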

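Suppose a single sub-model has an accuracy of 51%.

With m sub-models that each have accuracy p and make their errors independently, the majority vote is correct whenever more than half of the sub-models are correct, so the accuracy of the whole system is: $$P=\sum_{i=\lfloor m/2\rfloor+1}^{m}C_m^ip^i(1-p)^{m-i}$$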

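import numpy as np
from scipy.special import comb

def f(x, n):
    # Probability that at least x of n independent sub-models,
    # each with accuracy 0.51, are correct
    r = 0
    for i in range(x, n + 1):
        r += comb(n, i) * np.power(0.51, i) * np.power(0.49, n - i)
    return r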

f(2,3) = 0.5149980000000001
f(251,500) = 0.6564399889597903

As the code above shows, when each sub-model has an accuracy of 51%, a system with 3 sub-models reaches an accuracy of 51.5%, and a system with 500 sub-models reaches 65.6%.

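How do we take part of the training data, that is, how do we sample?

Sampling with replacement: bagging (more commonly used)
Sampling without replacement: pasting

We call sampling with replacement bagging, and sampling without replacement pasting.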

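Sampling with replacement allows many more models to be trained. Take one fit with 500 samples, drawing 100 at a time: if no sample may be reused, sampling without replacement can train at most 5 sub-models, whereas sampling with replacement can train thousands. Moreover, because pasting can train so few sub-models, how the 500 samples are split into 5 groups of 100 matters and may strongly affect the final result. With bagging, training thousands of sub-models averages out this randomness to some extent.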
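The difference between the two sampling schemes can be sketched with the standard library alone. The sketch assumes, as in the paragraph above, that pasting means splitting the 500 samples into disjoint groups of 100; the function names and the seed are illustrative choices, not anything taken from sklearn.

```python
import random

rng = random.Random(666)        # fixed seed, chosen arbitrarily
indices = list(range(500))      # 500 training samples, identified by index
subset_size = 100

def pasting_subsets(indices, size, rng):
    """Split the shuffled indices into disjoint subsets (no sample reused)."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + size] for i in range(0, len(shuffled), size)]

def bagging_subsets(indices, size, n_models, rng):
    """Draw each subset independently, with replacement."""
    return [rng.choices(indices, k=size) for _ in range(n_models)]

pasted = pasting_subsets(indices, subset_size, rng)
bagged = bagging_subsets(indices, subset_size, 1000, rng)
print(len(pasted), len(bagged))  # 5 disjoint subsets vs. 1000 bootstrap subsets
```

The pasted subsets are exhausted after 5 sub-models, while bagging_subsets can be asked for any number of training sets.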

Out of bag (OOB)

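One issue with sampling with replacement is that, over a finite number of draws, some samples may never be selected. When a sub-model draws as many samples as the training set contains, about 37% of the samples are left out.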

For the mathematical proof, see:

Where the 37% comes from
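The argument is short enough to sketch here, assuming n draws with replacement from a training set of n samples. Each draw misses a given sample with probability 1 - 1/n, and the draws are independent, so the probability that the sample is never drawn is:

$$P(\text{never drawn})=\left(1-\frac{1}{n}\right)^{n}\xrightarrow{\,n\to\infty\,}e^{-1}\approx 0.368$$

For n = 500 the exact value is already about 0.3675, so roughly 37% of the samples stay out of any single bootstrap sample.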

We can use the samples that were never drawn as a validation set. In sklearn, oob_score_ is the score obtained on that validation set.
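To make the idea of never-drawn samples concrete, the following sketch draws one bootstrap sample and collects the indices it missed. It is only a simulation in plain Python: oob_indices is a hypothetical helper and does not reflect how sklearn computes oob_score_ internally.

```python
import random

def oob_indices(n_samples, rng):
    """Draw one bootstrap sample of size n_samples and return the
    indices that were never drawn (the out-of-bag samples)."""
    drawn = set(rng.choices(range(n_samples), k=n_samples))
    return [i for i in range(n_samples) if i not in drawn]

rng = random.Random(0)   # fixed seed, chosen arbitrarily
n = 10_000
oob = oob_indices(n, rng)
oob_fraction = len(oob) / n
print(round(oob_fraction, 3))  # expected to land near 1/e, i.e. about 0.37
```

A sub-model trained on the drawn indices has never seen the samples in oob, which is why they can serve as its validation data.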

Bagging in sklearn (scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)

class sklearn.ensemble.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

base_estimator : object or None, optional (default=None)

The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.

n_estimators : int, optional (default=10)

The number of base estimators in the ensemble.

max_samples : int or float, optional (default=1.0)

The number of samples to draw from X to train each base estimator.

If int, then draw max_samples samples.
If float, then draw max_samples * X.shape[0] samples.

max_features : int or float, optional (default=1.0)

The number of features to draw from X to train each base estimator.

If int, then draw max_features features.
If float, then draw max_features * X.shape[1] features.

bootstrap : boolean, optional (default=True)

Whether samples are drawn with replacement.

bootstrap_features : boolean, optional (default=False)

Whether features are drawn with replacement.

oob_score : bool, optional (default=False)

Whether to use out-of-bag samples to estimate the generalization error.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See the Glossary.

New in version 0.17: warm_start constructor parameter.

n_jobs : int or None, optional (default=None)

The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity when fitting and predicting.

max_samples: the number of samples drawn for each sub-model

bootstrap: True means sampling with replacement

oob_score: whether to use the out-of-bag samples for validation

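from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# X, y: the training features and labels, assumed to be defined already
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                                n_estimators=500, max_samples=100,
                                bootstrap=True, oob_score=True,
                                n_jobs=-1)
bagging_clf.fit(X, y)
bagging_clf.oob_score_  # score measured on the out-of-bag samples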

Random Forest

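With the ensemble concepts above in hand, random forests are easy to understand. A random forest is simply a model obtained by ensembling many decision trees.

Later articles will cover random forests in more detail.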

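Articles in the Machine Learning Notes series

Machine Learning Notes (1): Decision Trees

Machine Learning Notes (2): Ensemble Learning, Prerequisites for Random Forest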
