[Translated] Machine Learning in Practice: Random Forests
I recently came across William Koehrsen on Medium and found that he has shared dozens of high-quality Python data analysis articles. I plan to translate his articles as time allows and share them with everyone.
Author: William Koehrsen
Title: "Random Forest Simple Explanation — Understanding the random forest with an intuitive example"
Translator: 大鄧
Yesterday I shared a five-minute introduction to random forests; today we will walk through a small example of how to implement one in Python.
The task
Random forests are a supervised learning method: training a model requires both a feature matrix X and a target vector. This article uses data from the NOAA climate site for Seattle, where the target (the dependent variable: the actual temperature) is a continuous value.
The data
This article uses a csv file from the NOAA climate site for Seattle. The csv has 9 fields:
- year: 2016 for all records
- month: month of the year
- day: day of the year
- week: day of the week
- temp_2: the max temperature 2 days before this record
- temp_1: the max temperature 1 day before this record
- average: the historical average max temperature for this date
- actual: the actual max temperature that day
- friend: a friend's prediction
Workflow
Before we start coding, we should lay out a short plan of action to keep ourselves on track. Once we have a problem and a model in mind, the following steps form the backbone of any machine learning workflow:
- Get the data
- Prepare the data for the model
- Build a baseline model
- Train the model on the training data
- Make predictions on the test data
- Evaluate how well the trained model performs
Get the data
import pandas as pd

features = pd.read_csv('temps.csv')
features.head(5)
One-hot encoding
The week column is text with 7 possible values, so here we one-hot encode it. In truth this column contributes little to the model, but it is a good opportunity to learn some pandas along the way.
Before one-hot encoding:
After one-hot encoding:
features = pd.get_dummies(features)
features.head(5)
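As a minimal sketch of what get_dummies does to a text column (using a made-up toy frame, not the NOAA data):

```python
import pandas as pd

# Toy frame standing in for the real data's text-valued "week" column
toy = pd.DataFrame({'week': ['Mon', 'Tues', 'Mon'], 'temp_1': [45, 46, 44]})

# get_dummies replaces each text column with one 0/1 column per category;
# numeric columns such as temp_1 pass through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # temp_1 plus one column per weekday seen
```

Each original text value becomes its own indicator column (week_Mon, week_Tues, ...), which is why the real data gains seven week_* columns after encoding.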
Feature matrix and target vector
# target vector (the dependent variable)
targets = features['actual']

# remove the actual column from the feature matrix
# axis=1 means drop along the column axis
features = features.drop('actual', axis=1)

# list of feature names
feature_list = list(features.columns)
Split the data into training and test sets
from sklearn.model_selection import train_test_split

train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, random_state=42)
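A small sketch (on toy arrays, not the weather data) of what test_size=0.25 and random_state do:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and target vector
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.25 holds out roughly a quarter of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))
```

Rerunning with the same random_state always yields the same split, which keeps results comparable across experiments.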
Build a baseline model
To judge whether our trained model is any good, we need a reference baseline. Here we treat the historical average column as the baseline prediction and see whether the trained random forest can beat it.
import numpy as np

# select all rows of test_features, column 'average'
baseline_preds = test_features.loc[:, 'average']
baseline_errors = abs(baseline_preds - test_targets)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))
Output
Average baseline error: 5.06
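The baseline metric above is the mean absolute error. A worked example with made-up numbers shows the arithmetic:

```python
import numpy as np

# Made-up example: three historical-average "predictions" vs. actual temps
baseline_preds = np.array([50.0, 60.0, 70.0])
actual = np.array([52.0, 57.0, 71.0])

# Mean absolute error: average of |prediction - actual|
errors = np.abs(baseline_preds - actual)
mae = np.mean(errors)
print(round(mae, 2))  # (2 + 3 + 1) / 3 = 2.0
```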
Train the random forest
from sklearn.ensemble import RandomForestRegressor

# 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_targets)
Output
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1, oob_score=False, random_state=42, verbose=0, warm_start=False)
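A random forest regressor predicts by averaging its trees. As a sketch on synthetic data (not the weather set), the forest's prediction should match the plain mean of the individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem
rng = np.random.RandomState(0)
X = rng.rand(80, 3)
y = X[:, 0] * 10 + rng.rand(80)

rf = RandomForestRegressor(n_estimators=20, random_state=42)
rf.fit(X, y)

# Each tree was fit on a bootstrap sample of the rows; the forest's
# prediction is the average of the 20 trees' predictions
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
print(np.allclose(rf.predict(X), tree_preds.mean(axis=0)))  # True
```

This averaging is what smooths out the high variance of any single deep decision tree.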
Evaluate the trained model
predictions = rf.predict(test_features)
errors = abs(predictions - test_targets)
print('Mean absolute error:', round(np.mean(errors), 2))
Output
Mean absolute error: 3.87
Accuracy
# compute the mean absolute percentage error (MAPE)
mape = 100 * (errors / test_targets)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.94 %.
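A worked MAPE example with made-up numbers clarifies the formula: each absolute error is expressed as a percentage of the actual value, and "accuracy" is 100 minus the average of those percentages.

```python
import numpy as np

# Made-up example: forest predictions vs. actual temps
predictions = np.array([48.0, 55.0, 60.0])
actual = np.array([50.0, 55.0, 64.0])

errors = np.abs(predictions - actual)
# Percent errors: 2/50 -> 4%, 0/55 -> 0%, 4/64 -> 6.25%
mape = 100 * (errors / actual)
accuracy = 100 - np.mean(mape)
print(round(accuracy, 2))  # 100 - (4 + 0 + 6.25)/3 = 96.58
```

Note that MAPE only behaves well when the actual values stay far from zero, which is true for these temperatures.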
Visualizing a decision tree
The model contains 1000 decision trees; here I will pick one of them to visualize. Note that the visualization step failed for me on Python 3.7 but worked on 3.6.
print('The model contains', len(rf.estimators_), 'decision trees')
Output
The model contains 1000 decision trees
Inspect the first 5 trees in the model
# take the first 5 of the 1000 trees
rf.estimators_[:5]
Output
[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1608637542, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1273642419, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1935803228, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=787846414, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=996406378, splitter='best')]
In this article we pick just one of the trees to visualize.
from sklearn.tree import export_graphviz
import pydot

# out of the 1000 trees, let's take the sixth one (index 5)
tree = rf.estimators_[5]

# export the tree to a dot file
export_graphviz(tree, out_file='tree.dot', feature_names=feature_list,
                rounded=True, precision=1)

# parse the dot file into a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')

# write the graph out as a png image
graph.write_png('tree.png')
print('The max depth (number of levels) of this tree is:', tree.tree_.max_depth)
Output
The max depth (number of levels) of this tree is: 13
A tree that deep is too complex to read. Let's simplify it by training a forest with max_depth=3.
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(train_features, train_targets)

tree_small = rf_small.estimators_[5]
export_graphviz(tree_small, out_file='small_tree.dot', feature_names=feature_list,
                rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
Feature importances
# get the feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]

# sort from most to least important
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# print each feature with its importance
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Output
Variable: temp_1               Importance: 0.66
Variable: average              Importance: 0.15
Variable: forecast_noaa        Importance: 0.05
Variable: forecast_acc         Importance: 0.03
Variable: day                  Importance: 0.02
Variable: temp_2               Importance: 0.02
Variable: forecast_under       Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
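sklearn normalizes feature_importances_ to sum to 1, so the values above can be read as each feature's share of total importance. A sketch (on synthetic data, not the weather set) of checking that and of keeping only the features that cover 90% of cumulative importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where feature 0 dominates the target
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 5 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(200)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

importances = rf.feature_importances_
print(round(importances.sum(), 6))  # normalized, so ~1.0

# Sort descending and keep features until 90% of importance is covered
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])
top = order[:np.searchsorted(cumulative, 0.90) + 1]
print(top)
```

This kind of cutoff is a common way to drop the near-zero features (like the week_* dummies above) before retraining a leaner model.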
Visualizing feature importances
import matplotlib.pyplot as plt
%matplotlib inline

# set the plot style
plt.style.use('fivethirtyeight')

# x locations for the bars
x_values = list(range(len(importances)))

# draw the bar chart
plt.bar(x_values, importances, orientation='vertical')

# tick labels for the x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
(If you've read this far, please give 大鄧 a like to support his writing O(∩_∩)O~)
Featured posts
Deep learning: LSTM, illustrated
PyTorch in practice: classifying photos with a convolutional neural network