[Translated] Machine Learning in Practice: Random Forests
I recently came across William Koehrsen on Medium and found that he has shared dozens of high-quality Python data analysis articles. I plan to translate his articles as time allows and share them with everyone.
Author: William Koehrsen
Title: "Random Forest Simple Explanation — Understanding the random forest with an intuitive example"
Translator: 大鄧
Yesterday I shared a five-minute introduction to random forests; today we will walk through a small example of how to implement one in Python.
The task
Random forests are a supervised learning method: training a model requires both a feature matrix X and a target vector. This article uses data from the NOAA climate site for Seattle, where the target (the dependent variable: the actual temperature) is a continuous value.
The data
This article uses a csv file from the NOAA climate site for Seattle. The csv has 9 fields:
- year: 2016 for all records
- month: month of the year
- day: day of the year
- week: day of the week
- temp_2: the max temperature 2 days before this record
- temp_1: the max temperature 1 day before this record
- average: the historical average max temperature for this date
- actual: the actual max temperature that day
- friend: a friend's prediction
Workflow
Before we start coding, we should lay out a short plan of action to keep ourselves on track. Once we have a problem and a model in mind, the following steps form the backbone of any machine learning workflow:
- Get the data
- Prepare the data for the model
- Build a baseline model
- Train the model on the training data
- Make predictions on the test data
- Evaluate how well the trained model performs
Get the data
import pandas as pd

features = pd.read_csv('temps.csv')
features.head(5)
One-hot encoding
The week column is text with 7 possible values, so here we one-hot encode it. In truth this column contributes little to the model, but it is a good opportunity to learn some pandas along the way.
Before one-hot encoding:
After one-hot encoding:
features = pd.get_dummies(features)
features.head(5)
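As a minimal sketch of what get_dummies does to a text column (using a made-up toy frame, not the NOAA data):

```python
import pandas as pd

# Toy frame standing in for the real data's text-valued "week" column
toy = pd.DataFrame({'week': ['Mon', 'Tues', 'Mon'], 'temp_1': [45, 46, 44]})

# get_dummies replaces each text column with one 0/1 column per category;
# numeric columns such as temp_1 pass through unchanged
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # temp_1 plus one column per weekday seen
```

Each original text value becomes its own indicator column (week_Mon, week_Tues, ...), which is why the real data gains seven week_* columns after encoding.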
Feature matrix and target vector
# target vector (the dependent variable)
targets = features['actual']

# remove the actual column from the feature matrix
# axis=1 means drop along the column axis
features = features.drop('actual', axis=1)

# list of feature names
feature_list = list(features.columns)
Split the data into training and test sets
from sklearn.model_selection import train_test_split

train_features, test_features, train_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, random_state=42)
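A small sketch (on toy arrays, not the weather data) of what test_size=0.25 and random_state do:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and target vector
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.25 holds out roughly a quarter of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))
```

Rerunning with the same random_state always yields the same split, which keeps results comparable across experiments.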
Build a baseline model
To judge whether our trained model is any good, we need a reference baseline. Here we treat the historical average column as the baseline prediction and see whether the trained random forest can beat it.
import numpy as np

# select all rows of test_features, column 'average'
baseline_preds = test_features.loc[:, 'average']
baseline_errors = abs(baseline_preds - test_targets)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))
Output
Average baseline error: 5.06
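The baseline metric above is the mean absolute error. A worked example with made-up numbers shows the arithmetic:

```python
import numpy as np

# Made-up example: three historical-average "predictions" vs. actual temps
baseline_preds = np.array([50.0, 60.0, 70.0])
actual = np.array([52.0, 57.0, 71.0])

# Mean absolute error: average of |prediction - actual|
errors = np.abs(baseline_preds - actual)
mae = np.mean(errors)
print(round(mae, 2))  # (2 + 3 + 1) / 3 = 2.0
```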
Train the random forest
from sklearn.ensemble import RandomForestRegressor

# 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_targets)
Output
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1, oob_score=False, random_state=42, verbose=0, warm_start=False)
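A random forest regressor predicts by averaging its trees. As a sketch on synthetic data (not the weather set), the forest's prediction should match the plain mean of the individual trees' predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem
rng = np.random.RandomState(0)
X = rng.rand(80, 3)
y = X[:, 0] * 10 + rng.rand(80)

rf = RandomForestRegressor(n_estimators=20, random_state=42)
rf.fit(X, y)

# Each tree was fit on a bootstrap sample of the rows; the forest's
# prediction is the average of the 20 trees' predictions
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
print(np.allclose(rf.predict(X), tree_preds.mean(axis=0)))  # True
```

This averaging is what smooths out the high variance of any single deep decision tree.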
Evaluate the trained model
predictions = rf.predict(test_features)
errors = abs(predictions - test_targets)
print('Mean absolute error:', round(np.mean(errors), 2))
Output
Mean absolute error: 3.87
Accuracy
# compute the mean absolute percentage error (MAPE)
mape = 100 * (errors / test_targets)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Accuracy: 93.94 %.
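A worked MAPE example with made-up numbers clarifies the formula: each absolute error is expressed as a percentage of the actual value, and "accuracy" is 100 minus the average of those percentages.

```python
import numpy as np

# Made-up example: forest predictions vs. actual temps
predictions = np.array([48.0, 55.0, 60.0])
actual = np.array([50.0, 55.0, 64.0])

errors = np.abs(predictions - actual)
# Percent errors: 2/50 -> 4%, 0/55 -> 0%, 4/64 -> 6.25%
mape = 100 * (errors / actual)
accuracy = 100 - np.mean(mape)
print(round(accuracy, 2))  # 100 - (4 + 0 + 6.25)/3 = 96.58
```

Note that MAPE only behaves well when the actual values stay far from zero, which is true for these temperatures.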
Visualizing a decision tree
The model contains 1000 decision trees; here I will pick one of them to visualize. Note that the visualization step failed for me on Python 3.7 but worked on 3.6.
print('The model contains', len(rf.estimators_), 'decision trees')
Output
The model contains 1000 decision trees
Inspect the first 5 trees in the model
# take the first 5 of the 1000 trees
rf.estimators_[:5]
Output
[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1608637542, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1273642419, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1935803228, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=787846414, splitter='best'), DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=996406378, splitter='best')]
In this article we pick just one of the trees to visualize.
from sklearn.tree import export_graphviz
import pydot

# out of the 1000 trees, let's take the sixth one (index 5)
tree = rf.estimators_[5]

# export the tree to a dot file
export_graphviz(tree, out_file='tree.dot', feature_names=feature_list,
                rounded=True, precision=1)

# parse the dot file into a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')

# write the graph out as a png image
graph.write_png('tree.png')
print('The max depth (number of levels) of this tree is:', tree.tree_.max_depth)
Output
The max depth (number of levels) of this tree is: 13
A tree that deep is too complex to read. Let's simplify it by training a forest with max_depth=3.
rf_small = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
rf_small.fit(train_features, train_targets)

tree_small = rf_small.estimators_[5]
export_graphviz(tree_small, out_file='small_tree.dot', feature_names=feature_list,
                rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')
Feature importances
# get the feature importances
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]

# sort from most to least important
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# print each feature with its importance
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Output
Variable: temp_1               Importance: 0.66
Variable: average              Importance: 0.15
Variable: forecast_noaa        Importance: 0.05
Variable: forecast_acc         Importance: 0.03
Variable: day                  Importance: 0.02
Variable: temp_2               Importance: 0.02
Variable: forecast_under       Importance: 0.02
Variable: friend               Importance: 0.02
Variable: month                Importance: 0.01
Variable: year                 Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0
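sklearn normalizes feature_importances_ to sum to 1, so the values above can be read as each feature's share of total importance. A sketch (on synthetic data, not the weather set) of checking that and of keeping only the features that cover 90% of cumulative importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where feature 0 dominates the target
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 5 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(200)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

importances = rf.feature_importances_
print(round(importances.sum(), 6))  # normalized, so ~1.0

# Sort descending and keep features until 90% of importance is covered
order = np.argsort(importances)[::-1]
cumulative = np.cumsum(importances[order])
top = order[:np.searchsorted(cumulative, 0.90) + 1]
print(top)
```

This kind of cutoff is a common way to drop the near-zero features (like the week_* dummies above) before retraining a leaner model.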
Visualizing feature importances
import matplotlib.pyplot as plt
%matplotlib inline

# set the plot style
plt.style.use('fivethirtyeight')

# x locations for the bars
x_values = list(range(len(importances)))

# draw the bar chart
plt.bar(x_values, importances, orientation='vertical')

# tick labels for the x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
(If you've read this far, please give 大鄧 a like to support his writing O(∩_∩)O~)
Featured posts
Deep learning: LSTM, illustrated
PyTorch in practice: classifying photos with a convolutional neural network