Machine Learning and Classification Algorithms
To try out the classification algorithms we use the classic iris dataset. First we import the dataset; here we assume the relevant preprocessing (cleaning, deduplication, imputing missing values) and normalization have already been performed.
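The original session does not show that preprocessing. As an illustration of the normalization step alone, here is a minimal sketch using StandardScaler; this is an assumption on my part, the actual pipeline is unspecified, and the session below continues on the raw features:

>>> from sklearn.datasets import load_iris
>>> from sklearn.preprocessing import StandardScaler
>>> raw = load_iris()['data']
>>> x_std = StandardScaler().fit_transform(raw)  # standardize each feature to zero mean, unit variance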
Next we split the dataset into a training set and a test set for model evaluation. (Strictly speaking this single split is hold-out validation rather than cross-validation; a cross-validation sketch follows the kNN accuracy check below.)
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> data = load_iris()
>>> x = data['data']
>>> y = data['target']
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True)
First we try the kNN algorithm:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> k = 3
>>> clf = KNeighborsClassifier(n_neighbors=k)
>>> clf.fit(x_train, y_train)
Next we check its accuracy:
>>> clf.score(x_train, y_train)
0.9910714285714286
>>> clf.score(x_test, y_test)
0.8947368421052632
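As noted above, a single train_test_split is only a hold-out split. For a more stable accuracy estimate, here is a minimal 5-fold cross-validation sketch for the same kNN model; this is my addition, not part of the original session, and the scores vary with the folds:

>>> from sklearn.model_selection import cross_val_score
>>> cv_scores = cross_val_score(clf, x, y, cv=5)  # 5-fold cross-validation on the full dataset
>>> cv_scores.mean()  # mean accuracy across the five folds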
We take out one sample and inspect it:
>>> x_sample = x[0]
>>> x_sample
array([5.1, 3.5, 1.4, 0.2])
>>> y_sample = y[0]
>>> y_sample
0
>>> y_pred = clf.predict(x_sample.reshape(-1, 4))  # input in the same format as during training
>>> y_pred
array([0])
>>> neighbors = clf.kneighbors(x_sample.reshape(-1, 4), return_distance=False)
>>> neighbors
array([[ 70, 106,  40]])
>>> y[neighbors]
array([[1, 2, 0]])
We passed in a single sample and the prediction is correct. The neighbor labels, however, look inconsistent with that prediction; this is because kneighbors returns row indices into the training data passed to fit (x_train), so indexing the full y array as above retrieves the wrong labels. Note also that the gap between the training and test scores is about ten percentage points, which hints at overfitting.
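A sketch of the corrected neighbor lookup, with the indices applied to y_train rather than y (no output shown, since it depends on the random split):

>>> neighbors = clf.kneighbors(x_sample.reshape(-1, 4), return_distance=False)
>>> y_train[neighbors]  # labels of the 3 nearest neighbors, taken from the training targets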
Next we examine how a decision tree performs:
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(min_samples_split=3)
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
1.0
>>> clf.score(x_test, y_test)
0.8947368421052632
>>> x_sample = x[70]
>>> y_sample = y[70]
>>> y_sample
1
>>> clf.predict(x_sample.reshape(-1, 4))
array([2])
The test score is still on the low side while the training score is perfect, which indicates overfitting. For a decision tree the remedy is pruning, i.e. constraining the depth of the tree.
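As one way to prune, here is a minimal sketch that caps the tree via max_depth and min_samples_leaf; the values are illustrative, not tuned, and newer scikit-learn versions also offer cost-complexity pruning via ccp_alpha:

>>> clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=2)  # constrain depth to curb overfitting
>>> clf.fit(x_train, y_train)
>>> clf.score(x_test, y_test)  # compare against the unconstrained tree above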
Next, let's try a random forest:
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.metrics import classification_report, accuracy_score
>>> estimator = RandomForestClassifier(n_estimators=100)
>>> estimator.fit(x_train, y_train)
>>> x_pred = estimator.predict(x_train)
>>> x_score = accuracy_score(y_train, x_pred)  # classification accuracy, equivalent to the score method
>>> x_score
1.0
>>> test_pred = estimator.predict(x_test)
>>> test_score = accuracy_score(y_test, test_pred)
>>> test_score
0.8947368421052632
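A fitted random forest also exposes feature_importances_, which helps show which measurements drive the classification. A short sketch reusing the estimator above (output omitted, as it varies with the split):

>>> for name, imp in zip(data['feature_names'], estimator.feature_importances_):
...     print(name, round(imp, 3))  # impurity-based importance of each iris feature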
Next we use a support vector machine and look at the classification results:
>>> from sklearn.svm import SVC
>>> clf = SVC(C=1.0, kernel='linear')  # note: the parameter is an upper-case C
>>> clf.fit(x_train, y_train)
>>> clf.score(x_test, y_test)
0.9473684210526315
The support vector machine clearly outperforms the previous approaches.
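Whether kernel='linear' and C=1.0 are actually the best settings can be checked with a small grid search. A sketch using GridSearchCV, with an illustrative parameter grid:

>>> from sklearn.model_selection import GridSearchCV
>>> grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5)
>>> grid.fit(x_train, y_train)
>>> grid.best_params_  # best combination found by 5-fold cross-validation
>>> grid.best_score_   # its mean cross-validated accuracy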
Now let's see how naive Bayes does:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB(alpha=0.0001)
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
0.8214285714285714
>>> clf.score(x_test, y_test)
0.7105263157894737
Multinomial naive Bayes turns out to perform rather poorly here, which is expected: the multinomial model assumes count-like features, whereas the iris measurements are continuous. Using a Gaussian distribution instead:
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
0.9642857142857143
>>> clf.score(x_test, y_test)
0.8947368421052632
>>> pred = clf.predict(x_test)
>>> print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.88      0.88      0.88        16
           2       0.83      0.83      0.83        12

   micro avg       0.89      0.89      0.89        38
   macro avg       0.90      0.90      0.90        38
weighted avg       0.89      0.89      0.89        38
The accuracy improves again over the multinomial variant.
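To summarize, here is a compact sketch that refits every classifier from this section on the same split and prints the test accuracies side by side (results vary with the random split, so none are shown):

>>> models = [KNeighborsClassifier(n_neighbors=3),
...           DecisionTreeClassifier(min_samples_split=3),
...           RandomForestClassifier(n_estimators=100),
...           SVC(C=1.0, kernel='linear'),
...           GaussianNB()]
>>> for m in models:
...     m.fit(x_train, y_train)
...     print(type(m).__name__, m.score(x_test, y_test))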