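Python: Analyzing Whether to Pursue Graduate Study Based on Scores
Case: this dataset records the scores of individual students. Below we analyze it to judge whether a student is suited to going on to graduate study.
Dataset features
(1) GRE score (290 to 340)
(2) TOEFL score (92 to 120)
(3) University rating (1 to 5)
(4) SOP, the statement-of-purpose strength, read in this post as the student's own motivation (1 to 5)
(5) LOR, the strength of the recommendation letter (1 to 5)
(6) CGPA (6.8 to 9.92)
(7) Research experience (0 or 1)
(8) Chance of Admit, read in this post as the likelihood of going on to a master's degree (0.34 to 0.97)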
1. Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os, sys
2. Load and inspect the dataset
df = pd.read_csv("D:\\machine-learning\\score\\Admission_Predict.csv", sep=",")
print('There are', len(df.columns), 'columns')
for c in df.columns:
    sys.stdout.write(str(c) + ', ')
There are 9 columns
Serial No., GRE Score, TOEFL Score, University Rating, SOP, LOR , CGPA, Research, Chance of Admit ,
The dataset has 9 columns in total.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
Serial No.           400 non-null int64
GRE Score            400 non-null int64
TOEFL Score          400 non-null int64
University Rating    400 non-null int64
SOP                  400 non-null float64
LOR                  400 non-null float64
CGPA                 400 non-null float64
Research             400 non-null int64
Chance of Admit      400 non-null float64
dtypes: float64(4), int64(5)
memory usage: 28.2 KB
Dataset summary:
1. There are 9 columns: serial number, GRE score, TOEFL score, university rating, SOP, LOR, CGPA, research experience, and chance of admit.
2. The dataset contains no null values.
3. There are 400 rows in total.
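The "no null values" point can also be checked directly rather than read off the info() listing. A minimal sketch on a made-up frame (the values below are illustrative and are not taken from Admission_Predict.csv):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one deliberately missing GRE score.
toy = pd.DataFrame({
    "GRE Score": [320, np.nan, 300],
    "CGPA": [8.5, 9.0, 7.9],
})

# Number of missing values per column; all zeros means no nulls.
missing = toy.isnull().sum()
print(missing)
```

On the real df, the same call should report 0 for every column if the summary above holds.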
# Tidy the column name (remove the trailing space)
df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})
# Show the first 5 rows
df.head()
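3. Correlation between the features
fig,ax = plt.subplots(figsize=(10,10))
sns.heatmap(df.corr(),ax=ax,annot=True,linewidths=0.05,fmt='.2f',cmap='magma')
plt.show()
Conclusions:
1. The features most strongly correlated with Chance of Admit are the GRE, CGPA, and TOEFL scores.
2. LOR, SOP, and Research have a comparatively smaller influence.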
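To read this ranking as numbers rather than from the heatmap colors, one option is to sort the target column of the correlation matrix. The sketch below uses a tiny made-up frame; only the column names follow the dataset, the values are assumptions for illustration.

```python
import pandas as pd

# Illustrative values only; column names mirror the dataset.
toy = pd.DataFrame({
    "GRE Score":       [300, 310, 320, 330, 340],
    "Research":        [0, 1, 0, 1, 1],
    "Chance of Admit": [0.50, 0.60, 0.72, 0.85, 0.95],
})

# Pearson correlation of each feature with the target, strongest first.
ranking = (
    toy.corr()["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)
print(ranking)
```

Applied to the real df, the same three lines would list all features in order of their correlation with Chance of Admit.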
4. Data visualization and bivariate analysis
4.1 Number of candidates with research experience
print("Not Having Research:",len(df[df.Research == 0]))
print("Having Research:",len(df[df.Research == 1]))
y = np.array([len(df[df.Research == 0]),len(df[df.Research == 1])])
x = np.arange(2)
plt.bar(x,y)
plt.title("Research Experience")
plt.xlabel("Candidates")
plt.ylabel("Frequency")
plt.xticks(x,('Not having research','Having research'))
plt.show()
Conclusion: 219 candidates have research experience, and 181 did no research as undergraduates.
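The two counts can also be obtained in one call with value_counts(). A small sketch on made-up 0/1 flags (not the real Research column):

```python
import pandas as pd

# Illustrative 0/1 flags standing in for the Research column.
research = pd.Series([1, 0, 1, 1, 0, 1], name="Research")

# Count of each value, ordered by the value itself (0 first, then 1).
counts = research.value_counts().sort_index()
print(counts)

# The mean of a 0/1 column is the share of candidates with research.
share = research.mean()
print(share)
```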
4.2 TOEFL scores
y = np.array([df['TOEFL Score'].min(),df['TOEFL Score'].mean(),df['TOEFL Score'].max()])
x = np.arange(3)
plt.bar(x,y)
plt.title('TOEFL Score')
plt.xlabel('Level')
plt.ylabel('TOEFL Score')
plt.xticks(x,('Worst','Average','Best'))
plt.show()
Conclusion: the lowest score is 92 and the highest is the full mark of 120, so the candidates' English is quite good.
4.3 GRE scores
df['GRE Score'].plot(kind='hist',bins=200,figsize=(6,6))
plt.title('GRE Score')
plt.xlabel('GRE Score')
plt.ylabel('Frequency')
plt.show()
Conclusion: most students score between 310 and 330.
4.4 CGPA versus university rating
plt.scatter(df['University Rating'],df['CGPA'])
plt.title('CGPA Scores for University ratings')
plt.xlabel('University Rating')
plt.ylabel('CGPA')
plt.show()
Conclusion: the better the university, the higher its students' CGPA tends to be.
4.5 GRE score versus CGPA
plt.scatter(df['GRE Score'],df['CGPA'])
plt.title('CGPA for GRE Scores')
plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.show()
Conclusion: the higher the CGPA, the higher the GRE score; the two are strongly correlated.
4.6 TOEFL score versus GRE score
df[df['CGPA']>=8.5].plot(kind='scatter',x='GRE Score',y='TOEFL Score',color='red')
plt.xlabel('GRE Score')
plt.ylabel('TOEFL Score')
plt.title('CGPA >= 8.5')
plt.grid(True)
plt.show()
Conclusion: in most cases GRE and TOEFL scores are positively correlated, but a high GRE score does not guarantee a high TOEFL score.
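4.7 University rating versus going on to a master's degree
s = df[df['Chance of Admit'] >= 0.75]['University Rating'].value_counts().head(5)
plt.title('University Ratings of Candidates with a 75% acceptance chance')
s.plot(kind='bar',figsize=(20,10),cmap='Pastel1')
plt.xlabel('University Rating')
plt.ylabel('Candidates')
plt.show()
Conclusion: students from higher-rated universities are more likely to go on to further study.
4.8 SOP versus CGPA
plt.scatter(df['CGPA'],df['SOP'])
plt.xlabel('CGPA')
plt.ylabel('SOP')
plt.title('SOP for CGPA')
plt.show()
Conclusion: students with a high CGPA have a higher SOP, read here as a stronger personal motivation to pursue a master's degree.
4.9 SOP versus GRE score
plt.scatter(df['GRE Score'],df['SOP'])
plt.xlabel('GRE Score')
plt.ylabel('SOP')
plt.title('SOP for GRE Score')
plt.show()
Conclusion: students with a strong motivation to pursue a master's degree tend to have higher GRE scores.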
5. Models
5.1 Prepare the dataset
# Load the dataset
df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')
serialNO = df['Serial No.'].values
df.drop(['Serial No.'],axis=1,inplace=True)
df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})
# Split into features and target, then into training and test sets
y = df['Chance of Admit'].values
x = df.drop(['Chance of Admit'],axis=1)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
# Scale every feature to [0, 1] (feature_range must be a tuple in recent scikit-learn)
from sklearn.preprocessing import MinMaxScaler
scaleX = MinMaxScaler(feature_range=(0,1))
x_train[x_train.columns] = scaleX.fit_transform(x_train[x_train.columns])
# The test set is rescaled with its own min/max here;
# scaleX.transform would reuse the training statistics instead.
x_test[x_test.columns] = scaleX.fit_transform(x_test[x_test.columns])
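The preparation code calls fit_transform on both splits, so the test set is scaled by its own minimum and maximum. A common alternative is to fit the scaler on the training split only and reuse it for the test split. The sketch below shows the difference on made-up numbers standing in for a single feature; it is an illustration of that alternative, not the code used for the results in this post.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative numbers standing in for one feature column.
train = np.array([[290.0], [310.0], [340.0]])
test = np.array([[300.0], [345.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min=290, max=340
test_scaled = scaler.transform(test)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach a test value above the training maximum maps to a number slightly above 1, which is expected: the test set is measured on the training scale.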
5.2 Regression
5.2.1 Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
y_head_lr = lr.predict(x_test)
# Compare two individual predictions with their true values
print('Real value of y_test[1]: '+str(y_test[1]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: '+str(y_test[2]) + ' -> predict value: ' + str(lr.predict(x_test.iloc[[2],:])))
# R^2 on the test set and on the training set
from sklearn.metrics import r2_score
print('r_square score: ',r2_score(y_test,y_head_lr))
y_head_lr_train = lr.predict(x_train)
print('r_square score(train data):',r2_score(y_train,y_head_lr_train))
5.2.2 Random forest regression
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100,random_state=42)
rfr.fit(x_train,y_train)
y_head_rfr = rfr.predict(x_test)
# Compare two individual predictions with their true values
print('Real value of y_test[1]: '+str(y_test[1]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: '+str(y_test[2]) + ' -> predict value: ' + str(rfr.predict(x_test.iloc[[2],:])))
# R^2 on the test set and on the training set
from sklearn.metrics import r2_score
print('r_square score: ',r2_score(y_test,y_head_rfr))
y_head_rfr_train = rfr.predict(x_train)
print('r_square score(train data):',r2_score(y_train,y_head_rfr_train))
5.2.3 Decision tree regression
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(x_train,y_train)
y_head_dt = dt.predict(x_test)
# Compare two individual predictions with their true values
print('Real value of y_test[1]: '+str(y_test[1]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[1],:])))
print('Real value of y_test[2]: '+str(y_test[2]) + ' -> predict value: ' + str(dt.predict(x_test.iloc[[2],:])))
# R^2 on the test set and on the training set
from sklearn.metrics import r2_score
print('r_square score: ',r2_score(y_test,y_head_dt))
y_head_dt_train = dt.predict(x_train)
print('r_square score(train data):',r2_score(y_train,y_head_dt_train))
5.2.4 Comparing the three regression methods
y = np.array([r2_score(y_test,y_head_lr),r2_score(y_test,y_head_rfr),r2_score(y_test,y_head_dt)])
x = np.arange(3)
plt.bar(x,y)
plt.title('Comparison of Regression Algorithms')
plt.xlabel('Regression')
plt.ylabel('r2_score')
plt.xticks(x,("LinearRegression","RandomForestReg.","DecisionTreeReg."))
plt.show()
Conclusion: among the regression algorithms, linear regression achieves the best R^2 score on the test set.
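The comparison above rests on a single 80/20 split. One way to make such a comparison less dependent on the split is k-fold cross-validation. The sketch below uses synthetic data from make_regression as a stand-in for the admissions features, so the printed numbers say nothing about the real dataset; the model settings are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 7 features and the target.
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestReg.": RandomForestRegressor(n_estimators=50, random_state=42),
    "DecisionTreeReg.": DecisionTreeRegressor(random_state=42),
}

# Mean R^2 over 5 folds for each model.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in cv_r2.items():
    print(f"{name}: {score:.3f}")
```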
5.2.5 Comparing the three regressors with the actual values
# Plot every 5th test sample: model predictions against the real value
red = plt.scatter(np.arange(0,80,5),y_head_lr[0:80:5],color='red')
blue = plt.scatter(np.arange(0,80,5),y_head_rfr[0:80:5],color='blue')
green = plt.scatter(np.arange(0,80,5),y_head_dt[0:80:5],color='green')
black = plt.scatter(np.arange(0,80,5),y_test[0:80:5],color='black')
plt.title('Comparison of Regression Algorithms')
plt.xlabel('Index of candidate')
plt.ylabel('Chance of admit')
plt.legend([red,blue,green,black],['LR','RFR','DT','REAL'])
plt.show()
Conclusion: about 70% of the candidates in the dataset are likely to go on to a master's degree. The plot also shows that some points are still not predicted well.
5.3 Classification algorithms
5.3.1 Prepare the data
df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')
SerialNO = df['Serial No.'].values
df.drop(['Serial No.'],axis=1,inplace=True)
df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})
y = df['Chance of Admit'].values
x = df.drop(['Chance of Admit'],axis=1)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
# Scale every feature to [0, 1] (feature_range must be a tuple in recent scikit-learn)
from sklearn.preprocessing import MinMaxScaler
scaleX = MinMaxScaler(feature_range=(0,1))
x_train[x_train.columns] = scaleX.fit_transform(x_train[x_train.columns])
# As in 5.1, the test set is rescaled with its own min/max here
x_test[x_test.columns] = scaleX.fit_transform(x_test[x_test.columns])
# If chance > 0.8 the label is 1, otherwise 0
y_train_01 = [1 if each > 0.8 else 0 for each in y_train]
y_test_01 = [1 if each > 0.8 else 0 for each in y_test]
y_train_01 = np.array(y_train_01)
y_test_01 = np.array(y_test_01)
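The 0.8 threshold turns the continuous target into a binary label. The list comprehension used for this can also be written as a vectorized NumPy comparison, which gives the same labels. A small sketch on made-up chances:

```python
import numpy as np

y = np.array([0.34, 0.80, 0.81, 0.97])  # illustrative chances

labels_loop = np.array([1 if each > 0.8 else 0 for each in y])
labels_vec = (y > 0.8).astype(int)       # same rule, vectorized
print(labels_vec)  # [0 0 1 1]
```

Note that a chance of exactly 0.8 falls into class 0, because the comparison is strict.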
5.3.2 Logistic regression
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression()
lrc.fit(x_train,y_train_01)
print('score: ',lrc.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(lrc.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_lrc = confusion_matrix(y_test_01,lrc.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_lrc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,lrc.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,lrc.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,lrc.predict(x_test)))
# Test for Train Dataset:
cm_lrc_train = confusion_matrix(y_train_01,lrc.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_lrc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, logistic regression misclassifies 23 samples on the training set, and 72 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 7 samples are misclassified.
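The "misclassified" counts in these conclusions are the off-diagonal cells of the confusion matrix, and precision and recall can be recomputed from the same four cells. A minimal sketch on made-up labels (not results from the dataset):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: real, columns: predicted
tn, fp, fn, tp = cm.ravel()

misclassified = cm.sum() - np.trace(cm)  # everything off the diagonal
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm)
print(misclassified, precision, recall)
```

Here tp is the bottom-right cell: samples that are truly class 1 and predicted as class 1.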
5.3.3 Support vector machine (SVM)
from sklearn.svm import SVC
svm = SVC(random_state=1,kernel='rbf')
svm.fit(x_train,y_train_01)
print('score: ',svm.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(svm.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_svm = confusion_matrix(y_test_01,svm.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_svm,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,svm.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,svm.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,svm.predict(x_test)))
# Test for Train Dataset:
cm_svm_train = confusion_matrix(y_train_01,svm.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_svm_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, the SVM misclassifies 22 samples on the training set, and 70 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 8 samples are misclassified.
5.3.4 Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train_01)
print('score: ',nb.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(nb.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_nb = confusion_matrix(y_test_01,nb.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_nb,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,nb.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,nb.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,nb.predict(x_test)))
# Test for Train Dataset:
cm_nb_train = confusion_matrix(y_train_01,nb.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_nb_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, naive Bayes misclassifies 20 samples on the training set, and 78 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 7 samples are misclassified.
5.3.5 Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100,random_state=1)
rfc.fit(x_train,y_train_01)
print('score: ',rfc.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(rfc.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_rfc = confusion_matrix(y_test_01,rfc.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_rfc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,rfc.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,rfc.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,rfc.predict(x_test)))
# Test for Train Dataset:
cm_rfc_train = confusion_matrix(y_train_01,rfc.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_rfc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, the random forest misclassifies 0 samples on the training set, and 88 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 5 samples are misclassified.
5.3.6 Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion='entropy',max_depth=3)
dtc.fit(x_train,y_train_01)
print('score: ',dtc.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(dtc.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_dtc = confusion_matrix(y_test_01,dtc.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_dtc,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,dtc.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,dtc.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,dtc.predict(x_test)))
# Test for Train Dataset:
cm_dtc_train = confusion_matrix(y_train_01,dtc.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_dtc_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, the decision tree misclassifies 20 samples on the training set, and 78 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 7 samples are misclassified.
5.3.7 K-nearest neighbors classifier
from sklearn.neighbors import KNeighborsClassifier
# Test accuracy for k = 1..49
scores = []
for each in range(1,50):
    knn_n = KNeighborsClassifier(n_neighbors = each)
    knn_n.fit(x_train,y_train_01)
    scores.append(knn_n.score(x_test,y_test_01))
plt.plot(range(1,50),scores)
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train,y_train_01)
print('score 7 : ',knn.score(x_test,y_test_01))
print('Real value of y_test_01[1]: '+str(y_test_01[1]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[1],:])))
print('Real value of y_test_01[2]: '+str(y_test_01[2]) + ' -> predict value: ' + str(knn.predict(x_test.iloc[[2],:])))
# Confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_knn = confusion_matrix(y_test_01,knn.predict(x_test))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_knn,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.title('Test for Test dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
from sklearn.metrics import recall_score,precision_score,f1_score
print('precision_score is : ',precision_score(y_test_01,knn.predict(x_test)))
print('recall_score is : ',recall_score(y_test_01,knn.predict(x_test)))
print('f1_score is : ',f1_score(y_test_01,knn.predict(x_test)))
# Test for Train Dataset:
cm_knn_train = confusion_matrix(y_train_01,knn.predict(x_train))
f,ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm_knn_train,annot=True,linewidths=0.5,linecolor='blue',fmt='.0f',ax=ax)
plt.title('Test for Train dataset')
plt.xlabel('predicted y values')
plt.ylabel('real y value')
plt.show()
Conclusions:
1. According to the confusion matrix, K-nearest neighbors misclassifies 22 samples on the training set, and 71 candidates are identified as wanting to go on to a master's degree.
2. On the test set, 7 samples are misclassified.
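The k for KNN is picked here by scoring each candidate on the test set. An alternative that keeps the test set untouched until the end is to choose k by cross-validation on the training split. The sketch below shows that idea on synthetic data from make_classification; the data, the range of k, and the fold count are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and 0/1 labels.
X, y = make_classification(n_samples=300, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Mean 5-fold accuracy on the training split for each odd k.
cv_scores = {}
for k in range(1, 30, 2):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_k, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Only the chosen model is scored on the held-out test split.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_knn.score(X_test, y_test)
print("best k:", best_k, "test accuracy:", test_accuracy)
```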
5.3.8 Comparing the classifiers
y = np.array([lrc.score(x_test,y_test_01),svm.score(x_test,y_test_01),nb.score(x_test,y_test_01),
              dtc.score(x_test,y_test_01),rfc.score(x_test,y_test_01),knn.score(x_test,y_test_01)])
x = np.arange(6)
plt.bar(x,y)
plt.title('Comparison of Classification Algorithms')
plt.xlabel('Classification')
plt.ylabel('Score')
plt.xticks(x,("LogisticReg.","SVM","GNB","Dec.Tree","Ran.Forest","KNN"))
plt.show()
Conclusion: random forest and Gaussian naive Bayes both reach relatively high accuracy on the test set.
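The per-classifier sections repeat the same fit, predict, and score steps. One way to shorten this is to loop over a dictionary of models and collect the metrics in one place. The sketch below does this on synthetic, mildly imbalanced data and reports F1 next to accuracy; the data and the three models shown are assumptions for illustration, not the post's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: about 30% of samples in the positive class.
X, y = make_classification(
    n_samples=400, n_features=7, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "LogisticReg.": LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
    "Dec.Tree": DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42),
}

# Fit each model once and record accuracy and F1 on the test split.
results = {}
for name, model in candidates.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Reporting F1 alongside accuracy is useful when one class is much rarer than the other, as with the "chance > 0.8" label.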
5.4 Clustering algorithms
5.4.1 Prepare the data
df = pd.read_csv('D:\\machine-learning\\score\\Admission_Predict.csv',sep=',')
df = df.rename(columns={'Chance of Admit ':'Chance of Admit'})
serialNo = df['Serial No.']
df.drop(['Serial No.'],axis=1,inplace=True)
# Column-wise min-max normalization; df.min()/df.max() work per column
# in every pandas version, whereas np.min(df) depends on the version
df = (df - df.min()) / (df.max() - df.min())
y = df['Chance of Admit']
x = df.drop(['Chance of Admit'],axis=1)
5.4.2 Dimensionality reduction
from sklearn.decomposition import PCA
# Project the 7 features onto a single principal component
pca = PCA(n_components=1,whiten=True)
pca.fit(x)
x_pca = pca.transform(x)
x_pca = x_pca.reshape(400)
dictionary = {'x':x_pca,'y':y}
data = pd.DataFrame(dictionary)
print('pca data:',data.head())
print()
print('original data:',df.head())
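When the features are reduced to one component, it helps to know how much of their variance that component keeps. PCA exposes this as explained_variance_ratio_. The sketch below uses synthetic, strongly correlated features as a stand-in, so the printed ratio does not describe the admissions data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three strongly correlated synthetic features plus a little noise.
X = np.hstack([base, 2 * base, -base]) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=1)
pca.fit(X)

# Share of the total variance captured by the first component.
ratio = pca.explained_variance_ratio_[0]
print(ratio)
```

A ratio close to 1 means little is lost by plotting the single component; a low ratio suggests the one-dimensional view hides much of the structure.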
5.4.3 K-means clustering
from sklearn.cluster import KMeans
# Elbow method: within-cluster sum of squares for k = 1..14
wcss = []
for k in range(1,15):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,15),wcss)
plt.xlabel('Kmeans')
plt.ylabel('WCSS')
plt.show()
df["Serial No."] = serialNo
kmeans = KMeans(n_clusters=3)
clusters_knn = kmeans.fit_predict(x)
df['label_kmeans'] = clusters_knn
# Clusters plotted against the candidate serial number
plt.scatter(df[df.label_kmeans == 0 ]["Serial No."],df[df.label_kmeans == 0]['Chance of Admit'],color = "red")
plt.scatter(df[df.label_kmeans == 1 ]["Serial No."],df[df.label_kmeans == 1]['Chance of Admit'],color = "blue")
plt.scatter(df[df.label_kmeans == 2 ]["Serial No."],df[df.label_kmeans == 2]['Chance of Admit'],color = "green")
plt.title("K-means Clustering")
plt.xlabel("Candidates")
plt.ylabel("Chance of Admit")
plt.show()
# Clusters plotted against the single PCA component
plt.scatter(data.x[df.label_kmeans == 0 ],data[df.label_kmeans == 0].y,color = "red")
plt.scatter(data.x[df.label_kmeans == 1 ],data[df.label_kmeans == 1].y,color = "blue")
plt.scatter(data.x[df.label_kmeans == 2 ],data[df.label_kmeans == 2].y,color = "green")
plt.title("K-means Clustering")
plt.xlabel("X")
plt.ylabel("Chance of Admit")
plt.show()
Conclusion: the dataset splits into three groups. Some students have decided to go on to a master's degree, some have given up, and some are still hesitant but fairly likely to continue.
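The elbow in the WCSS curve is read by eye. A numerical complement is the silhouette score, which is highest when clusters are compact and well separated. The sketch below applies it to synthetic blobs with three known centers; the data and parameter choices are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate k (it is undefined for k = 1).
sil = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)
print(sil)
print("best k:", best_k)
```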
5.4.4 Hierarchical clustering
from scipy.cluster.hierarchy import linkage,dendrogram
merg = linkage(x,method='ward')
dendrogram(merg,leaf_rotation=90)
plt.xlabel('data points')
plt.ylabel('euclidean distance')
plt.show()
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so the 'affinity' argument
# (removed in recent scikit-learn) is left out
hiyerartical_cluster = AgglomerativeClustering(n_clusters=3,linkage='ward')
clusters_hiyerartical = hiyerartical_cluster.fit_predict(x)
df['label_hiyerartical'] = clusters_hiyerartical
# Clusters plotted against the candidate serial number
plt.scatter(df[df.label_hiyerartical == 0 ]["Serial No."],df[df.label_hiyerartical == 0]['Chance of Admit'],color = "red")
plt.scatter(df[df.label_hiyerartical == 1 ]["Serial No."],df[df.label_hiyerartical == 1]['Chance of Admit'],color = "blue")
plt.scatter(df[df.label_hiyerartical == 2 ]["Serial No."],df[df.label_hiyerartical == 2]['Chance of Admit'],color = "green")
plt.title('Hierarchical Clustering')
plt.xlabel('Candidates')
plt.ylabel('Chance of Admit')
plt.show()
# Clusters plotted against the single PCA component
plt.scatter(data[df.label_hiyerartical == 0].x,data.y[df.label_hiyerartical==0],color='red')
plt.scatter(data[df.label_hiyerartical == 1].x,data.y[df.label_hiyerartical==1],color='blue')
plt.scatter(data[df.label_hiyerartical == 2].x,data.y[df.label_hiyerartical==2],color='green')
plt.title('Hierarchical Clustering')
plt.xlabel('X')
plt.ylabel('Chance of Admit')
plt.show()
Conclusion: the hierarchical clustering result is consistent with the K-means result, and the dendrogram additionally supports the choice of k = 3 clusters.
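"Consistent" is judged here from the plots. The agreement between two clusterings can also be quantified with the adjusted Rand index, which is 1.0 for identical partitions regardless of how the cluster labels are numbered. A minimal sketch on synthetic blobs (an assumed stand-in for the normalized features):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic clusters.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# 1.0 means the two methods produced the same grouping.
ari = adjusted_rand_score(km_labels, hc_labels)
print(ari)
```

On the real data, the same call with df['label_kmeans'] and df['label_hiyerartical'] would give a single number for how closely the two methods agree.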
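Summary: working through this introductory dataset teaches
1. several ways to visualize features
2. how to call the sklearn API
3. how to compare the quality of different models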
Code and dataset: https://github.com/Mounment/python-data-analyze/tree/master/kaggle/score
If you found this useful, please give the repository a star. Thanks!