scikit-learn's ColumnTransformer and OneHotEncoder

scikit-learn · Published 2018-10-11 21:41:19


This article introduces sklearn.compose.ColumnTransformer, which is new in scikit-learn 0.20, and the changes made to sklearn.preprocessing.OneHotEncoder in the same release.

ColumnTransformer

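Consider this scenario: a dataset in which every sample has n numeric features and m categorical features. Before training a model on it, we need to scale the n numeric features and one-hot encode the m categorical features. How do we do that?

This is not hard, but it is tedious. The usual approach is to pull the numeric columns and the categorical columns out separately, preprocess each group, and then concatenate the results before training. Besides being tedious, this makes it harder to preserve the original feature order (although the order usually does not matter). scikit-learn 0.20.0 adds the sklearn.compose.ColumnTransformer class, which applies different preprocessing to different input columns while still producing the result in a single feature space. That description is abstract, so let's go straight to an example from the official documentation: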

# Author: Pedro Morales <[email protected]>
#
# License: BSD 3 clause

from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

# output:
# model score: 0.790

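The example is simple, and its comments explain most of it: it loads the titanic3.csv dataset, uses 2 numeric features (age, fare) and 3 categorical features (embarked, sex, pclass), applies missing-value imputation and standardization to the numeric features, and applies missing-value imputation and one-hot encoding to the categorical features. Pipeline chains these steps together.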
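For comparison, here is a minimal sketch of the manual "split, preprocess, concatenate" approach next to the ColumnTransformer equivalent. The tiny in-memory DataFrame and its values are made up for illustration (they are not taken from titanic3.csv), and sparse_threshold=0 is passed only so that the combined result is always a dense array that can be compared directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, used only to keep the sketch self-contained.
df = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.28, 7.92, 53.10],
    'sex': ['male', 'female', 'female', 'male'],
})
num_cols = ['age', 'fare']
cat_cols = ['sex']

# Manual approach: preprocess each group of columns, then concatenate.
num_part = StandardScaler().fit_transform(df[num_cols])
cat_part = OneHotEncoder().fit_transform(df[cat_cols]).toarray()
manual = np.hstack([num_part, cat_part])

# ColumnTransformer approach: one object does the routing and the stacking.
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
combined = ct.fit_transform(df)

print(manual.shape, combined.shape)
print(np.allclose(manual, combined))
```

Both paths produce the same matrix here; the difference is that the ColumnTransformer can be dropped into a Pipeline and fitted on training data only, as in the Titanic example.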

Here is the signature of sklearn.compose.ColumnTransformer:

class sklearn.compose.ColumnTransformer(transformers, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None)

A brief look at the transformers and remainder parameters:

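transformers: a list of tuples, each of the form (name, transformer, column):
- name: a name for the transformer; any string will do.
- transformer: an estimator that supports fit and transform, or the string 'passthrough' or 'drop'. 'passthrough' passes the columns given by column through untransformed; 'drop' discards the columns given by column.
- column: which columns to transform; this can be positional indices, column names, and so on.
remainder: an estimator that supports fit and transform, or 'passthrough' or 'drop'. The default is 'drop'. It works much like the transformers parameter, but applies to the columns that no entry in transformers selects:
- drop: discard all columns other than those given by column.
- passthrough: pass all columns other than those given by column through unchanged.
- estimator: apply that estimator's transformation to all columns other than those given by column.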
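The options above can be sketched on a small DataFrame. The column names a, b and c and their values are made up for illustration; the shapes of the results show which columns survive under each setting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with three columns.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10, 20, 30], 'c': [7, 8, 9]})

# Default remainder='drop': only the transformed column 'a' is kept.
ct_drop = ColumnTransformer([('scale', StandardScaler(), ['a'])])
out_drop = ct_drop.fit_transform(df)

# remainder='passthrough': 'b' and 'c' are appended unchanged after 'a'.
ct_pass = ColumnTransformer([('scale', StandardScaler(), ['a'])],
                            remainder='passthrough')
out_pass = ct_pass.fit_transform(df)

# 'passthrough' and 'drop' can also be used in place of a transformer.
ct_mixed = ColumnTransformer([
    ('scale', StandardScaler(), ['a']),
    ('keep', 'passthrough', ['b']),
    ('discard', 'drop', ['c']),
])
out_mixed = ct_mixed.fit_transform(df)

print(out_drop.shape)   # columns kept: scaled 'a'
print(out_pass.shape)   # columns kept: scaled 'a', then 'b' and 'c'
print(out_mixed.shape)  # columns kept: scaled 'a', then 'b'
```

Note that the output columns follow the order of the transformers list, with any passed-through remainder columns placed after them, which is not necessarily the column order of the input.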

The new feature is very easy to use.

OneHotEncoder

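Another important change in scikit-learn 0.20 is that sklearn.preprocessing.OneHotEncoder now supports strings in addition to integers. For string features, this removes the sklearn.preprocessing.LabelEncoder step that used to be required beforehand.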

The old sklearn.preprocessing.OneHotEncoder signature:

class sklearn.preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', dtype=<class 'numpy.float64'>, sparse=True, handle_unknown='error')

The new sklearn.preprocessing.OneHotEncoder signature:

class sklearn.preprocessing.OneHotEncoder(n_values=None, categorical_features=None, categories=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

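The main difference between the two is the new categories parameter, which is meant to replace n_values; n_values is deprecated in 0.20 and scheduled for removal in 0.22. To have OneHotEncoder handle strings, you must use categories and cannot use n_values. The categories parameter accepts either 'auto' (the default behavior, meaning the categories are determined from the training data) or a list of lists/arrays. The former is easy to understand; the latter is a little harder, so let's look at examples.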

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.transform([['female', 'from US', 'uses Safari'],
...                ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])

# The categories_ attribute lists the categories found for each feature
>>> enc.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

In the example above, categories is left at its default. To get the equivalent result with an explicit list, the code can be rewritten as follows:

>>> enc_new = preprocessing.OneHotEncoder(categories=[['female', 'male'], ['from Europe', 'from US'], ['uses Firefox', 'uses Safari']])
>>> enc_new.fit(X)
OneHotEncoder(categorical_features=None,
       categories=[['female', 'male'], ['from Europe', 'from US'], ['uses Firefox', 'uses Safari']],
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc_new.transform([['female', 'from US', 'uses Safari'],
...                    ['male', 'from Europe', 'uses Safari']]).toarray()
array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
>>> enc_new.categories_
[array(['female', 'male'], dtype=object),
 array(['from Europe', 'from US'], dtype=object),
 array(['uses Firefox', 'uses Safari'], dtype=object)]

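As the examples show, when categories is a list of lists/arrays, categories[i] holds the categories for the i-th feature column, and the order within each list determines the order of the output columns. One thing to note: within the list/array for a single feature, the values must be either all numeric or all strings and cannot be mixed; if they are numeric, they must also be sorted.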
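The following sketch illustrates two consequences of passing categories explicitly. The 'from Africa' category and the unsorted [3, 1, 2] list are made up for illustration. First, a category that is listed but never seen during fit still gets its own (all-zero) output column. Second, an unsorted numeric list is rejected with a ValueError in the scikit-learn versions I am aware of.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US'], ['female', 'from Europe']]

# Explicit categories per column; 'from Africa' never appears in X.
genders = ['female', 'male']
locations = ['from Africa', 'from Europe', 'from US']

enc = OneHotEncoder(categories=[genders, locations])
encoded = enc.fit_transform(X).toarray()

# Two gender columns followed by three location columns.
print(encoded.shape)
print(enc.categories_)

# Numeric categories must be listed in sorted order.
try:
    OneHotEncoder(categories=[[3, 1, 2]]).fit([[1], [2], [3]])
    unsorted_rejected = False
except ValueError:
    unsorted_rejected = True
print(unsorted_rejected)
```

Fixing the categories up front is useful when the training data may not contain every value that can occur later, because the width and column order of the encoded output then no longer depend on the training sample.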

Reference:

scikit-learn.org
