Text Analysis with NLTK

NLTK, Text Analysis · Published 2019-03-22 18:38:00

NLTK (Natural Language Toolkit) is a powerful Python package that provides a set of natural language processing algorithms, such as tokenization, part-of-speech tagging, stemming, named entity recognition, and classification.

Installing and importing NLTK:

pip install nltk

import nltk

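I. Tokenization

Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step of text analysis: it breaks a piece of text into smaller units (such as words or sentences), and each unit is called a token. When a sentence is tokenized, the tokens are its words; when a paragraph is tokenized, the tokens are its sentences. NLTK supports both sentence tokenization and word tokenization.

1. Sentence tokenization

Sentence tokenization splits a paragraph into sentences: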

# Needs the Punkt sentence tokenizer data (see nltk.download)
from nltk.tokenize import sent_tokenize
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

Result of sentence tokenization:

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]
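
To see why a trained sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below uses only the standard library re module, not NLTK, and splits after every '.', '?' or '!' that is followed by whitespace. The pattern is an assumption chosen for illustration. It breaks the first sentence at the abbreviation "Mr.", which sent_tokenize keeps intact in the result above.

```python
import re

text = ("Hello Mr. Smith, how are you doing today? "
        "The weather is great, and city is awesome.")

# Naive rule: a sentence ends at '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
# The abbreviation "Mr." is treated as a sentence end, so the first
# sentence is cut into 'Hello Mr.' and 'Smith, how are you doing today?'
```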

2. Word tokenization

Word tokenization splits a sentence into words:

from nltk.tokenize import word_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=word_tokenize(text)
print(tokenized_text)

Result of word tokenization:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 
'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.',
'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

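Note that after tokenization, punctuation marks are included in the result as tokens of their own.

II. Processing tokens

Processing the tokens involves removing punctuation, removing stop words, and normalizing the vocabulary.

1. Removing punctuation

Apply the following to each token to strip punctuation from the string. string.punctuation contains all ASCII punctuation characters, and each of them that occurs in the token is replaced with a space.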

import string

s = 'abc.'
# Map every punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(repr(s.translate(table)))  # prints 'abc '
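
The same str.translate approach can also delete punctuation instead of replacing it with spaces, by passing the characters as the third argument of str.maketrans. Here is a minimal sketch comparing the two behaviours. The helper names replace_punct and delete_punct are hypothetical and used only for this illustration.

```python
import string

# Map every punctuation character to a space.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
# Third argument of maketrans: characters to delete outright.
to_nothing = str.maketrans("", "", string.punctuation)


def replace_punct(token):
    """Replace each punctuation character with a space."""
    return token.translate(to_space)


def delete_punct(token):
    """Remove punctuation characters entirely."""
    return token.translate(to_nothing)


for token in ["abc.", "pinkish-blue", "n't", "?"]:
    print(repr(token), repr(replace_punct(token)), repr(delete_punct(token)))
```

With deletion, punctuation-only tokens such as '?' become empty strings, which are easy to filter out of the token list afterwards.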

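2. Removing stop words

Stop words are the noise words in a text: they occur very often but carry little meaning of their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpora include a stop word list, and you have to remove the stop words from the token list yourself.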

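import nltk
from nltk.corpus import stopwords

# Requires the stop word corpus: nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# text is the sample paragraph defined in the tokenization section
word_tokens = nltk.tokenize.word_tokenize(text.strip())
# Compare in lower case, because the stop word list is all lower case
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]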
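
The two clean-up steps in this section can be combined into one small helper. The sketch below is one possible way to wire them together and is not part of NLTK: clean_tokens is a hypothetical name, and the tiny stop word set is an assumption that stands in for stopwords.words("english") so the example runs without downloading the corpus.

```python
import string

# Stand-in for stopwords.words("english"); the real list is much longer.
SAMPLE_STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the", "and"}


def clean_tokens(tokens, stop_words):
    """Drop punctuation-only tokens and stop words (case-insensitive)."""
    punctuation = set(string.punctuation)
    cleaned = []
    for token in tokens:
        if all(ch in punctuation for ch in token):
            continue  # token such as ',' or '.'
        if token.lower() in stop_words:
            continue  # token such as 'The' or 'is'
        cleaned.append(token)
    return cleaned


tokens = ['The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
print(clean_tokens(tokens, SAMPLE_STOP_WORDS))
```

Lower-casing each token before the lookup matters because sentence-initial words such as "The" would otherwise slip past a lower-case stop word list.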

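III. Lexicon Normalization

Lexicon normalization reduces the various derived forms of a word to a common base form. NLTK offers two approaches. Stemming strips affixes by rule to produce a stem, which is not always a real word; PorterStemmer works this way. Lemmatization maps a word to its dictionary form (lemma); WordNetLemmatizer does this using WordNet.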

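# WordNetLemmatizer needs the WordNet corpus: nltk.download("wordnet")
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
# "v" tells the lemmatizer to treat the word as a verb
print("Lemmatized Word:", lem.lemmatize(word, "v"))
print("Stemmed Word:", stem.stem(word))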
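
To make the effect of stemming more concrete, the following sketch runs PorterStemmer over several derived forms of the same word. The word list is an assumption chosen for illustration. The stemmer is purely rule-based, so this example needs no corpus download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Several derived forms that should share one stem.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)

# All forms collapse to a single stem.
print(set(stems))
```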

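IV. Part-of-speech tagging

The main goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word, such as noun, verb, or adjective. A POS tagger looks at the word's context within the sentence and assigns it the corresponding tag.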

import nltk

# Needs the tokenizer and POS tagger data (see nltk.download)
sent = "Albert Einstein was born in Ulm, Germany in 1879."
tokens = nltk.word_tokenize(sent)
print(nltk.pos_tag(tokens))
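
nltk.pos_tag returns a list of (word, tag) tuples, and its default English tagger uses Penn Treebank tags. A common next step is to select words by tag. In the sketch below the tagged list is written by hand to show the shape of the data. It is an illustrative assumption, not captured tagger output, and words_with_tag is a hypothetical helper.

```python
# Hand-written sample in the (word, tag) format that nltk.pos_tag returns.
tagged = [
    ("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("in", "IN"), ("Ulm", "NNP"), (",", ","), ("Germany", "NNP"),
    ("in", "IN"), ("1879", "CD"), (".", "."),
]


def words_with_tag(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


print(words_with_tag(tagged, "NNP"))  # proper nouns
print(words_with_tag(tagged, "VB"))   # verbs in any form
```

Matching on a tag prefix rather than the full tag keeps related tags together, for example VBD (past tense) and VBN (past participle).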

V. Classification

(Omitted.)
