Exporting the Youdao Dictionary Wordbook and Converting It to Excel

Excel, Youdao Dictionary · Published 2018-11-08 13:04:02

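Today I wanted to review the words I had saved in Youdao Dictionary, but the wordbook can only be exported as .bin, .xml, or .txt. I would rather have the words in Excel, where sorting and color-coding are easier, so I exported the .xml file and used Python to convert it to .xlsx. The code is as follows: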

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the XML file exported from the Youdao wordbook
tree = ET.parse('words.xml')
root = tree.getroot()

words = pd.DataFrame(columns = ['word','trans','phonetic'])
for item in root:
    # The first three children of each item hold the word, its translation and its phonetic spelling
    df = pd.DataFrame({'word': item[0].text,
                       'trans': item[1].text,
                       'phonetic': item[2].text},
                      index = item)
    print(df)
    words = pd.concat([words, df], ignore_index = True)

# Each word ends up repeated, so drop the duplicate rows
words = words.drop_duplicates()

words.to_excel('words1.xlsx', sheet_name = '1')
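A note on the loop above, offered as a likely explanation rather than a confirmed one: an Element behaves like a sequence of its child elements, so passing index = item asks pandas for one row per child. If each item in the export has five children, that would account for every word appearing five times. The sketch below collects one dict per item instead, which would make the drop_duplicates() call unnecessary. The sample XML, its tag names, and the helper names collect_rows and save_xlsx are all made up for illustration; only the positions of the first three children are taken from the original script.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag names and the five-child layout are assumptions
# about the exported file, not something confirmed by the original post.
SAMPLE = """<wordbook>
  <item><word>cup</word><trans>a small container</trans><phonetic>[kʌp]</phonetic><tags></tags><progress>1</progress></item>
  <item><word>tea</word><trans>a hot drink</trans><phonetic>[tiː]</phonetic><tags></tags><progress>1</progress></item>
</wordbook>"""


def collect_rows(root):
    """Return one dict per item, reading the first three children by position."""
    return [
        {"word": item[0].text, "trans": item[1].text, "phonetic": item[2].text}
        for item in root
    ]


def save_xlsx(rows, path):
    """Write the rows to an .xlsx file (needs pandas plus an Excel writer such as openpyxl)."""
    import pandas as pd

    frame = pd.DataFrame(rows, columns=["word", "trans", "phonetic"])
    frame.to_excel(path, sheet_name="1", index=False)


root = ET.fromstring(SAMPLE)
children_per_item = len(root[0])  # an Element's length is its number of children
rows = collect_rows(root)         # one row per item, so nothing to de-duplicate
```

With a real export, ET.parse('words.xml').getroot() would take the place of ET.fromstring(SAMPLE), and save_xlsx(rows, 'words1.xlsx') would write the file.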

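There are two things here that I still don't understand. The first is

words = words.drop_duplicates()

Why is it needed? Because without it every word is printed five times. I don't know why that happens, so I simply dropped the duplicates. The second is that when I run

words.to_excel('words1.xlsx', sheet_name = '1')

in Sublime Text 3, I get

UnicodeEncodeError: 'gbk' codec can't encode character '\u028c' in position 134: illegal multibyte sequence

My guess is that 'gbk' can't handle the phonetic symbols. I don't know how to fix it, but the script ran fine in Jupyter Notebook.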
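A possible reading of the error, consistent with the message but not confirmed by a full traceback: '\u028c' is the phonetic symbol ʌ, which GBK cannot represent. An .xlsx file is not written in GBK, so the failure may actually come from print(df) writing to a console whose output encoding is GBK, which is common when Python's output is piped on a Chinese-locale Windows system. The sketch below checks that the character really cannot be encoded as GBK and shows one way to switch standard output to UTF-8. The helper names can_encode and use_utf8_stdout are made up for illustration.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def use_utf8_stdout():
    """Switch sys.stdout to UTF-8 where the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


phonetic = "[kʌp]"  # contains U+028C, the character named in the error message
gbk_ok = can_encode(phonetic, "gbk")
utf8_ok = can_encode(phonetic, "utf-8")
```

Calling use_utf8_stdout() before the first print, setting the PYTHONIOENCODING environment variable to utf-8, or simply removing print(df) are options worth trying. Whether the editor's console then displays the symbols correctly depends on its own settings.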

That's all for now!
