A PyTorch Implementation of BERT, the Strongest Pre-trained Model (Unofficial)

PyTorch, Language Model · Published 2018-10-18 14:43:26

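Recently, Google AI published an NLP paper introducing BERT, a new language representation model. It is regarded as the strongest pre-trained NLP model to date and set new state-of-the-art results on 11 NLP tasks. Today, Synced (機器之心) noticed a PyTorch implementation of BERT on GitHub, written by Junseong Kim of Scatter Lab.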

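Introduction

Google AI's BERT paper reports striking results on a range of NLP tasks, including an F1 score on the SQuAD v1.1 QA task that exceeds human performance. The paper shows that a Transformer (self-attention) based encoder, trained with a suitable language-model training method, can serve as a powerful alternative to previous language models. More importantly, it shows that this pre-trained language model can be applied to any NLP task without designing a task-specific model architecture.

This article focuses on the implementation of BERT. The code is simple and easy to follow. Some of it is based on The Annotated Transformer, an annotated implementation of the paper "Attention Is All You Need".

The project is still a work in progress. The code has not yet been verified.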

Language Model Pre-training

In the paper, the authors present two new methods for training a language model: the "masked language model" (MLM) and "next sentence prediction".

Masked LM

See the original paper: 3.3.1 Task #1: Masked LM

Input Sequence  : The man went to [MASK] store with [MASK] dog
Target Sequence :                 the               his

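Rules:

A randomly chosen 15% of the input tokens are changed according to the following sub-rules:

80% of the selected tokens are replaced with the [MASK] token.
10% of the selected tokens are replaced with a [RANDOM] token (another word).
10% of the selected tokens are left unchanged, but still have to be predicted.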
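
The masking rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's actual code: the function name `mask_tokens`, the `select_prob` parameter, and the toy vocabulary are assumptions made for this example, and the random replacement may occasionally pick the original word again.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: keep the token, no prediction target.
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # every selected token is a prediction target
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            masked.append(tok)                # 10%: keep unchanged
    return masked, labels

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "the man went to the store with his dog".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Keeping the original token as the label even when the input is left unchanged is what makes the last 10% case "unchanged but still predicted".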

Next Sentence Prediction

See the original paper: 3.3.2 Task #2: Next Sentence Prediction

Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : IsNext

Input : [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label : NotNext

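"Can this sentence be followed naturally by the next one?"

The goal is to understand the relationship between two text sentences, which language modeling alone does not capture directly.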

Rules:

50% of the time, the next sentence is the sentence that actually follows.
50% of the time, the next sentence is an unrelated sentence.
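
A hedged sketch of how such training pairs might be sampled is shown below. The helper `make_nsp_example` and the sample sentences are hypothetical and are not taken from the repository; the sketch assumes the corpus holds at least two sentence pairs, so that an unrelated second sentence can be drawn from a different line.

```python
import random

def make_nsp_example(pairs, index, rng):
    """pairs: list of (first, second) sentence tuples. Returns (first, second, label)."""
    first, true_next = pairs[index]
    if rng.random() < 0.5:
        # 50%: keep the sentence that actually follows.
        return first, true_next, "IsNext"
    # 50%: take the second sentence from a different line of the corpus.
    other = rng.choice([i for i in range(len(pairs)) if i != index])
    return first, pairs[other][1], "NotNext"

pairs = [
    ("the man went to the store", "he bought a gallon of milk"),
    ("penguins live in the south", "they are flightless birds"),
    ("I can stay", "here all night"),
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
for i in range(len(pairs)):
    first, second, label = make_nsp_example(pairs, i, rng)
    print(f"[CLS] {first} [SEP] {second} [SEP] -> {label}")
```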

Usage

Note: each line of your corpus should contain two sentences, separated by a tab (\t).

Welcome to the \t the jungle \n
I can stay \t here all night \n
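
As an illustration of this format, the following sketch splits each line on the tab character into a sentence pair. The function name `read_corpus` is hypothetical, and silently skipping lines that do not contain exactly one tab is an assumption of this example, not documented behaviour of the project.

```python
def read_corpus(lines):
    """Yield (first, second) sentence pairs from tab-separated lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # assumption: ignore lines without exactly one tab
        yield parts[0].strip(), parts[1].strip()

# In these Python string literals, \t and \n are a real tab and newline.
sample = ["Welcome to the \t the jungle \n", "I can stay \t here all night \n"]
sentence_pairs = list(read_corpus(sample))
print(sentence_pairs)
```

The same function works on an open file object, since iterating over a text file yields one line at a time.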

1. Build a vocab from your own corpus

python build_vocab.py -c data/corpus.small -o data/corpus.small.vocab

usage: build_vocab.py [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
                      [-e ENCODING] [-m MIN_FREQ]

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -o OUTPUT_PATH, --output_path OUTPUT_PATH
  -s VOCAB_SIZE, --vocab_size VOCAB_SIZE
  -e ENCODING, --encoding ENCODING
  -m MIN_FREQ, --min_freq MIN_FREQ
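
To give a rough idea of what the VOCAB_SIZE and MIN_FREQ options could mean, here is a minimal frequency-based vocabulary builder. It is only a sketch under stated assumptions: the special-token list, whitespace tokenization, and the function `build_vocab` are invented for illustration and may differ from what build_vocab.py actually does.

```python
from collections import Counter

# Assumed set of special tokens; the real project may use different ones.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(lines, vocab_size=None, min_freq=1):
    """Map each token to an integer id, most frequent words first."""
    counts = Counter(tok for line in lines for tok in line.split())
    # most_common() sorts by frequency; ties keep first-seen order.
    words = [w for w, c in counts.most_common()
             if c >= min_freq and w not in SPECIAL_TOKENS]
    if vocab_size is not None:
        # Special tokens count towards the size limit in this sketch.
        words = words[: max(vocab_size - len(SPECIAL_TOKENS), 0)]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}

corpus = ["Welcome to the \t the jungle", "I can stay \t here all night"]
vocab = build_vocab(corpus, min_freq=1)
print(vocab)
```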

2. Build a BERT training dataset from your own corpus

python build_dataset.py -c data/corpus.small -v data/corpus.small.vocab -o data/dataset.small

usage: build_dataset.py [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING] -o
                        OUTPUT_PATH

optional arguments:
  -h, --help            show this help message and exit
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -c CORPUS_PATH, --corpus_path CORPUS_PATH
  -e ENCODING, --encoding ENCODING
  -o OUTPUT_PATH, --output_path OUTPUT_PATH

3. Train your own BERT model

python train.py -d data/dataset.small -v data/corpus.small.vocab -o output/

usage: train.py [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH -o
                OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS]
                [-s SEQ_LEN] [-b BATCH_SIZE] [-e EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  -d TRAIN_DATASET, --train_dataset TRAIN_DATASET
  -t TEST_DATASET, --test_dataset TEST_DATASET
  -v VOCAB_PATH, --vocab_path VOCAB_PATH
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
  -hs HIDDEN, --hidden HIDDEN
  -n LAYERS, --layers LAYERS
  -a ATTN_HEADS, --attn_heads ATTN_HEADS
  -s SEQ_LEN, --seq_len SEQ_LEN
  -b BATCH_SIZE, --batch_size BATCH_SIZE
  -e EPOCHS, --epochs EPOCHS

Original link: https://github.com/codertimo/BERT-pytorch
