Named Entity Recognition with LSTM-CRF

Aug 17, 2017   #Machine Learning  #Python  #NLP  #Tensorflow 

Introduction

I tried out a TensorFlow implementation of a bidirectional LSTM-CRF on a Kaggle dataset for Named Entity Recognition.
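
For context, the heart of such a model in the TensorFlow 1.x contrib API of the time looks roughly like the following. This is a minimal sketch with made-up sizes, not the exact implementation I ran; character embeddings and dropout are omitted.

```python
import tensorflow as tf

VOCAB, DIM, HIDDEN, NUM_TAGS = 20000, 300, 100, 17  # illustrative sizes

# Batches of word ids and gold tag ids, padded to the longest sentence.
word_ids = tf.placeholder(tf.int32, [None, None])
labels = tf.placeholder(tf.int32, [None, None])
seq_len = tf.placeholder(tf.int32, [None])

embeddings = tf.get_variable("embeddings", [VOCAB, DIM])
x = tf.nn.embedding_lookup(embeddings, word_ids)

# Bidirectional LSTM over the word embeddings.
cell_fw = tf.contrib.rnn.LSTMCell(HIDDEN)
cell_bw = tf.contrib.rnn.LSTMCell(HIDDEN)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, x, sequence_length=seq_len, dtype=tf.float32)
h = tf.concat([out_fw, out_bw], axis=-1)

# Per-token tag scores, with a CRF layer on top for the sequence-level loss.
logits = tf.layers.dense(h, NUM_TAGS)
log_lik, transition_params = tf.contrib.crf.crf_log_likelihood(
    logits, labels, seq_len)
loss = tf.reduce_mean(-log_lik)
train_op = tf.train.AdamOptimizer().minimize(loss)
```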

Dataset

I experimented with the Annotated Corpus for Named Entity Recognition dataset on Kaggle. It uses the IOB format and contains the entity types listed below. In my environment, reading the CSV data in Python raised an error, so I forcibly converted the file encoding to UTF-8 with nkf before loading it (a few sentences still caused errors, so I simply dropped them this time; a sketch of the workaround follows the table).

| Tag | Meaning |
| --- | --- |
| geo | Geographical Entity |
| org | Organization |
| per | Person |
| gpe | Geopolitical Entity |
| tim | Time indicator |
| art | Artifact |
| eve | Event |
| nat | Natural Phenomenon |
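
The workaround itself is just a forced encoding conversion plus dropping the stragglers, roughly like this (a sketch; the file names are placeholders rather than the dataset's exact ones):

```python
# Shell: force the CSV to UTF-8 first, e.g.
#   nkf -w ner_dataset.csv > ner_dataset.utf8.csv
clean_lines = []
dropped = 0
with open("data/ner_dataset.utf8.csv", "rb") as f:
    for raw in f:
        try:
            clean_lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            dropped += 1  # discard the handful of lines that still fail
```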

Before the experiments

In the original implementation, the experiment settings are written in config.py and then the script is run, but personally I am not fond of this approach. Since you normally try several experimental conditions, hard-coding them means keeping one variant of config.py per condition just to record what each condition was. That said, I have no best practice of my own either, so this time I tried TOML.

├── data
├── experiments
│   └── example
│       ├── config.toml
│       └── results
│           ├── 0
│           └── 1
├── src

I prepare one directory per experimental condition and describe that condition in a TOML file. At run time the TOML file is passed as a command-line argument, and output files such as logs and results are saved into the same condition's directory (see the sketch below). This layout is nice in that each condition lives next to its results, but it becomes fairly tedious when you want to give a hyperparameter search range as a distribution (or as scattered intervals): describing a distribution in any structured text format, TOML included, requires specifying both the kind of distribution and its parameters. You could adopt a language-dependent notation such as randn(a, b), but that is not very appealing either.
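
A minimal sketch of that flow (the `toml` package and the script name are my assumptions, not necessarily the original code's):

```python
import argparse
import toml  # pip install toml

parser = argparse.ArgumentParser()
parser.add_argument("config", help="path to one experiment's config.toml")
args = parser.parse_args()

config = toml.load(args.config)  # e.g. experiments/example/config.toml
paths = config["path"]
hp = config["hyperparameters"]
# ... build and train the model from `hp`,
# writing logs and results under paths["output_path"]
```

Run as, say, `python src/train.py experiments/example/config.toml`.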

Experiments

The experiment settings are given in the following TOML file.

[path]
output_path = "experiments/example/results/"
model_output = "experiments/example/results/model.weights/"
learning_curves_output = "experiments/example/results/lc.json"
log_path = "experiments/example/results/log.txt"

[data]
# embeddings
dim = 300
dim_char = 100
glove_filename = "data/glove.6B/glove.6B.300d.txt"
# trimmed embeddings (created from glove_filename with build_data.py)
trimmed_filename = "data/glove.6B.300d.trimmed.npz"

# supported sequence data types: iob and noniob
data_type = "iob"

dev_filename = "data/kaggle-GMB_devel.iob"
test_filename = "data/kaggle-GMB_test.iob"
train_filename = "data/kaggle-GMB_train.iob"

# if not a negative number, the max number of examples
max_iter = -1

# vocab (created from dataset with build_data.py)
words_filename = "data/kaggle-GMB_words.txt"
tags_filename = "data/kaggle-GMB_tags.txt"
chars_filename = "data/kaggle-GMB_chars.txt"

[hyperparameters]
# if a hyperparameter is given as a list of values, do random search
num_random_search = 10
nepochs = 15
dropout = [0.1, 0.3, 0.5, 0.7, 0.9]
batch_size = 10
lr_method = "adam"
lr = [0.001, 0.003, 0.01, 0.03, 0.1]
lr_decay = [0.01, 0.03, 0.1, 0.3, 0.9]
clip = [1, 3, 5, 7, 9]
nepoch_no_imprv = 3
reload = false
# model hyperparameters
hidden_size = [100, 200, 300]
char_hidden_size = [100, 200, 300]

train_embeddings = true
# NOTE: if both chars and crf, only 1.6x slower on GPU
# if crf, training is 1.7x slower on CPU
crf = true
# if char embedding, training is 3.5x slower on CPU
chars = true
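
The list-valued entries above are what the random search draws from. The sampling logic could be as simple as this (a sketch; the original implementation's internals may differ):

```python
import random
import toml

config = toml.load("experiments/example/config.toml")
hp = config["hyperparameters"]

def sample_setting(hp):
    """Any list-valued hyperparameter is sampled uniformly; scalars pass through."""
    return {k: random.choice(v) if isinstance(v, list) else v
            for k, v in hp.items()}

for trial in range(hp["num_random_search"]):
    setting = sample_setting(hp)
    # train/evaluate with `setting`, saving outputs under results/<trial>/
    print(trial, setting)
```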

Results

Best parameters

The best setting found by the random search is as follows.

[hyperparameters]
num_random_search = 15
batch_size = 10
nepochs = 15
char_hidden_size = 100
dropout = 0.9
lr_method = "adam"
chars = true
crf = true
reload = false
lr_decay = 0.3
lr = 0.003
train_embeddings = true
clip = 3
hidden_size = 100
nepoch_no_imprv = 3
[path]
learning_curves_output = "experiments/example/results/0/lc.json"
model_output = "experiments/example/results/model.weights/0/"
log_path = "experiments/example/results/0/log.txt"
output_path = "experiments/example/results/0/"
[data]
dim = 300
words_filename = "data/kaggle-GMB_words.txt"
test_filename = "data/kaggle-GMB_test.iob"
max_iter = -1
tags_filename = "data/kaggle-GMB_tags.txt"
dim_char = 100
trimmed_filename = "data/glove.6B.300d.trimmed.npz"
dev_filename = "data/kaggle-GMB_devel.iob"
glove_filename = "data/glove.6B/glove.6B.300d.txt"
train_filename = "data/kaggle-GMB_train.iob"
data_type = "iob"
chars_filename = "data/kaggle-GMB_chars.txt"

Learning curves: chunk-level macro F1

(t) denotes the training data and (v) the validation data.
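
Here "chunk-level" means a predicted entity counts as correct only when both its span and its type match a gold chunk exactly. A rough sketch of that bookkeeping (my own helper, not the code used in the experiments):

```python
def iob_chunks(tags):
    """Extract (entity_type, start, end) spans from an IOB tag sequence.

    A stray I- tag with no matching open chunk is treated as B-.
    """
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags):
        if (tag == "O" or tag.startswith("B-")
                or (tag.startswith("I-") and tag[2:] != ctype)):
            if ctype is not None:
                chunks.append((ctype, start, i))
                ctype = None
            if tag != "O":
                start, ctype = i, tag[2:]
    if ctype is not None:
        chunks.append((ctype, start, len(tags)))
    return chunks

# Exact-match precision/recall/F1 over chunks for one sentence pair:
gold = set(iob_chunks(["B-geo", "I-geo", "O", "B-per"]))
pred = set(iob_chunks(["B-geo", "I-geo", "O", "B-org"]))
tp = len(gold & pred)
p, r = tp / len(pred), tp / len(gold)
f1 = 2 * p * r / (p + r) if p + r else 0.0  # macro F1 averages this per class
```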

Learning curves: per-class F1 at the tag level

Conclusion

I only learned of it recently, but apparently there is a project called Cookiecutter Data Science that can generate a scaffold for data science projects.

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
