Training UDPipe 2 on Bosque treebank

This tutorial is a sequel to a previous tutorial that shows how to install UDPipe 2. Here, we are going to train UDPipe 2 on the Bosque treebank, a Brazilian Portuguese treebank of annotated news articles that was first created in 2008 and ported to the Universal Dependencies format by 2016.

Although this tutorial focuses on the Bosque treebank, the reader should be able to easily extend it to other treebanks.

Downloading the dataset

Bosque is available on GitHub and, at the time of writing, is at version 2.8.

cd ~
git clone https://github.com/UniversalDependencies/UD_Portuguese-Bosque.git
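
Before training, it can be useful to sanity-check the downloaded .conllu files. The following is a minimal sketch, not part of UDPipe, that counts sentences, syntactic words and multiword tokens in CoNLL-U data. The function name conllu_stats and the sample sentence are hypothetical and chosen only for illustration; the sketch assumes the standard CoNLL-U layout (tab-separated columns, "#" comment lines, blank lines between sentences).

```python
def conllu_stats(lines):
    """Count sentences, syntactic words and multiword tokens in CoNLL-U lines."""
    sentences = words = multiword = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if in_sentence:
                sentences += 1
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        token_id = line.split("\t")[0]
        in_sentence = True
        if "-" in token_id:
            # Ranges like "2-3" mark multiword tokens (e.g. "ao" = "a" + "o").
            multiword += 1
        elif "." not in token_id:
            # Plain integer IDs are syntactic words; "1.1"-style IDs are empty nodes.
            words += 1
    if in_sentence:
        # Handle a file that does not end with a blank line.
        sentences += 1
    return {"sentences": sentences, "words": words, "multiword_tokens": multiword}


# Hypothetical one-sentence sample, used only to demonstrate the function.
SAMPLE = """# text = Vamos ao cinema.
1\tVamos\tir\tVERB\t_\t_\t0\troot\t_\t_
2-3\tao\t_\t_\t_\t_\t_\t_\t_\t_
2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_
3\to\to\tDET\t_\t_\t4\tdet\t_\t_
4\tcinema\tcinema\tNOUN\t_\t_\t1\tobl\t_\t_
5\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_

"""

print(conllu_stats(SAMPLE.splitlines()))
```

To run it on the real data, open one of the cloned files (for example pt_bosque-ud-train.conllu inside ~/UD_Portuguese-Bosque) with encoding="utf-8" and pass the file object to conllu_stats.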

Creating .npz files

The next step is to create the .npz files needed for training by running the udpipe/scripts/compute_embeddings.sh script:

cd ~
./udpipe/scripts/compute_embeddings.sh ~/UD_Portuguese-Bosque

If you get an error about package versions, check that you are in the wembeddings_service virtual environment, as mentioned in the previous tutorial.

Training UDPipe 2 on Bosque

Now you should be able to run the entire training process with the following command:

cd ~/udpipe
python3 udpipe2.py my-model --train ~/UD_Portuguese-Bosque/pt_bosque-ud-train.conllu \
                            --dev ~/UD_Portuguese-Bosque/pt_bosque-ud-dev.conllu \
                            --epochs 8:1e-3,8:1e-4
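
The --epochs value reads as a comma-separated list of stages, each written as epochs:learning_rate, so 8:1e-3,8:1e-4 would mean 8 epochs at a learning rate of 1e-3 followed by 8 epochs at 1e-4. That interpretation is an assumption based on the shape of the flag; check python3 udpipe2.py --help to confirm it. The sketch below, with the hypothetical helper parse_epochs, only illustrates how such a string breaks down and is not UDPipe's own code.

```python
def parse_epochs(spec):
    """Split a schedule such as '8:1e-3,8:1e-4' into (epochs, learning_rate) pairs."""
    schedule = []
    for stage in spec.split(","):
        epochs, learning_rate = stage.split(":")
        schedule.append((int(epochs), float(learning_rate)))
    return schedule


schedule = parse_epochs("8:1e-3,8:1e-4")
total_epochs = sum(epochs for epochs, _ in schedule)

for epochs, learning_rate in schedule:
    print(f"{epochs} epochs at learning rate {learning_rate}")
print(f"total epochs: {total_epochs}")
```

Under this reading, lengthening training or adding a third, lower learning-rate stage is just a matter of editing the string passed to --epochs.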

Additionally, if you want to train only for specific tasks, pass the --tags parameter with one or more values from “UPOS,XPOS,FEATS,LEMMAS”. For example, the following command trains only for the UPOS task:

cd ~/udpipe
python3 udpipe2.py my-model --train ~/UD_Portuguese-Bosque/pt_bosque-ud-train.conllu \
                            --dev ~/UD_Portuguese-Bosque/pt_bosque-ud-dev.conllu \
                            --epochs 8:1e-3,8:1e-4 \
                            --tags "UPOS"

The resulting model will be saved in the my-model folder, relative to the directory where the command was run (here, ~/udpipe).

Evaluating the trained model

First, generate predictions for the Bosque test set:

cd ~/udpipe
python3 udpipe2.py my-model --predict --predict_input ~/UD_Portuguese-Bosque/pt_bosque-ud-test.conllu --predict_output my-model-test.conllu

To evaluate your model against the gold-standard Bosque test set, run the following command:

cd ~/udpipe
python3 udpipe2_eval.py ~/UD_Portuguese-Bosque/pt_bosque-ud-test.conllu my-model-test.conllu --verbose

The output should be similar to:

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |    100.00 |    100.00 |    100.00 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |    100.00 |    100.00 |    100.00 |
UPOS       |     90.92 |     90.92 |     90.92 |     90.92
XPOS       |    100.00 |    100.00 |    100.00 |    100.00
UFeats     |    100.00 |    100.00 |    100.00 |    100.00
AllTags    |     90.92 |     90.92 |     90.92 |     90.92
Lemmas     |    100.00 |    100.00 |    100.00 |    100.00
UAS        |      0.00 |      0.00 |      0.00 |      0.00
LAS        |      0.00 |      0.00 |      0.00 |      0.00
CLAS       |      0.00 |      0.00 |      0.00 |      0.00
MLAS       |      0.00 |      0.00 |      0.00 |      0.00
BLEX       |      0.00 |      0.00 |      0.00 |      0.00
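
To make the UPOS row more concrete, here is a simplified sketch of what a tagging accuracy measures when the gold and predicted files share the same tokenization (as in the table above, where Tokens and Words are at 100.00). It is not the official evaluation script, which also handles word alignment and many more metrics; the helper names read_upos and upos_accuracy and the tiny gold/predicted samples are hypothetical.

```python
def read_upos(lines):
    """Collect the UPOS column (4th field) of every syntactic word."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.split("\t")
        if "-" in columns[0] or "." in columns[0]:
            continue  # skip multiword-token ranges and empty nodes
        tags.append(columns[3])
    return tags


def upos_accuracy(gold_lines, pred_lines):
    """Percentage of words whose predicted UPOS equals the gold UPOS."""
    gold = read_upos(gold_lines)
    pred = read_upos(pred_lines)
    if len(gold) != len(pred):
        raise ValueError("files differ in tokenization; use the official evaluator")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)


# Hypothetical four-word example: the prediction mislabels one word.
GOLD = [
    "1\tEla\tela\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tcomprou\tcomprar\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tlivros\tlivro\tNOUN\t_\t_\t2\tobj\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]
PRED = [
    "1\tEla\tela\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tcomprou\tcomprar\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tlivros\tlivro\tVERB\t_\t_\t2\tobj\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
]

print(upos_accuracy(GOLD, PRED))
```

With identical tokenization, precision, recall and F1 all reduce to this single ratio, which is why the UPOS row shows the same value in every column.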