Using UDPipe 2
Running UDPipe 2
UDPipe is a well-known tokenizer/tagger/parser for Universal Dependencies. The first version (currently 1.9.3) is easy to install and train models (not easy to train only a tagger, though). Nonetheless, it isn’t easy to use the UDPipe2, where at this moment, doesn’t have documentation since it is a prototype, as mentioned by the author here, and soon a more stable will be available (UDPipe 3).
I tried to use the code from UDPipe 2 prototype to train my models, which should be possible since the authors published results based on that code. The following sections are going through the whole procedure to train UDPipe 2 models.
Requirements
- Python <= 3.7
- virtualenv
Setup environment
Run the following commands to download and setup udpipe2:
git clone https://github.com/ufal/udpipe udpipe2
cd udpipe2
git checkout udpipe-2
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
There is one more dependency that we should install, the wembedding_service
that is a submodule from udpipe2. However, the submodule URL is broken, so we are going to manually download this dependency:
rm -rf wembedding_service
git clone https://github.com/ufal/wembedding_service
Download datasets
In my case, I wanted to train UDPipe 2 on the Bosque Treebank, located here, putting the datasets inside a separated folder:
mkdir datasets
git clone https://github.com/UniversalDependencies/UD_Portuguese-Bosque datasets/bosque
Creating npz files
The UDPipe2 firstly processes all sentences from your datasets folder to a .npz file, that is a zip containing the calculated embeddings from a BERT model.
Firstly, edit the script inside udpipe2/scripts/compute_embeddings.sh
, from:
#!/bin/sh
[ $# -ge 1 ] || { echo Usage: $0 data_directory embedding_args... >&2; exit 1; }
data="$1"; shift
for d in $data/*/; do
for f in $d*.conllu; do
[ $f.npz -nt $f ] && continue
qsub -p 0 -q gpu* -l gpu=1,mem_free=8G,h_data=16G -j y -o $f.log withcuda101 wembedding_service/venv/bin/python wembedding_service/compute_wembeddings.py --format=conllu $f $f.npz "$@"
done
done
to:
#!/bin/sh
[ $# -ge 1 ] || { echo Usage: $0 data_directory embedding_args... >&2; exit 1; }
data="$1"; shift
for d in $data/*/; do
for f in $d*.conllu; do
[ $f.npz -nt $f ] && continue
wembedding_service/venv/bin/python wembedding_service/compute_wembeddings.py --format=conllu $f $f.npz "$@"
done
done
The qsub
command that we removed is only used if you are running on a cluster, which isn’t my case.
Install wembeddings_service dependencies
The next step is to install wembeddings_service
requirements, for that we will need to create another environment since it works with a newer TensorFlow version. Firstly, deactivate from the current environment:
UDPipe 2 uses tensorflow-gpu==1.15.4, wembeddings_service usa tensorflow==2.3.1
deactivate
Then, create the new environment and install the dependencies:
cd wembeddings_service
python3 -m venv we_env
source we_env/bin/activate
pip install - requirements.txt
For the next step, double-check if your Tensorflow version is at 2.3.1.
Next, we are going to execute the creation of the .npz files:
cd udpipe2
bash scripts/compute_embeddings.sh datasets
If everything is ok, you should see .npz files created inside your datasets subfolders, for UD-Bosque the structure I got is:
datasets/
bosque/
pt_bosque-ud-train.conllu
pt_bosque-ud-train.conllu.npz
pt_bosque-ud-test.conllu
pt_bosque-ud-test.conllu.npz
pt_bosque-ud-dev.conllu
pt_bosque-ud-dev.conllu.npz
Every CoNLL-U file should be followed by a .npz file.
Training UDPipe2
cd udpipe2
python3 udpipe2.py my-model --train datasets/bosque/pt_bosque-ud-train.conllu
There will be no output at the standard output, UDPipe2 creates a log file that you can tail it located at my-model/log
.
Predicting CoNLL-U data
Now that you have a trained model, you can easily predict new data with:
cd udpipe
python udpipe2.py my-model --predict --predict_input in.conllu --predict_output out.conllu
Observe that my-model is the path to our trained model.
Evaluating your model
After predicting your input data, you can evaluate it with a gold standard file and your out.conllu
file.
cd udpipe2
python udpipe2_eval.py gold_standard.conllu out.conllu --verbose
The output should be similar to:
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens | 100.00 | 100.00 | 100.00 |
Sentences | 100.00 | 100.00 | 100.00 |
Words | 100.00 | 100.00 | 100.00 |
UPOS | 90.92 | 90.92 | 90.92 | 90.92
XPOS | 100.00 | 100.00 | 100.00 | 100.00
UFeats | 100.00 | 100.00 | 100.00 | 100.00
AllTags | 90.92 | 90.92 | 90.92 | 90.92
Lemmas | 100.00 | 100.00 | 100.00 | 100.00
UAS | 0.00 | 0.00 | 0.00 | 0.00
LAS | 0.00 | 0.00 | 0.00 | 0.00
CLAS | 0.00 | 0.00 | 0.00 | 0.00
MLAS | 0.00 | 0.00 | 0.00 | 0.00
BLEX | 0.00 | 0.00 | 0.00 | 0.00
And that’s it! Thank you for reading this far :)