The task of word ordering, or linearization, is to recover the original order of a shuffled sentence.
Replicating our results can be broken down into two main steps:
- Preprocess Penn Treebank with the splits and tokenization used in our experiments. Instructions are available in data/preprocessing/README_DATASET_CREATION.txt.
- Train, run, and evaluate the NGram and LSTM models of interest. Instructions are available in Usage.txt.
1. Preprocessing

A treebank is a corpus annotated with syntactic and semantic structure; the Penn Treebank (PTB) is one such corpus. Preprocessing applies the experiments' splits and tokenization to this corpus. The steps are as follows:
- Download the treebank_3 corpus and place it in the datasets/ directory.
- Run
cd data/preprocessing
- Run
bash create_dependency_files.sh
This script:
  - copies the WSJ constituency trees,
  - patches the WSJ part of the Penn Treebank using the NP bracketing script, and
  - converts the NP-bracketed constituency trees to dependency trees.
- Run
bash create_dataset.sh
This generates the core files:
  - ordered PTB files with and without base NP annotations,
  - the exact shuffling of the word multisets we used to generate the point estimates in our paper (a quick sanity check is sketched after this list), and
  - versions formatted for use with ZGen and the Yara parser.
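One property worth verifying is that each shuffled line is a word-multiset permutation of the corresponding ordered line. A minimal sanity check in Python, assuming the file names generated above (adjust the paths to wherever the dataset was created):

```python
from collections import Counter

# Compare each ordered line with its shuffled counterpart: the word
# multisets should be identical, only the order should differ.
ordered_path = "zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_no_eos.txt"
shuffled_path = "zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_shuffled_no_eos.txt"

with open(ordered_path) as f_ord, open(shuffled_path) as f_shuf:
    for n, (o, s) in enumerate(zip(f_ord, f_shuf), 1):
        assert Counter(o.split()) == Counter(s.split()), "mismatch on line %d" % n
print("every shuffled line is a permutation of its ordered line")
```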
After following these steps, the folder zgen_data_gold will contain the gold, ordered sentences; files ending in _ref_npsyms.txt include the base NP (BNP) annotations.

In the folder zgen_data_npsyms_freq3_unkUNK, infrequent tokens have been replaced with special symbols, and shuffled versions of the files are included. Words covered by a BNP annotation are never shuffled apart (e.g., <sonp> Black Monday <eonp> is treated as a single unit).
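To make the span-as-unit behavior concrete, here is a minimal sketch of BNP-preserving shuffling (our own illustration, not the repository's actual preprocessing code; the function name is invented):

```python
import random

# Group the tokens between <sonp> and <eonp> into single units before
# shuffling, so base NPs survive the shuffle intact.
def shuffle_keep_bnps(tokens, seed=0):
    units, i = [], 0
    while i < len(tokens):
        if tokens[i] == "<sonp>":
            j = tokens.index("<eonp>", i)   # end of this base NP span
            units.append(tokens[i:j + 1])   # the whole span is one unit
            i = j + 1
        else:
            units.append([tokens[i]])
            i += 1
    random.Random(seed).shuffle(units)
    return [tok for unit in units for tok in unit]

print(shuffle_keep_bnps("on <sonp> Black Monday <eonp> markets fell".split()))
```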
The required environment:
Python: Most recently tested with 2.7.9
NLTK (Python package): Most recently tested with 3.0.4
Java: Most recently tested with 1.8.0_31

To install NLTK:
- Run
sudo pip install -U nltk
- Download nltk_data and place it in any one of the directories that NLTK searches.
You may hit an error along the way:
IOError: Zipfile 'nltk_data/corpora/ptb.zip' does not contain ptb/datasets/treebank_3_original/wsj/02/wsj_0200.mrg
Solution: go into /usr/share/nltk-data/, unzip ptb.zip, and move the wsj folder into the resulting directory. (Converting the file names to uppercase does not appear to be necessary.)
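After applying the fix, a quick check (our own, not part of the repository) that NLTK can locate the corpus:

```python
import nltk

# nltk.data.path lists the directories NLTK searches for nltk_data;
# nltk.data.find raises LookupError if the ptb corpus is still missing.
print(nltk.data.path)
print(nltk.data.find("corpora/ptb"))
```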
2. Train, run, and evaluate the NGram and LSTM models
N-gram
Set up the environment by installing KenLM, including its Python module:
pip install https://github.com/kpu/kenlm/archive/master.zip
The commands below cover two configurations, N-Gram (Words+BNPs) and N-Gram (Words, without future costs). If running in a virtual machine, make sure it has enough memory allocated.
- Build the 5-gram model (order 5):
bin/lmplz -o 5 -S 3145728K -T /tmp </mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/train_words_with_np_symbols_no_eos.txt > /home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa
- Build the unigram model (order 1):
bin/lmplz -o 1 -S 3145728K -T /tmp --discount_fallback </mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/train_words_with_np_symbols_no_eos.txt > /home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa
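Once both ARPA files exist, the KenLM Python module installed above can load and query them directly; a small check under the paths used in these commands:

```python
import kenlm

# Load the two language models trained with lmplz; kenlm.Model accepts
# ARPA files directly, and model.score returns a log10 probability.
lm5 = kenlm.Model("/home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa")
lm1 = kenlm.Model("/home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa")

sent = "<sonp> Black Monday <eonp>"
print(lm5.score(sent, bos=True, eos=False))   # 5-gram score
print(lm1.score(sent, bos=False, eos=False))  # unigram score (future costs)
```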
- Run the N-gram decoder. It takes the previously prepared shuffled_no_eos.txt file as input, scores candidate orderings with the two models built above, and writes the reordered result:
python ngram_decoder.py /home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_shuffled_no_eos.txt 64 --future /home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa > /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5.txt
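The decoding idea behind ngram_decoder.py is beam search over partial orderings: the 5-gram model scores the placed prefix, and the unigram model supplies a "future cost" for the words not yet placed. A minimal sketch under that interpretation (not the repository's actual implementation, far less efficient, and treating the BNP symbols as ordinary tokens for brevity):

```python
import kenlm

lm = kenlm.Model("/home/yingtao/output/LM5_noeos_withnpsyms_freq3_unkUNK.arpa")
future_lm = kenlm.Model("/home/yingtao/output/LM1_noeos_withnpsyms_freq3_unkUNK.arpa")

def future_cost(words):
    # Unigram log10 probability of the words that still need placing.
    return sum(future_lm.score(w, bos=False, eos=False) for w in words)

def decode(bag, beam_size=64):
    # Each hypothesis is (placed prefix, remaining words).
    beam = [((), tuple(bag))]
    for _ in range(len(bag)):
        candidates = []
        for prefix, remaining in beam:
            for i, w in enumerate(remaining):
                new_prefix = prefix + (w,)
                rest = remaining[:i] + remaining[i + 1:]
                score = (lm.score(" ".join(new_prefix), bos=True, eos=False)
                         + future_cost(rest))
                candidates.append((score, new_prefix, rest))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [(p, r) for _, p, r in candidates[:beam_size]]  # prune to top-k
    return " ".join(beam[0][0])

print(decode("fell markets <sonp> Black Monday <eonp> on".split(), beam_size=8))
```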
- Randomly replace the unk/UNK symbols with low-frequency words and strip the BNP markers:
python randomly_replace_unkUNK.py \
--generated_reordering_with_unk /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5.txt \
--gold_unprocessed /mnt/shared/order/datasets//zgen_data_gold/valid_words_ref_npsyms.txt \
--gold_processed /mnt/shared/order/datasets/zgen_data_npsyms_freq3_unkUNK/npsyms/valid_words_with_np_symbols_no_eos.txt \
--out_file /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt \
--remove_npsyms
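The idea behind randomly_replace_unkUNK.py, as we understand it: aligning the processed gold file with the unprocessed one recovers which rare words were mapped to unk/UNK, and those words are then assigned at random to the unk slots of the generated output. A toy sketch of that interpretation (names and logic are ours, not the script's):

```python
import random

UNKS = {"unk", "UNK"}
NP_SYMS = {"<sonp>", "<eonp>"}

def replace_unks(generated, gold_processed, gold_unprocessed, seed=0):
    # BNP symbols have no counterpart in the unprocessed gold file,
    # so drop them before aligning token-by-token.
    proc = [t for t in gold_processed if t not in NP_SYMS]
    rare = [orig for p, orig in zip(proc, gold_unprocessed) if p in UNKS]
    random.Random(seed).shuffle(rare)
    out = []
    for tok in generated:
        if tok in NP_SYMS:
            continue                       # mirrors --remove_npsyms
        out.append(rare.pop() if tok in UNKS and rare else tok)
    return out

gen = "on <sonp> Black unk <eonp> markets fell".split()
gold_proc = "markets fell on <sonp> Black unk <eonp>".split()
gold_unproc = "markets fell on Black Monday".split()
print(" ".join(replace_unks(gen, gold_proc, gold_unproc)))
# -> on Black Monday markets fell
```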
- Compute the BLEU score:
./ScoreBLEU.sh -t /home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt -r /mnt/shared/order/datasets/zgen_data_gold/valid_words_ref.txt -odir /home/yingtao/output/test
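If ScoreBLEU.sh is not at hand, NLTK's corpus_bleu gives a rough cross-check (its smoothing and tokenization may differ from the paper's scorer, so expect small deviations):

```python
from nltk.translate.bleu_score import corpus_bleu

with open("/home/yingtao/output/output_valid_with_npsyms_futurecosts_beam64_lm5_removed_unk.txt") as f:
    hyps = [line.split() for line in f]
with open("/mnt/shared/order/datasets/zgen_data_gold/valid_words_ref.txt") as f:
    refs = [[line.split()] for line in f]  # one reference per sentence

print("BLEU: %.2f" % (100 * corpus_bleu(refs, hyps)))
```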
Without the BNP annotations, the scores are lower.