NiuTrans Open Source Community

Advanced Usage of NiuTrans

Configuration Files
For advanced users, NiuTrans offers many very helpful features. All of these features can be enabled by modifying the configuration files in the installation package. The configuration files used by the system are described below.

NiuTrans.phrase.user.config
"NiuTrans.phrase.user.config" records the settings of the decoder. Users can modify this file to access more advanced features of NiuTrans. "NiuTrans.phrase.user.config" contains the following basic settings.

###########################################
### NiuTrans decoder configuration file ###
###          phrase-based system        ###
###              2011-07-01             ###
###########################################

#>>> runtime resource tables

# language model
param="Ngram-LanguageModel-File"     value="../sample-data/lm.trie.data"

# target-side vocabulary
param="Target-Vocab-File"            value="../sample-data/lm.vocab"

# MaxEnt-based lexicalized reordering model
param="ME-Reordering-Table"          value="../training/me.reordering.table"

# MSD lexicalized reordering model
param="MSD-Reordering-Model"         value="../training/msd.reordering.table"

# phrase translation model
param="Phrase-Table"                 value="../training/phrase.translation.table"

#>>> runtime parameters

# number of MERT iterations
param="nround"                       value="10"

# order of n-gram language model
param="ngram"                        value="3"

# use punctuation pruning (1) or not (0)
param="usepuncpruning"               value="1"

# use cube-pruning (1) or not (0)
param="usecubepruning"               value="1"

# use maxent reordering model (1) or not (0)
param="use-me-reorder"               value="1"

# use msd reordering model (1) or not (0)
param="use-msd-reorder"              value="1"

# number of threads
param="nthread"                      value="4"

# how many translations are dumped
param="nbest"                        value="20"

# output OOVs and word-deletions in the translation result
param="outputnull"                   value="0"

# beam size (or beam width)
param="beamsize"                     value="20"

# number of references of dev. set
param="nref"                         value="1"

#>>> model parameters

# features defined in the log-linear model
#  0: n-gram language model
#  1: number of target-words
#  2: Pr(e|f). f->e translation probability.
#  3: Lex(e|f). f->e lexical weight
#  4: Pr(f|e). e->f translation probability.
#  5: Lex(f|e). e->f lexical weight
#  6: number of phrases
#  7: number of bi-lex links (not fired in current version)
#  8: number of NULL-translation (i.e. word deletion)
#  9: MaxEnt-based lexicalized reordering model
# 10: <UNDEFINED>
# 11: MSD reordering model: Previous & Monotonic
# 12: MSD reordering model: Previous & Swap
# 13: MSD reordering model: Previous & Discontinuous
# 14: MSD reordering model: Following & Monotonic
# 15: MSD reordering model: Following & Swap
# 16: MSD reordering model: Following & Discontinuous

# feature weights
param="weights" \
value="1.000 0.500 0.200 0.200 0.200 0.200 0.500 0.500 -0.100 1.000 0.000 0.100 0.100 0.100 0.100 0.100 0.100"

# bound the feature weight in MERT
# e.g. the first entry "-3:7" means that the first feature weight ranges over [-3, 7]
param="ranges" \
value="-3:7 -3:3 0:3 0:0.4 0:3 0:0.4 -3:3 -3:3 -3:0 -3:3 0:0 0:3 0:0.3 0:0.3 0:3 0:0.3 0:0.3"

# fix a dimension (1) or not (0)
param="fixedfs"  value="0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"

NiuTrans.phrase.train.model.config
"\config\NiuTrans.phrase.train.model.config" records the settings for training the translation model and the reordering models. The file contains the following information.

###########################################
### NiuTrans  phrase train model config ###
###########################################

# temp file path
param="Lexical-Table"                value="lex"
param="Extract-Phrase-Pairs"         value="extractPhrasePairs"

# phrase table parameters
param="Max-Source-Phrase-Size"       value="3"                          # number greater than 0
param="Max-Target-Phrase-Size"       value="5"                          # number greater than 0
param="Phrase-Cut-Off"               value="0"                          # number not less than 0

# phrase translation model
param="Phrase-Table"                 value="phrase.translation.table"

# maxent lexicalized reordering model
param="ME-max-src-phrase-len"        value="3"                          # > 0 or = -1 (unlimited)
param="ME-max-tar-phrase-len"        value="5"                          # > 0 or = -1 (unlimited)
param="ME-null-algn-word-num"        value="1"                          # >= 0 or = -1 (unlimited)
param="ME-use-src-parse-pruning"     value="0"                          # "0" or "1"
param="ME-src-parse-path"            value="/path/to/src-parse/"        # source parses (one parse per line)
param="ME-max-sample-num"            value="5000000"                    # number greater than 0 or "-1" (unlimited)
param="ME-Reordering-Table"          value="me.reordering.table"

# msd lexicalized reordering model
param="MSD-model-type"               value="1"                          # "1", "2" or "3"
param="MSD-filter-method"            value="tran-table"                 # "tran-table" or "msd-sum-1"
param="MSD-max-phrase-len"           value="7"                          # number greater than 0
param="MSD-Reordering-Model"         value="msd.reordering.table"

Useful Functions and Tips

The following information is provided for your reference:
How to generate N-best translation results
How to enlarge the beam width
Which pruning methods are used
How to speed up the decoder
How many reference translations can be used
How to use a higher-order n-gram language model
How to control the size of the phrase translation table
How to train the ME reordering model on more data
How to train the MSD reordering model on more data
How to use user-defined features
How to add extra translation rules to the decoder

  • How to generate N-best translation results
    It can be trivially done by setting the parameter "nbest" defined in "NiuTrans.phrase.user.config". E.g. if you want to generate a list of 50-best translations, you can modify "NiuTrans.phrase.user.config" as follows:

    # how many translations are dumped
    param="nbest"                     value="50"
    
  • How to enlarge the beam width
    In the NiuTrans system, the beam width is controlled by the parameter "beamsize" defined in "NiuTrans.phrase.user.config". E.g. if you wish to choose a beam of width 100, you can modify "NiuTrans.phrase.user.config" as follows:

    # beam size (or beam width)
    param="beamsize"                     value="100"
    
  • Which pruning methods are used
    The current version supports two pruning methods: punctuation pruning and cube pruning. The first method divides the input sentence into several segments according to punctuation marks (such as commas). Decoding is then performed on each segment individually. Finally, the translation is generated by gluing together the translations of these segments (see the sketch after the configuration example below). The second method can be regarded as an instance of heuristic search. Here we re-implement the method described in (Chiang, 2007).
        To activate the two pruning techniques, users can fire the triggers "usepuncpruning" and "usecubepruning" defined in "NiuTrans.phrase.user.config". Of course, each of them can be set individually.

    # use punctuation pruning (1) or not (0)
    param="usepuncpruning"               value="1"
    
    # use cube-pruning (1) or not (0)
    param="usecubepruning"               value="1"
    

  • How to speed up the decoder
    A straightforward solution is pruning. As described above, punctuation pruning and/or cube pruning can be employed to speed up the system. By default both of them are activated in our system (on Chinese-English translation tasks, they generally lead to a 10-fold speed improvement). Also, the multi-thread running mode can make the system faster if more than one CPU/core is available. To run the system on multiple threads, users can use the parameter "nthread" defined in "NiuTrans.phrase.user.config". E.g. if you want to run the decoder with 6 threads, you can set "nthread" like this:

    # number of threads
    param="nthread"                      value="6"
    

    To further speed up the system, another obvious solution is to filter the translation table and the reordering model using the input sentences. This feature will be supported in a later version of the system.
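
    Until such filtering is built in, users who need the extra speed can pre-filter the phrase table themselves. Below is a minimal Python sketch of this idea (an assumption about the workflow, not a NiuTrans tool; "test.txt" and the output file name are placeholders). It keeps only the entries whose source side occurs in the text to be translated.

    # collect all source n-grams (up to "Max-Source-Phrase-Size" words) of the test text
    max_src_len = 3
    ngrams = set()
    with open('test.txt', encoding='utf-8') as f:
        for line in f:
            words = line.split()
            for i in range(len(words)):
                for j in range(i + 1, min(i + max_src_len, len(words)) + 1):
                    ngrams.add(' '.join(words[i:j]))

    # keep only the phrase-table entries covered by the test text
    with open('phrase.translation.table', encoding='utf-8') as fin, \
         open('phrase.translation.table.filtered', 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.split(' ||| ')[0] in ngrams:
                fout.write(line)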

  • How many reference translations can be used
    The NiuTrans system does not impose any upper limit on the number of reference translations used in either weight tuning or evaluation. E.g. if you want to use 3 references for weight tuning, you can format your tuning data file as follows (note that "#" indicates a comment here and SHOULD NOT appear in your file).

    澳洲 重新 开放 驻 马尼拉 大使馆               # sentence-1
                                                  # a blank line
    australia reopens embassy in manila           # the 1st reference translation
    australia reopened manila embassy             # the 2nd reference translation
    australia reopens its embassy to manila       # the 3rd reference translation
    澳洲 是 与 北韩 有邦交 的 少数 国家 之 .      # sentence-2
    ...

    Then set the "-nref" accordingly. For weight tuning (Note: "-nref 3"),

    $> perl NiuTrans-mert-model.pl \
            -dev   ../sample-data/sample-submission-version/Dev-set/Niu.dev.txt \
            -c     ../work/NiuTrans.phrase.user.config \
            -nref  3 \
            -r     3 \
            -l     ../work/mert-model.log
    

    For evaluation (Note: "-rnum 3"),

    ...
    $> perl NiuTrans-generate-xml-for-mteval.pl \
            -1f    1best.out \
            -tf    test-ref.txt \
            -rnum  3
    ...
    
  • How to use a higher-order n-gram language model
    You first need to choose the order of the n-gram language model. E.g. if you prefer a 5-gram language model, you can type the following command to train the LM (NOTE: "-n 5").

    $> ../bin/NiuTrans.LMTrainer \
            -t  sample-submission-version/LM-training-set/e.lm.txt -n 5 \
            -v  lm.vocab \
            -m  lm.trie.data
    

    Then set the config file accordingly

    $> cd scripts/
    $> perl NiuTrans-generate-mert-config.pl \
            -tmdir  ../work/model/ \
            -lmdir  ../work/lm/ \
            -ngram  5 \
            -o      ../work/NiuTrans.phrase.user.config
    
  • How to control the size of the phrase translation table
    To avoid an extremely large phrase table, "\config\NiuTrans.phrase.train.model.config" defines two parameters, "Max-Source-Phrase-Size" and "Max-Target-Phrase-Size", which control the maximum number of words on the source-side and target-side of a phrase-pair, respectively. Both parameters greatly affect the number of resulting phrase-pairs. Note that, although extracting longer phrases can increase the coverage of the phrase table, it does not always improve BLEU, due to data sparseness.
        Another way to reduce the size of the phrase table is to throw away low-frequency phrases. This can be done with the parameter "Phrase-Cut-Off" defined in "\config\NiuTrans.phrase.train.model.config". When "Phrase-Cut-Off" is set to n, all phrases appearing n times or fewer are discarded by the NiuTrans system.
        E.g. the example below shows how to obtain a phrase table of a reasonable size. In this setting, the maximum numbers of source words and target words are set to 3 and 5, respectively. Moreover, all phrases with frequency 1 are filtered out.

    param="Max-Source-Phrase-Size"       value="3"
    param="Max-Target-Phrase-Size"       value="5"
    param="Phrase-Cut-Off"               value="1"
    
  • How to train the ME reordering model on more data
    We follow the work of (Xiong et al., 2006) in designing the ME-based lexicalized reordering model. In general, the size of the (ME-based) reordering model increases greatly as more training data is involved. Thus several parameters are defined to control the size of the resulting model. They can be found in the configuration file "\config\NiuTrans.phrase.train.model.config" and start with the prefix "ME-".
        1. "ME-max-src-phrase-len" and "ME-max-tar-phrase-len" restrict the maximum number of words appearing in the source-side and target-side phrase, respectively. Obviously, larger values of "ME-max-src-phrase-len" (or "ME-max-tar-phrase-len") result in a larger model.
        2. "ME-null-algn-word-num" limits the number of unaligned target words that appear between two adjacent blocks.
        3. "ME-use-src-parse-pruning" is a trigger, and indicates using source-side parse to constraint the training sample extraction. In our in-house experiments, using source-side parse as constraints can greatly reduce the size of resulting model but does not lose BLEU score significantly.
        4. "ME-src-parse-path" specifies the file of source parses (one parse per line). It is meaningful only when "ME-use-src-parse-pruning" is turned on.
        5. "ME-max-sample-num" limits the maximum number of extracted samples. Because the ME trainer "maxent(.exe)" cannot work on a very large number of training samples, controlling the maximum number of extracted samples is a reasonable way to avoid the unacceptable training time and memory cost. By default, "ME-max-sample-num" is set to 5000000 in the NiuTrans system. This setting means that only the first 5,000,000 samples affect the model training, and a too large training corpus does not actually result in a larger model.
        To train ME-based reordering model on a larger data set, it is recommended to set the above parameters as follows (for Chinese-to-English translation tasks). Note that this setting requires users to provide the source-side parse trees for pruning.

    param="ME-max-src-phrase-len"        value="3"
    param="ME-max-tar-phrase-len"        value="5"
    param="ME-null-algn-word-num"        value="1"
    param="ME-use-src-parse-pruning"     value="1"                      # if you have source parses
    param="ME-src-parse-path"            value="/path/to/src-parse/"
    param="ME-max-sample-num"            value="-1"                     # depends on how large your corpus is
                                                                        # can be set to a positive number as needed
  • How to train the MSD reordering model on more data
    It is worth pointing out that the NiuTrans system has three models for calculating the M, S and D probabilities. Users can choose one of them with the parameter "MSD-model-type". When "MSD-model-type" is set to "1", the MSD reordering is modeled on the word level, as in the Moses system. In addition to this basic model, the phrase-based MSD model and the hierarchical phrase-based MSD model (Galley et al., 2008) are also implemented. They can be used when "MSD-model-type" is set to "2" or "3".
        When trained on a large corpus, the MSD model might be very large. The situation is even more severe when model 3 is involved. To alleviate this problem, users can use the parameter "MSD-filter-method", which filters the MSD model using the phrase translation table (any entry that is not covered by the phrase table is excluded).
        Also, users can use the parameter "MSD-max-phrase-len" to limit the maximum number of words in a source or target phrase. This parameter can effectively limit the size of the generated MSD model.
        Below is an example that restricts the MSD model to an acceptable size.

    param="MSD-model-type"               value="1"                             # "1", "2" or "3"
    param="MSD-filter-method"            value="tran-table"                    # "tran-table" or "msd-sum-1"
    param="MSD-max-phrase-len"           value="7"                             # number greater than 0
    
  • How to use user-defined features
    The NiuTrans system allows users to add self-developed features to the phrase translation table. In the default setting, each entry in the translation table is associated with 6 features. E.g. below is a sample table ("phrase.translation.table"), where each entry is coupled with a 6-dimensional feature vector.

    ...
    一定 ||| must ||| -2.35374 -2.90407 -1.60161 -2.12482 1 0
    一定 ||| a certain ||| -2.83659 -1.07536 -4.97444 -1.90004 1 0
    一定 ||| be ||| -4.0444 -5.74325 -2.32375 -4.46486 1 0
    一定 ||| be sure ||| -4.21145 -1.3278 -5.75147 -3.32514 1 0
    一定 ||| ' ll ||| -5.10527 -5.32301 -8.64566 -4.80402 1 0
    ...

    To add new features into the table, users can append them to these feature vectors. E.g. suppose that we wish to add a feature that indicates whether the phrase pair is low-frequency in the training data (appears only once) or not (appears two times or more). We can update the above table, as follows:

    ...
    一定 ||| must ||| -2.35374 -2.90407 -1.60161 -2.12482 1 0 0
    一定 ||| a certain ||| -2.83659 -1.07536 -4.97444 -1.90004 1 0 0
    一定 ||| be ||| -4.0444 -5.74325 -2.32375 -4.46486 1 0 0
    一定 ||| be sure ||| -4.21145 -1.3278 -5.75147 -3.32514 1 0 0
    一定 ||| ' ll ||| -5.10527 -5.32301 -8.64566 -4.80402 1 0 1
    ...

    We then modify the config file "NiuTrans.phrase.user.config" to activate the newly-introduced feature.

    param="freefeature"                   value="1"
    param="tablefeatnum"                  value="7"
    

    where "freefeature" is a trigger that activates the use of additional features. "tablefeatnum" sets the number of features defined in the table.

  • How to add extra translation rules to the decoder
    The NiuTrans system also defines some special markups to support this feature. E.g. below is a sample sentence to be decoded.

    彼得泰勒 是 一名 英国 资深 金融 分析师 .
    (Peter Taylor is a senior financial analyst at UK .)

    If you have prior knowledge about how to translate "彼得泰勒" and "英国", you can add your own translations using the special markups:

    彼得泰勒 是 一名 英国 资深 金融 分析师 . |||| {0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒} \
    {3 ||| 3 ||| UK ||| $ne ||| 英国}

    where "||||" is a separator, "{0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒}" and "{3 ||| 3 ||| UK ||| $ne ||| 英国}" are two user-defined translations. Each consists of 5 terms. The first two numbers indicate the span to be translated; the third term is the translation specified by users; the fourth term indicates the type of translation; and the last term repeats the corresponding word sequence. Note that "\" is used to ease the display here. Please remove "\" in you file, and use "彼得泰勒 是 一名 英国 资深 金融 分析师 . |||| {0 ||| 0 ||| Peter Taylor ||| $ne ||| 彼得泰勒}{3 ||| 3 ||| UK ||| $ne ||| 英国}" directly.