Is there a tutorial about giza++?
Question
The URLs in its README file are not valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about giza++? Or are there alternatives that have complete documentation?
Recommended answer
The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)
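- Start with two data files containing tokenized parallel sentences, one sentence per line. For example, a parallel pair of English and French files might look like this.
Example 1 - train.en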
I gave him the book .
He read the book .
He loved the book .
Example 2 - train.fr
Je lui ai donné le livre .
Il a lu le livre .
Il aimait le livre .
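Before going further, it can help to confirm that the two files really are parallel, i.e. that they have the same number of lines and no empty sentences. Below is a minimal Python sketch of such a check. The helper name check_parallel is hypothetical and not part of GIZA++, and the file names are simply the ones used in the examples above.

```python
from pathlib import Path


def check_parallel(src_path, tgt_path):
    """Return the number of sentence pairs, or raise ValueError if the files are not parallel."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    for number, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        # A blank line on either side leaves that sentence pair unusable.
        if not src.split() or not tgt.split():
            raise ValueError(f"empty sentence on line {number}")
    return len(src_lines)
```

With the two example files above saved in the current directory, check_parallel("train.en", "train.fr") would return 3.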
- Run these files through plain2snt.out to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt).
From the GIZA++ directory, run:
./plain2snt.out TEXT1 TEXT2
where TEXT1 and TEXT2 are the data files described in the first step.
This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory):
- TEXT1_TEXT2.snt
- TEXT1.vcb
- TEXT2_TEXT1.snt
- TEXT2.vcb
The vocab files contain, for each word in the text, a unique integer ID, the word/string itself, and the number of times that string occurred. These fields are separated by a single space character. (NB: words are used as they appear in the file; nothing is tokenized or lemmatized at this step.)
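To make that layout concrete, here is a minimal Python sketch that reads lines in the "ID word count" form described above. The sample lines are made up for illustration and are not real plain2snt.out output, and parse_vcb is a hypothetical helper name.

```python
def parse_vcb(lines):
    """Map each integer word ID to a (word, count) tuple."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Fields are space-separated: ID, word, frequency.
        word_id, word, count = line.split()
        vocab[int(word_id)] = (word, int(count))
    return vocab


# Illustrative content only; the actual IDs and ordering depend on the tool.
sample = [
    "2 He 2",
    "3 the 3",
    "4 book 3",
]
vocab = parse_vcb(sample)
print(vocab[4])  # ('book', 3)
```

For a real file, the same function could be fed an open file handle, e.g. parse_vcb(open("TEXT1.vcb", encoding="utf-8")).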
The sentence files contain only numbers. For each sentence pair there are three lines: the first is the number of times that sentence pair occurs in the corpus, and the second and third are strings of space-separated numbers corresponding to the word entries in the vocab files. Following the naming convention for *.snt files, the first file named is taken to be the source language and the second the target language. For example, in TEXT1_TEXT2.snt, the first line is the number of times the first sentence pair occurred in the corpus, the second line is a string of numbers corresponding to words in TEXT1.vcb, and the third line is a string of numbers corresponding to words in TEXT2.vcb.
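The three-line record layout can be read back with a few lines of Python. This is only a sketch under the description above: the ID-to-word tables and the record are invented for illustration, the names read_snt and decode are hypothetical, and the reader assumes no sentence is empty.

```python
def read_snt(lines):
    """Yield (count, source_ids, target_ids) for each three-line record."""
    # Blank lines are dropped, so this assumes no sentence is empty.
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    for i in range(0, len(rows), 3):
        count = int(rows[i])
        source_ids = [int(tok) for tok in rows[i + 1].split()]
        target_ids = [int(tok) for tok in rows[i + 2].split()]
        yield count, source_ids, target_ids


def decode(ids, id_to_word):
    """Turn a list of word IDs back into a space-joined sentence."""
    return " ".join(id_to_word[i] for i in ids)


# Hypothetical ID-to-word tables and one record, for illustration only.
source_words = {2: "He", 3: "read", 4: "the", 5: "book", 6: "."}
target_words = {2: "Il", 3: "a", 4: "lu", 5: "le", 6: "livre", 7: "."}
record = ["1", "2 3 4 5 6", "2 3 4 5 6 7"]

for count, src, tgt in read_snt(record):
    print(count, decode(src, source_words), "|", decode(tgt, target_words))
# prints: 1 He read the book . | Il a lu le livre .
```

Decoding a few records this way is a quick sanity check that the source and target sides ended up in the order the file name suggests.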
- Now TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.
For example:
./GIZA++ -s TEXT1.vcb -t TEXT2.vcb -c TEXT1_TEXT2.snt
Note, however, that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in its name before I got any proper output.
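The rename workaround and the call can be scripted. The Python sketch below is one possible way to do it, not an official procedure: both helper names are hypothetical, the copy simply drops underscores from the file name, and the only GIZA++ options used are the -s, -t and -c flags from the command above. The command is built but not executed here.

```python
import shutil
from pathlib import Path


def copy_without_underscores(snt_path):
    """Copy an .snt file to a sibling whose name has no underscores; return the new path."""
    src = Path(snt_path)
    # Only the file name is changed, never the directory part.
    dst = src.with_name(src.name.replace("_", ""))
    if dst != src:
        shutil.copyfile(src, dst)
    return dst


def giza_command(source_vcb, target_vcb, snt_path, giza_bin="./GIZA++"):
    """Build the argument list for the GIZA++ call shown above (not executed here)."""
    return [giza_bin, "-s", str(source_vcb), "-t", str(target_vcb), "-c", str(snt_path)]
```

A typical use would be to call copy_without_underscores("TEXT1_TEXT2.snt"), which yields TEXT1TEXT2.snt, and then pass the list from giza_command to subprocess.run(..., check=True) from inside the GIZA++ directory.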