Is there a tutorial about GIZA++?
Question
The URLs in its README file are not valid (http://www.fjoch.com/mkcls.html and http://www.fjoch.com/GIZA++.html). Is there a good tutorial about GIZA++? Or are there alternatives that have complete documentation?
Answer
The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)
- Start with two data files containing parallel sentences that have already been tokenized, one sentence per line. For example, a pair of parallel English-French files might look like the following.

Example 1: train.en
I gave him the book .
He read the book .
He loved the book .
Example 2: train.fr
Je lui ai donne/ le livre .
Il a lu le livre .
Il aimait le livre .
- Run these files through plain2snt.out to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt).
From the GIZA++ directory, run:
./plain2snt.out TEXT1 TEXT2
where TEXT1 and TEXT2 are the data files described in step 1.
This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory):
- TEXT1_TEXT2.snt
- TEXT1.vcb
- TEXT2_TEXT1.snt
- TEXT2.vcb
The vocab files contain a unique (integer) ID for each word in the text (NB: not tokenized/lemmatized), the word/string, and the number of times that string occurred. These are separated by a single space character.
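As a sketch of this "ID word count" layout, a minimal parser might look like the following. The sample contents are hypothetical, not taken from an actual GIZA++ run:

```python
# Minimal sketch of parsing a GIZA++ *.vcb file, assuming the
# "ID word count" format described above, one entry per line.
# The sample data below is made up for illustration.
sample_vcb = """\
2 the 3
3 book 3
4 I 1
"""

def parse_vcb(text):
    """Map integer word ID -> (word, occurrence count)."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        wid, word, count = line.split(" ")
        vocab[int(wid)] = (word, int(count))
    return vocab

vocab = parse_vcb(sample_vcb)
print(vocab[2])  # ('the', 3)
```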
The sentence files contain numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus, and the second and third are strings of (space-separated) numbers corresponding to the word entries in the vocab files. Based on the naming convention for *.snt files, the first file is assumed to be the source language and the second the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
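The three-line-per-pair layout can be sketched in code as well. The vocab dictionaries and .snt lines below are hypothetical stand-ins matching the train.en/train.fr example from step 1, not real GIZA++ output:

```python
# Sketch: decode one sentence pair from a GIZA++ *.snt file using the
# two vocab files. All data here is hypothetical, built to match the
# example corpus above (source = train.en, target = train.fr).
src_vocab = {2: "I", 3: "gave", 4: "him", 5: "the", 6: "book", 7: "."}
tgt_vocab = {2: "Je", 3: "lui", 4: "ai", 5: "donne/", 6: "le", 7: "livre", 8: "."}

snt_lines = [
    "1",              # line 1: this pair occurs once in the corpus
    "2 3 4 5 6 7",    # line 2: source sentence as TEXT1.vcb word IDs
    "2 3 4 5 6 7 8",  # line 3: target sentence as TEXT2.vcb word IDs
]

count = int(snt_lines[0])
source = [src_vocab[int(i)] for i in snt_lines[1].split()]
target = [tgt_vocab[int(i)] for i in snt_lines[2].split()]
print(count, " ".join(source), "|", " ".join(target))
```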
- Now TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.
For example:
./GIZA++ -s TEXT1.vcb -t TEXT2.vcb -c TEXT1_TEXT2.snt
But note that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.