NLTK Spanish tagger results really bad?


Question

I'm trying to create a tagger performance comparison for Spanish. My current script is a modified version of this one, although I tried another version with very similar results.

I'm using the cess_esp corpus and have created Unigram, Bigram, Trigram and Brill taggers for it, training each tagger on the corpus's tagged sentences.

I'm concerned about the performance of the Bigram and Trigram taggers: judging from the results, they seem not to be working AT ALL.

For instance, here is some output from my script:

*************** START TAGGING FOR LINE 6 ****************************************************************************************************************************************

Current line contents before tagging-> mejor ve a la sucursal de Juan Pablo II es la que menos gente tiene y no te tardas nada

Unigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', 'aq0cs0'), ('ve', 'vmip3s0'), ('a', 'sps00'), ('la', 'da0fs0'), ('sucursal', 'ncfs000'), ('de', 'sps00'), ('Juan', 'np0000p'), ('Pablo', None), ('II', None), ('es', 'vsip3s0'), ('la', 'da0fs0'), ('que', 'pr0cn000'), ('menos', 'rg'), ('gente', 'ncfs000'), ('tiene', 'vmip3s0'), ('y', 'cc'), ('no', 'rn'), ('te', 'pp2cs000'), ('tardas', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]

Trigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]
****************************************************************************************************************************************

*************** START TAGGING FOR LINE 7 ****************************************************************************************************************************************

Current line contents before tagging-> He levantado ya varios reporte pero no resuelven nada

Unigram tagger-> [('He', 'vaip1s0'), ('levantado', 'vmp00sm'), ('ya', 'rg'), ('varios', 'di0mp0'), ('reporte', 'vmsp1s0'), ('pero', 'cc'), ('no', 'rn'), ('resuelven', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

Trigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

*************** START TAGGING FOR LINE 8 ****************************************************************************************************************************************

Current line contents before tagging-> Es lamentable el servicio que brindan

Unigram tagger-> [('@ContactoBanamex', None), ('Es', 'vsip3s0'), ('lamentable', 'aq0cs0'), ('el', 'da0ms0'), ('servicio', 'ncms000'), ('que', 'pr0cn000'), ('brindan', None)]

Bigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

Trigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

The bigram and trigram taggers are trained as shown in the linked script, which is also the more straightforward way described in the NLTK book:

from nltk.corpus import cess_esp as cess
from nltk import BigramTagger as bt
from nltk import TrigramTagger as tt
cess_sents = cess.tagged_sents()
# Training BigramTagger.
bi_tag = bt(cess_sents)
#Training TrigramTagger
tri_tag = tt(cess_sents)

Any idea if I'm missing something here? Aren't bigram and trigram taggers supposed to perform better than unigram? Should I always use a backoff tagger for bigram and trigram?

Thanks! Alejandro

Recommended answer

The spaghetti-tagger (https://code.google.com/p/spaghetti-tagger/) was created as a simple tutorial on how to easily create scalable taggers using the NLTK corpus and tagging modules.

It is not meant to be a state-of-the-art system, as the site suggests. It is advisable to use a state-of-the-art tagger such as FreeLing (http://nlp.lsi.upc.edu/freeling/). I'll be happy to write a proper wrapper class in Python for FreeLing if you need it.

Back to your question: as Francis hinted (https://groups.google.com/forum/#!topic/nltk-users/FtqksaZLLvY), first go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html. You will then see that the backoff parameter might resolve your problem.

Disclaimer: I wrote spaghetti.py (https://spaghetti-tagger.googlecode.com/svn/spaghetti.py).
