NLTK Spanish tagger results really bad?

Views: 73 | Posted: 2020/5/18 1:26:43 | Tags: python, nltk

Problem description

I'm trying to create a tagger performance comparison for Spanish. My current script is a modified version of this one, although I tried another version with very similar results.

I'm using the cess_esp corpus and have created a Unigram, Bigram, Trigram and Brill tagger for this corpus using the tagged sentences for training each of the taggers.

I'm concerned about the performance of the Bigram and Trigram taggers... judging from the results, they seem not to be working AT ALL.

For instance, here is some output from my script:

*************** START TAGGING FOR LINE 6 ****************************************************************************************************************************************

Current line contents before tagging-> mejor ve a la sucursal de Juan Pablo II es la que menos gente tiene y no te tardas nada

Unigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', 'aq0cs0'), ('ve', 'vmip3s0'), ('a', 'sps00'), ('la', 'da0fs0'), ('sucursal', 'ncfs000'), ('de', 'sps00'), ('Juan', 'np0000p'), ('Pablo', None), ('II', None), ('es', 'vsip3s0'), ('la', 'da0fs0'), ('que', 'pr0cn000'), ('menos', 'rg'), ('gente', 'ncfs000'), ('tiene', 'vmip3s0'), ('y', 'cc'), ('no', 'rn'), ('te', 'pp2cs000'), ('tardas', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]

Trigram tagger-> [('@yadimota', None), ('@ContactoBanamex', None), ('mejor', None), ('ve', None), ('a', None), ('la', None), ('sucursal', None), ('de', None), ('Juan', None), ('Pablo', None), ('II', None), ('es', None), ('la', None), ('que', None), ('menos', None), ('gente', None), ('tiene', None), ('y', None), ('no', None), ('te', None), ('tardas', None), ('nada', None)]
****************************************************************************************************************************************

*************** START TAGGING FOR LINE 7 ****************************************************************************************************************************************

Current line contents before tagging-> He levantado ya varios reporte pero no resuelven nada

Unigram tagger-> [('He', 'vaip1s0'), ('levantado', 'vmp00sm'), ('ya', 'rg'), ('varios', 'di0mp0'), ('reporte', 'vmsp1s0'), ('pero', 'cc'), ('no', 'rn'), ('resuelven', None), ('nada', 'pi0cs000')]

Bigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

Trigram tagger-> [('He', None), ('levantado', None), ('ya', None), ('varios', None), ('reporte', None), ('pero', None), ('no', None), ('resuelven', None), ('nada', None)]

*************** START TAGGING FOR LINE 8 ****************************************************************************************************************************************

Current line contents before tagging-> Es lamentable el servicio que brindan

Unigram tagger-> [('@ContactoBanamex', None), ('Es', 'vsip3s0'), ('lamentable', 'aq0cs0'), ('el', 'da0ms0'), ('servicio', 'ncms000'), ('que', 'pr0cn000'), ('brindan', None)]

Bigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]

Trigram tagger-> [('@ContactoBanamex', None), ('Es', None), ('lamentable', None), ('el', None), ('servicio', None), ('que', None), ('brindan', None)]
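The all-None pattern in the output above is consistent with how an n-gram tagger without backoff generally behaves: its lookup key includes the tags of the preceding tokens, so once one token gets None, the following contexts contain None and were never seen in training either. The block below is a toy pure-Python model of that mechanism, not NLTK's implementation; `train_bigram`, `tag_bigram`, the `"<s>"` start marker and the shortened tags are hypothetical names chosen for illustration, and this may explain only part of the behaviour seen here.

```python
from collections import Counter, defaultdict


def train_bigram(tagged_sents):
    """Map (previous_tag, word) -> most frequent tag seen in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"  # hypothetical sentence-start marker
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_bigram(model, words):
    """Tag left to right with no backoff; an unseen context yields None."""
    prev, out = "<s>", []
    for word in words:
        tag = model.get((prev, word))
        out.append((word, tag))
        prev = tag  # a None tag is carried into the next context
    return out


# One hand-made training sentence with shortened, illustrative tags.
train = [[("el", "da"), ("servicio", "nc"), ("es", "vs"), ("lamentable", "aq")]]
model = train_bigram(train)

# Every context was seen in training, so every token gets a tag.
seen = tag_bigram(model, ["el", "servicio", "es", "lamentable"])
# An unknown first token (such as a Twitter mention) breaks the chain.
unseen_first = tag_bigram(model, ["@user", "el", "servicio", "es", "lamentable"])

print(seen)
print(unseen_first)
```

In this toy model a single unknown token at the start of the line is enough to leave every later token untagged, even though all of those words occur in the training data.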

The bigram and trigram taggers are trained as shown in the linked script, which is, by the way, the more straightforward way, as described in the NLTK book:

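from nltk.corpus import cess_esp as cess
from nltk import BigramTagger as bt
from nltk import TrigramTagger as tt

# Tagged sentences from the cess_esp corpus, used as training data.
cess_sents = cess.tagged_sents()

# Train the BigramTagger (no backoff tagger is passed).
bi_tag = bt(cess_sents)

# Train the TrigramTagger (no backoff tagger is passed).
tri_tag = tt(cess_sents)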

Any idea if I'm missing something here? Aren't bigram and trigram taggers supposed to perform better than unigram? Should I always use a backoff tagger for bigram and trigram?
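For reference, a minimal sketch of what a backoff chain could look like, using the `backoff` parameter that NLTK's sequential taggers accept. It assumes a tiny hand-made training set in place of `cess.tagged_sents()` so that it runs without downloading the corpus, and the default tag `"UNK"` is an arbitrary placeholder, not a cess_esp tag.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

# Hypothetical stand-in for cess.tagged_sents(): two short tagged sentences.
train = [
    [("el", "da0ms0"), ("servicio", "ncms000"), ("es", "vsip3s0"), ("lamentable", "aq0cs0")],
    [("la", "da0fs0"), ("sucursal", "ncfs000"), ("es", "vsip3s0"), ("mejor", "aq0cs0")],
]

# Bigram tagger trained on its own, as in the script above.
plain_bi = BigramTagger(train)

# Backoff chain: trigram -> bigram -> unigram -> default tag.
default = DefaultTagger("UNK")  # placeholder tag for unknown words (assumption)
uni = UnigramTagger(train, backoff=default)
bi = BigramTagger(train, backoff=uni)
tri = TrigramTagger(train, backoff=bi)

tokens = ["@user", "el", "servicio", "es", "mejor"]

plain_result = plain_bi.tag(tokens)  # no backoff
chain_result = tri.tag(tokens)       # with backoff chain

print(plain_result)
print(chain_result)
```

On this toy data the tagger without backoff returns None for every token once it meets the unknown `@user`, while the chained tagger falls back to the unigram tag for known words and to the placeholder for the unknown one. Whether the same setup improves accuracy on the real corpus would have to be measured on held-out sentences.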

Thanks! Alejandro
