NLTK: Tagging Spanish words using a corpus

Question

I am trying to learn how to tag Spanish words using NLTK.

With the examples in the NLTK book, it is quite easy to tag English words. Because I am new to NLTK and to language processing in general, I am quite confused about how to proceed.

I have downloaded the cess_esp corpus. Is there a way to specify a corpus in nltk.pos_tag? I looked at the pos_tag documentation and didn't see anything that suggested I could. I feel like I'm missing some key concepts. Do I have to manually tag the words in my text against the cess_esp corpus? (By manually I mean tokenize my sentence and run it against the corpus.) Or am I off the mark entirely? Thank you.

Recommended answer

First you need to read the tagged sentences from a corpus. NLTK provides a nice interface so that you don't have to bother with the different formats of the different corpora; you can simply import the corpus and use the corpus object's functions to access the data. See http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml .

Then you have to choose a tagger and train it. There are fancier options, but you can start with the N-gram taggers.
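
One common refinement of the N-gram taggers is to chain them with backoff, so that a tagger with no answer of its own hands the token to a simpler one. The following is a minimal sketch of that idea, not part of the original answer. It assumes a tiny hand-made training set with simplified placeholder tags (hypothetical, not real cess_esp tags) so that it runs without downloading any corpus.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Hypothetical toy training data in the same shape as cess.tagged_sents():
# a list of sentences, each a list of (word, tag) pairs. The tags are
# simplified placeholders, not real cess_esp tags.
train_sents = [
    [("el", "da"), ("gato", "nc"), ("come", "vm")],
    [("la", "da"), ("casa", "nc"), ("es", "vs"), ("grande", "aq")],
]

# Each tagger defers to its backoff when it has no answer of its own.
default_tagger = DefaultTagger("nc")  # last resort: guess "noun"
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

# "foo" never appears in the training data, so it falls through
# to the default tagger instead of being tagged None.
tagged = bigram_tagger.tag(["el", "gato", "foo"])
print(tagged)
```

Without a backoff, an N-gram tagger assigns None to any token whose context it has not seen, and a bigram tagger then tends to fail on the following tokens as well, because their contexts now contain that None tag.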

Then you can use the tagger to tag the sentence you want. Here's some example code:

from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

# Load the tagged sentences from the corpus.
# Each entry is one sentence: a list of (word, tag) pairs.
cess_sents = cess.tagged_sents()

# Train the unigram tagger on the whole corpus.
uni_tag = ut(cess_sents)

sentence = "Hola , esta foo bar ."

# A tagger takes a list of tokens, not a raw string.
print(uni_tag.tag(sentence.split(" ")))

# Split the corpus into a training and a testing set.
train = int(len(cess_sents) * 0.9)  # index where the first 90% ends

# Train a bigram tagger on the training data only.
bi_tag = bt(cess_sents[:train])

# Evaluate on the remaining 10% (newer NLTK releases name this method accuracy).
print(bi_tag.evaluate(cess_sents[train:]))

# Use the tagger.
print(bi_tag.tag(sentence.split(" ")))
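
The train/test split above can also be checked by hand, which avoids depending on the name of the evaluation method in a given NLTK version. The following is a rough sketch under stated assumptions: the toy data and simplified tags are hypothetical stand-ins for cess.tagged_sents(), and token_accuracy is a helper written for this illustration, not an NLTK function.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical toy data standing in for cess.tagged_sents();
# the tags are simplified placeholders, not real cess_esp tags.
tagged_sents = [
    [("el", "da"), ("gato", "nc"), ("duerme", "vm")],
    [("la", "da"), ("casa", "nc"), ("es", "vs"), ("grande", "aq")],
    [("el", "da"), ("perro", "nc"), ("come", "vm")],
    [("la", "da"), ("gata", "nc"), ("corre", "vm")],
]

# Keep the first 75% for training and the rest for testing.
# Slicing with [:split] and [split:] leaves no sentence out.
split = int(len(tagged_sents) * 0.75)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

tagger = UnigramTagger(train_sents, backoff=DefaultTagger("nc"))


def token_accuracy(tagger, gold_sents):
    """Hypothetical helper: fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += int(gold_tag == predicted_tag)
            total += 1
    return correct / total if total else 0.0


accuracy = token_accuracy(tagger, test_sents)
print(accuracy)
```

In the held-out sentence, "la" was seen in training, "gata" is unseen but happens to match the default tag, and "corre" is unseen and gets the wrong default tag, so two of the three tokens are tagged correctly.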

Training a tagger on a large corpus may take a significant amount of time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later reuse.

Please look at the Storing Taggers section in http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
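
As a minimal sketch of saving and reloading a trained tagger, the example below uses Python's standard pickle module. The file name and the one-sentence toy training set are hypothetical, chosen only so the example runs without downloading a corpus; in practice the tagger would be trained on cess.tagged_sents().

```python
import os
import pickle
import tempfile

from nltk.tag import UnigramTagger

# Hypothetical toy training data with simplified placeholder tags.
train_sents = [[("el", "da"), ("gato", "nc"), ("come", "vm")]]
tagger = UnigramTagger(train_sents)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "spanish_unigram.pkl")

# Save the trained tagger to disk.
with open(path, "wb") as f:
    pickle.dump(tagger, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, possibly in another process: load it back without retraining.
with open(path, "rb") as f:
    loaded_tagger = pickle.load(f)

restored = loaded_tagger.tag(["el", "gato"])
print(restored)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.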
