Extracting sentences from a POS-tagged corpus with certain word, tag combos

Question

I'm playing with the Brown corpus, specifically the tagged sentences in the "news" category. I've found that "to" is the word with the most distinct tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of their tags; each one just needs to contain "to" with one of the associated POS tags.

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print(sent)

I tried the above code with just one of the POS tags to see if it worked, but it prints every matching sentence. I need it to print only the first sentence that matches the word and tag, and then stop. I tried this:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print(sent)
        if (word != 'to' and tag != 'IN'):
            break

This works with this POS tag because it's the first one related to "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break

It returns nothing. I think I am so close. Care to help?

Recommended answer

You can keep building on your current code, but it doesn't account for the following:

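  • What happens if "to" occurs more than once in a sentence, with the same or a different POS?
  • If "to" appears twice in a sentence with the same POS, do you want that sentence printed twice?
  • What happens if "to" is the first word of a sentence and is capitalized?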

If you want to stick with your code, try this:

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
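Note that the generator above yields every matching sentence, while the question asks for only the first one per tag. One possible way to get that is to pull a single item from the generator with `next()`. The sketch below is only an illustration: `toy_sents` is an invented stand-in for `brown.tagged_sents(categories="news")` so the code runs without the corpus, `first_sentence_per_tag` is a hypothetical helper name, and `to_pos_sent` is a variant that takes the sentences as an argument, lowercases the word, and yields each sentence at most once.

```python
# Toy stand-in for brown.tagged_sents(categories="news"); these sentences are invented.
toy_sents = [
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN")],
    [("She", "PPS"), ("wants", "VBZ"), ("to", "TO"), ("leave", "VB")],
    [("They", "PPSS"), ("drove", "VBD"), ("to", "IN"), ("work", "NN")],
]

def to_pos_sent(sents, pos):
    """Yield each sentence (once) in which 'to' carries the tag `pos`."""
    for sent in sents:
        if any(word.lower() == "to" and tag == pos for word, tag in sent):
            yield sent

def first_sentence_per_tag(sents, tags):
    """Map each tag to its first matching sentence, or None if there is none."""
    return {tag: next(to_pos_sent(sents, tag), None) for tag in tags}

first = first_sentence_per_tag(toy_sents, ["TO", "IN", "TO-HL"])
for tag, sent in first.items():
    print(tag, sent)
```

Because `next()` stops the generator after the first hit, the rest of the corpus is not scanned for that tag. The second argument to `next()` is the default returned when a tag never occurs with "to".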

I suggest storing the sentences in a defaultdict(list) so that you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for i, sent in enumerate(brown.tagged_sents(categories='news')):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower()=='to':
                # Record the sentence under this tag and count the occurrence
                sents_with_to[pos].append(sent)
                to_counts[pos]+=1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)

To access the sentences for a specific POS:

for sent in sents_with_to['TO']:
    print(sent)

You'll notice that if "to" with a specific POS appears twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try:

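# Join each (word, pos) pair as "word#pos", join the pairs with spaces, and let the set drop repeats.
sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])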
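If you'd rather keep the sentences as lists of (word, tag) pairs instead of flattened strings, one alternative is to use a tuple of each sentence as a set key and keep only the first occurrence, which also preserves the original order. This is a minimal sketch, not part of the original answer: `unique_sentences` is a hypothetical helper and the two sentences are invented examples.

```python
def unique_sentences(sents):
    """Return the sentences with repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for sent in sents:
        key = tuple(sent)  # a list is unhashable; a tuple of (word, tag) pairs is hashable
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

# Invented data: s1 was recorded twice because "to" tagged TO occurs twice in it.
s1 = [("To", "TO"), ("be", "BE"), ("or", "CC"), ("not", "*"), ("to", "TO"), ("be", "BE")]
s2 = [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("go", "VB")]
recorded = [s1, s1, s2]

deduped = unique_sentences(recorded)
print(len(recorded), len(deduped))
```

With the real data you would call it as `unique_sentences(sents_with_to['TO'])`.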
