nltk pos tagger looks to incorporate '.'

Problem description

I am new to Python, NLP, and NLTK, so please bear with me. I have a handful of articles (~200) that have been categorized by hand. I am looking to develop a heuristic to assist with/recommend categories. To start, I was hoping to build a relationship between the current categories and the words within each document.

My premise is that nouns are more important to the category than any other part of speech. For example, the category "Energy" is probably driven almost entirely by nouns: oil, battery, wind, etc.
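As a rough sketch of that premise (not from the original post), one could count noun frequencies per hand-assigned category; the `categorized_articles` list of (category, text) pairs below is a hypothetical stand-in for the real data, and the punkt and POS-tagger models are assumed to have been fetched via nltk.download beforehand:

from collections import Counter, defaultdict
import nltk

# Hypothetical stand-in for the ~200 hand-categorized articles.
categorized_articles = [
    ("Energy", "Oil prices rose as wind and battery supply fell."),
    ("Energy", "The battery market kept growing."),
]

noun_counts = defaultdict(Counter)
for category, text in categorized_articles:
    for sent in nltk.sent_tokenize(text):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            if tag.startswith('NN'):  # NN, NNS, NNP, NNPS
                noun_counts[category][word.lower()] += 1

print(noun_counts["Energy"].most_common(5))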

The first thing I wanted to do was tag the parts of speech and evaluate them. On the first article I encountered some issues: some of the tokens are bound to punctuation.

import nltk

for article in articles[1]:
    article_id, content = article
    # nltk.clean_html() is NLTK 2.x-era; NLTK 3 removed it and points you
    # to an HTML parser such as BeautifulSoup instead.
    clean = nltk.clean_html(content).replace('’', "'")
    tokens = nltk.word_tokenize(clean)
    pos_document = nltk.pos_tag(tokens)
    pos = {}
    # group the words by their POS tag
    for word, part in pos_document:
        if part in pos:  # dict.has_key() was removed in Python 3
            pos[part].append(word)
        else:
            pos[part] = [word]

Formatted output:

{
'VBG': ['continuing', 'paying', 'falling', 'starting'],
'VBD': ['made', 'ended'],
'VBN': ['been', 'leaned', 'been', 'been'],
'VBP': ['know', 'hasn', 'have', 'continue', 'expect', 'take', 'see', 'have', 'are'],
'WDT': ['which', 'which'],
'JJ': ['negative', 'positive', 'top', 'modest', 'negative', 'real', 'financial', 'isn', 'important', 'long', 'short', 'next'],
'VBZ': ['is', 'has', 'is', 'leads', 'is', 'is'],
'DT': ['Another', 'the', 'the', 'any', 'any', 'the', 'the', 'a', 'the', 'the', 'the', 'the', 'a', 'the', 'a', 'a', 'the', 'a', 'the', 'any'],
'RP': ['back'],
'NN': ['listless', 'day', 'rsquo', 'll', 'progress', 'rsquo', 't', 'news', 'season', 'corner', 'surprise', 'stock', 'line', 'growth', 'question',
       'stop', 'engineering', 'growth', 'isn', 'rsquo', 't', 'rsquo', 't', 'stock', 'market', 'look', 'junk', 'bond', 'market', 'turning', 'junk',
       'rock', 'history', 'guide', 't', 'day', '%', '%', '%', 'level', 'move', 'isn', 'rsquo', 't', 'indication', 'way'],
',': [',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ','],
'.': ['.'],
'TO': ['to', 'to', 'to', 'to', 'to', 'to', 'to'],
'PRP': ['them', 'they', 'they', 'we', 'you', 'they', 'it'],
'RB': ['then', 'there', 'just', 'just', 'always', 'so', 'so', 'only', 'there', 'right', 'there', 'much', 'typically', 'far', 'certainly'],
':': [';', ';', ';', ';', ';', ';', ';'],
'NNS': ['folks', 'companies', 'estimates', 'covers', 's', 'equities', 'bonds', 'equities', 'flats'],
'NNP': ['drift.', 'We', 'Monday', 'DC', 'note.', 'Earnings', 'EPS', 'same.', 'The', 'Street', 'now.', 'Since', 'points.', 'What', 'behind.', 'We', 'flat.', 'The'],
'VB': ['get', 'manufacture', 'buy', 'boost', 'look', 'see', 'say', 'let', 'rsquo', 'rsquo', 'be', 'build', 'accelerate', 'be'],
'WRB': ['when', 'where'],
'CC': ['&', 'and', '&', 'and', 'and', 'or', 'and', '&', '&', '&', 'and', '&', 'and', 'but', '&'],
'CD': ['47', '23', '30'],
'EX': ['there'],
'IN': ['on', 'if', 'until', 'of', 'around', 'as', 'on', 'down', 'since', 'of', 'for', 'under', 'that', 'about', 'at', 'at', 'that', 'like', 'if'],
'MD': ['can', 'will', 'can', 'can', 'will'],
'JJR': ['more']
}

Notice the word 'drift.' under NNP: shouldn't the period be removed? Do I need to remove it on my own, or am I missing something with the library?

Recommended answer

NLTK's word tokenizer assumes that its input has already been separated into sentences. Therefore, to get it to work, you need to call sent_tokenize on your input first. You can use the output of sent_tokenize as the input to word_tokenize, but typically you would want to iterate over the sentences.

import nltk

for article in articles[1]:
    article_id, content = article
    clean = nltk.clean_html(content).replace('’', "'")
    # split into sentences first, so the word tokenizer sees one
    # sentence at a time
    sents = nltk.sent_tokenize(clean)
    pos = {}
    for sent in sents:
        tokens = nltk.word_tokenize(sent)
        pos_document = nltk.pos_tag(tokens)
        for word, part in pos_document:
            if part in pos:  # dict.has_key() was removed in Python 3
                pos[part].append(word)
            else:
                pos[part] = [word]

I believe the reason this is necessary is to help distinguish punctuation periods at the ends of sentences from periods used in abbreviations (i.e., you wouldn't want "Mr. Smith" to be broken into 'Mr', '.', 'Smith').
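A minimal sketch of the difference, running the same stock tokenizers on a made-up two-sentence string (assumes the punkt sentence model has been downloaded via nltk.download('punkt')):

import nltk

text = "Mr. Smith sold the stock. It had stopped its drift."

# Sentence-splitting first keeps the abbreviation period attached to 'Mr.'
# while the sentence-final periods become their own tokens.
for sent in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sent))
# ['Mr.', 'Smith', 'sold', 'the', 'stock', '.']
# ['It', 'had', 'stopped', 'its', 'drift', '.']

(Note that newer NLTK releases call sent_tokenize inside word_tokenize, so on a current install word_tokenize alone may already split off the trailing period correctly.)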
