Real word count in NLTK

Question

The NLTK book has a couple of examples of word counts, but in reality they are token counts, not word counts. For instance, the "Counting Vocabulary" section of Chapter 1 says that the following gives a word count:

text = nltk.Text(tokens)
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
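
To make the two mismatches concrete, here is a minimal sketch using a hand-written token list in place of real NLTK tokenizer output. The sample sentence and the names `tokens`, `naive_average`, and `true_average` are illustrative assumptions, not part of the original question.

```python
# Hypothetical sample data: a raw string and a hand-written token list
# standing in for what a tokenizer might produce.
string_of_text = "this is a sentence."
tokens = ['this', 'is', 'a', 'sentence', '.']

# Naive formula from the question: the numerator counts spaces and
# punctuation, and the denominator counts the '.' token as a word.
naive_average = len(string_of_text) / len(tokens)

# Keep only tokens that contain at least one letter or digit.
words = [t for t in tokens if any(ch.isalnum() for ch in t)]

# Characters in real words divided by the number of real words.
true_average = sum(len(w) for w in words) / len(words)

print(len(string_of_text), len(tokens), naive_average)
print(sum(len(w) for w in words), len(words), true_average)
```

The two results can land close to each other by coincidence, as they do for this sentence, but only the second one divides word characters by word count.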

Am I missing something here? This must be a very common NLP task...

Recommended answer

Removing Punctuation

Use a regular expression to filter out the punctuation

>>> import re
>>> from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile(r'.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]  # drops the '.' token
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
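
One caveat: the character class `[A-Za-z0-9]` only recognizes ASCII letters and digits, so a token written entirely in another script would be discarded as punctuation. A possible alternative, sketched here with a hypothetical helper `is_word` and made-up sample tokens, is to test each character with `str.isalnum`, which is Unicode-aware.

```python
import re

def is_word(token):
    """Return True if the token contains at least one letter or digit.

    str.isalnum is Unicode-aware, so non-ASCII words are kept.
    """
    return any(ch.isalnum() for ch in token)

# Hypothetical tokens mixing scripts and punctuation.
tokens = ['東京', "don't", '...', '42', '!']

# Unicode-aware filter.
words = [t for t in tokens if is_word(t)]

# ASCII-only regex from the answer, for comparison.
non_punct = re.compile(r'.*[A-Za-z0-9].*')
ascii_words = [t for t in tokens if non_punct.match(t)]

print(words)
print(ascii_words)
```

Which filter is appropriate depends on the corpus; for English-only text the two should behave the same on typical tokens.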

Average Number of Characters

Sum the lengths of each word. Divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

Or you could make use of the counts you already did to prevent some re-computation. This multiplies the length of the word by the number of times we saw it, then sums all of that up.

>>> float(sum(len(w) * c for w, c in counts.items())) / len(filtered)  # .items() works on Python 2 and 3
3.75
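
The two steps can also be wrapped into a single function. This is only a sketch: the name `word_stats`, its return shape, and the choice to return `(0, 0.0)` for input with no words are assumptions made for illustration. It uses `search` with a bare character class, which for single-line tokens selects the same tokens as the `.*[A-Za-z0-9].*` pattern with `match`.

```python
import re
from collections import Counter

# Same criterion as above: a word must contain an ASCII letter or digit.
_HAS_ALNUM = re.compile(r'[A-Za-z0-9]')

def word_stats(tokens):
    """Return (word_count, average_word_length) ignoring punctuation tokens."""
    words = [t for t in tokens if _HAS_ALNUM.search(t)]
    if not words:
        # Avoid dividing by zero when every token is punctuation.
        return 0, 0.0
    counts = Counter(words)
    # Length of each distinct word times the number of times it occurs.
    total_chars = sum(len(w) * c for w, c in counts.items())
    return len(words), total_chars / len(words)

count, average = word_stats(['this', 'is', 'a', 'sentence', '.'])
print(count, average)
```

With an NLTK `Text` object, which can be iterated like a list of tokens, the call would presumably be `word_stats(text)`.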
