Real word count in NLTK

Views: 28  Published: 2022/1/2 17:27:29  Tags: python nlp nltk


Problem description


The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count:

text = nltk.Text(tokens)
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

len(string_of_text) is a character count, including spaces
len(text) is a token count, excluding spaces but including punctuation marks, which aren't words.
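To make the discrepancy concrete, here is a small sketch in Python 3. The sentence and the hand-written token list are made up for illustration (they are not from the NLTK book), and the alphanumeric test used to pick out words is just one possible assumption about what counts as a word.

```python
# Hypothetical sample: a raw string and the tokens a tokenizer might produce for it.
text_string = "Wait, what? No!"
tokens = ['Wait', ',', 'what', '?', 'No', '!']

# Naive estimate: characters (including spaces) divided by tokens (including punctuation).
naive = len(text_string) / len(tokens)

# Keep only tokens that contain at least one letter or digit (assumed definition of a word).
words = [t for t in tokens if any(ch.isalnum() for ch in t)]

# Average over real words only.
actual = sum(len(w) for w in words) / len(words)

print(naive)   # 15 characters / 6 tokens
print(actual)  # 10 letters / 3 words
```

The two numbers differ because both the numerator and the denominator of the naive formula count things that are not part of any word.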

Am I missing something here? This must be a very common NLP task...

Solution

Removing Punctuation

Use a regular expression to keep only the tokens that contain a letter or digit, which filters out the punctuation.

>>> import re
>>> from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile(r'.*[A-Za-z0-9].*')  # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
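The same filter can be wrapped in a small helper so that the real word count and the word frequencies come from one place. This is only a sketch: the function name `real_words` and the sample token list are hypothetical, and the pattern only recognizes ASCII letters and digits, so tokens written entirely in other scripts would be dropped.

```python
import re
from collections import Counter

# Same pattern as above: a token counts as a word if it contains a letter or digit.
NON_PUNCT = re.compile(r'.*[A-Za-z0-9].*')

def real_words(tokens):
    """Return only the tokens that contain at least one ASCII letter or digit."""
    return [t for t in tokens if NON_PUNCT.match(t)]

# Hypothetical sample; in practice this could be any list of tokens.
tokens = ['this', 'is', 'a', 'sentence', '.', 'is', 'it', '?']

words = real_words(tokens)
word_count = len(words)       # number of tokens that are not pure punctuation
word_freqs = Counter(words)   # how often each word occurs

print(word_count)
print(word_freqs.most_common(1))
```

Tokens such as "n't" are kept because they contain letters; whether that is desirable depends on how you want to define a word.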

Average Number of Characters

Sum the lengths of all the words, then divide by the number of words.

>>> float(sum(map(len, filtered))) / len(filtered)
3.75

Alternatively, reuse the counts you have already computed to avoid some recomputation. This multiplies the length of each word by the number of times it occurred, then sums the results.

>>> float(sum(len(w) * c for w, c in counts.items())) / len(filtered)
3.75
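Both approaches can be written as small functions. The sketch below assumes Python 3 (where `/` is true division) and uses hypothetical function names; it also guards against an empty word list, which the one-liners above do not.

```python
from collections import Counter

def average_word_length(words):
    """Mean characters per word; `words` must already exclude punctuation tokens."""
    if not words:
        raise ValueError("no words to average")
    return sum(len(w) for w in words) / len(words)

def average_word_length_from_counts(counts):
    """Same mean, computed from a Counter mapping word -> number of occurrences."""
    total_words = sum(counts.values())
    if total_words == 0:
        raise ValueError("no words to average")
    # Each distinct word contributes its length once per occurrence.
    total_chars = sum(len(w) * c for w, c in counts.items())
    return total_chars / total_words

filtered = ['this', 'is', 'a', 'sentence']
print(average_word_length(filtered))
print(average_word_length_from_counts(Counter(filtered)))
```

The Counter-based version only pays off when the text contains many repeated words, since it computes each distinct word's length once.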
