How do I extract keywords used in text?


Question

How do I data mine a pile of text to get keywords ranked by usage (for example, "Jacob Smith" or "fence")?

Is there software that already does this? Even a semi-automatic tool would help: if it can filter out simple words like "the", "and", and "or", I could get to the topics more quickly.

Recommended answer

The general algorithm looks like this:


- Obtain Text
- Strip punctuation, special characters, etc.
- Strip "simple" words
- Split on Spaces
- Loop Over Split Text
    - Add word to Array/HashTable/Etc if it doesn't exist;
       if it does, increment counter for that word
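
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the `STOP_WORDS` set and the `count_keywords` function are hypothetical names introduced here, the stop-word list is deliberately tiny, and the sketch splits the text into words before filtering stop words because that is simpler to write.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it"}

def count_keywords(text, stop_words=STOP_WORDS):
    """Return a Counter mapping each non-stop word to its number of occurrences."""
    # Lowercase, then replace every character that is neither a word
    # character nor whitespace (punctuation, symbols) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Split on runs of whitespace.
    words = cleaned.split()
    # Counter adds a word on first sight and increments it afterwards.
    return Counter(w for w in words if w not in stop_words)

counts = count_keywords("The fence and the gate: Jacob Smith painted the fence.")
print(counts.most_common(3))
```

Note that this counts single words only, so a name such as "Jacob Smith" is counted as two separate words. Catching multi-word keywords would require also counting adjacent word pairs, which is not shown here.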

The end result is a frequency count of all words in the text. You can then divide each count by the total number of words to get that word's frequency as a percentage. Any further processing is up to you.
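
As a small sketch of the percentage step, assuming the counts are held in a `collections.Counter` (the `relative_frequencies` helper is a hypothetical name, and the total here is the number of counted words, which excludes any stop words removed earlier):

```python
from collections import Counter

def relative_frequencies(counts):
    """Convert raw word counts into percentages of the total counted words."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for empty input.
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}

counts = Counter({"fence": 2, "gate": 1, "painted": 1})
percentages = relative_frequencies(counts)
print(percentages["fence"])  # 50.0
```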

You will also want to look into stemming. Stemming reduces words to their root form, for example going => go and cars => car.
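
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is a toy under stated assumptions, not a real stemming algorithm: `naive_stem` is a hypothetical helper that only handles a trailing "ing" or a plural "s", and it produces poor results for many words (for instance, "running" becomes "runn"). Established stemmers such as the Porter stemmer, available in NLTK as `nltk.stem.PorterStemmer`, apply far more rules.

```python
def naive_stem(word):
    """Very rough suffix stripping for illustration; not the Porter algorithm."""
    word = word.lower()
    # Strip "ing" only when at least two characters would remain.
    if word.endswith("ing") and len(word) - 3 >= 2:
        return word[:-3]
    # Strip a plural "s", but leave words ending in "ss" (e.g. "glass") alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) - 1 >= 2:
        return word[:-1]
    return word

print(naive_stem("going"))  # go
print(naive_stem("cars"))   # car
```

Applying a stemmer to each word before counting makes variants such as "fence" and "fences" accumulate under one entry.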

An algorithm like this is common in spam filters, keyword indexing, and the like.
