How do I extract keywords used in text?
Question
How do I data mine a pile of text to get keywords by usage? ("Jacob Smith" or "fence")
Is there software that already does this, even semi-automatically? If it could filter out simple words like "the", "and", "or", I could get to the topics quicker.
Recommended answer
The general algorithm is going to look something like this:
- Obtain Text
- Strip punctuation, special characters, etc.
- Strip "simple" words
- Split on Spaces
- Loop Over Split Text
  - Add word to Array/HashTable/etc. if it doesn't exist; if it does, increment the counter for that word
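The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a finished tool: the `STOP_WORDS` set and the `count_keywords` function are hypothetical names, the stop-word list is far shorter than a real one, and the sketch filters "simple" words after splitting rather than before, which gives the same counts. It also counts single words only, so a multi-word name like "Jacob Smith" ends up as two separate entries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it"}

def count_keywords(text, stop_words=STOP_WORDS):
    """Return a Counter mapping each non-stop word to its number of occurrences."""
    # Lowercase, then replace every character that is not a word character,
    # whitespace, or an apostrophe with a space (strips punctuation).
    cleaned = re.sub(r"[^\w\s']", " ", text.lower())
    # Split on whitespace.
    words = cleaned.split()
    counts = Counter()
    for word in words:
        if word in stop_words:
            continue  # skip "simple" words
        counts[word] += 1  # a missing key starts at 0, so this adds or increments
    return counts

sample = "The fence is old, and the fence is broken. Jacob Smith fixed the fence."
counts = count_keywords(sample)
print(counts.most_common(3))
```

`Counter` plays the role of the hash table here; a plain `dict` with `counts.get(word, 0) + 1` would work the same way.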
The end result is a frequency count of all words in the text. You can then take these values and divide by the total number of words to get a percentage of frequency. Any further processing is up to you.
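As a small worked example of the percentage step, here is a sketch that starts from made-up counts. It assumes "total number of words" means the total of the counted words (after stop words are removed); dividing by the raw word count of the text instead is an equally reasonable reading.

```python
from collections import Counter

# Hypothetical frequency counts, as produced by a word-counting pass.
counts = Counter({"fence": 3, "jacob": 1, "smith": 1})

# Total number of counted words.
total = sum(counts.values())

# Share of each word as a percentage of all counted words.
percentages = {word: 100.0 * n / total for word, n in counts.items()}

# Print words from most to least frequent.
for word, pct in sorted(percentages.items(), key=lambda item: -item[1]):
    print(f"{word}: {pct:.1f}%")
```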
You're also going to want to look into stemming. Stemming is used to reduce words to their root, for example going => go, cars => car, etc.
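To make the idea concrete, here is a deliberately naive suffix-stripping sketch. The `naive_stem` function is a hypothetical toy written for this example, not a real stemming algorithm such as Porter's; it only handles the two cases mentioned above and gets many English words wrong (for instance, "running" becomes "runn"). For real use, an established stemmer from an NLP library would be the better choice.

```python
def naive_stem(word):
    """Toy stemmer: strips a trailing "ing" or a plural "s". Illustration only."""
    word = word.lower()
    # "going" -> "go"; the length check avoids mangling short words like "king".
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    # "cars" -> "car"; words ending in "ss" (e.g. "glass") are left alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ["going", "cars", "fence", "glass"]:
    print(w, "=>", naive_stem(w))
```

Applying a stemmer to each word before counting makes variants such as "car" and "cars" share one counter.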
An algorithm like this is going to be common in spam filters, keyword indexing and the like.