Text mining with the tm package in R: remove words starting with "http" or any other specific word

Problem description

I am new to R and text mining. I made a word cloud from a Twitter feed related to a particular term. The problem is that the word cloud shows http:... or htt... tokens. How do I deal with this? I tried using the metacharacter *, but I doubt I am applying it correctly:

tw.text = removeWords(tw.text,c(stopwords("en"),"rt","http\\*"))

Could someone with text-mining experience please help me with this?

Recommended answer

If you want to remove URLs from your strings, you can use:

gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)

where x is:

x <- c("some text http://idontwantthis.com", 
         "same problem again http://pleaseremoveme.com")


It would be easier to give a specific answer if you posted a sample of your data, but the following example produces clean text with no URLs:

> clean_x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
> clean_x
[1] "some text "          "same problem again "
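
The pattern above expects a complete URL ending in a dot followed by letters, so it may leave behind truncated fragments such as "htt", and its greedy (.*) can swallow text between two URLs in the same string. As a minimal sketch of an alternative, assuming it is acceptable to drop every whitespace-delimited token that begins with "htt", something like the following could be tried. The tweets vector is made-up sample data, not taken from the question.

```r
# Hypothetical sample tweets; not from the original question.
tweets <- c("rt check this http://t.co/abc123 and https://example.com/a/b?c=1 now",
            "truncated link at the end htt")

# Remove any token that starts with "htt" at a word boundary.
# This covers http://..., https://... and truncated "htt" fragments,
# but it would also remove ordinary words beginning with "htt" (e.g. "httpd").
no_urls <- gsub("\\bhtt\\S*", "", tweets, perl = TRUE)

# Collapse the leftover runs of whitespace and trim both ends.
no_urls <- trimws(gsub("\\s+", " ", no_urls, perl = TRUE))
no_urls
```

Running this step before removeWords or punctuation removal keeps the URL in one piece, so it can be matched as a single token.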

As a side point, it may be worth searching for existing methods to clean text before mining. For example, the clean function discussed here would let you do this automatically. Along similar lines, there are functions that strip tweet-specific markers (#, @), punctuation and other undesirable entries from your text.
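
As a rough illustration of that kind of cleaning, here is a base R sketch that strips @mentions, #hashtags and punctuation. The helper name clean_tweet and the sample input are hypothetical; this is not the clean function referred to above, only an assumption about what such a helper might do.

```r
# Hypothetical helper; the name and behaviour are illustrative assumptions.
clean_tweet <- function(x) {
  # Drop @mentions and #hashtags together with the word attached to them.
  x <- gsub("[@#]\\w+", "", x, perl = TRUE)
  # Replace any remaining punctuation with a space.
  # Note: URLs should be removed before this step, otherwise
  # "http://t.co/xyz" is split into fragments such as "http" and "t".
  x <- gsub("[[:punct:]]+", " ", x)
  # Lower-case, collapse repeated whitespace and trim both ends.
  x <- tolower(x)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

cleaned <- clean_tweet("RT @user: Loving #rstats, really!!")
cleaned
```

The retweet marker "rt" survives this helper; it could then be dropped with removeWords, as in the question's own code.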
