Big Data Processing and Analysis in R


Problem Description


I know this is not a new concept by any stretch in R, and I have browsed the High-Performance and Parallel Computing task view. With that said, I am asking this question from a point of ignorance, as I have no formal training in computer science and am entirely self-taught.

Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:

  1. Read and process the data into a data frame
  2. Basic descriptive analysis, including text mining (frequent terms, etc.)
  3. Plotting

Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?

Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a third-grade level, either.

Thanks in advance.

Solution

If you need to operate on the entire 10GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.

(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh in at a couple of kilobytes of data. You might reduce memory overhead if you preprocess the data outside of R to extract only the content you need, such as the author name and tweet text.)
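For what it's worth, that kind of preprocessing can also be done from R itself in roughly constant memory by streaming the file in chunks. The sketch below uses the jsonlite package (my choice for illustration, not something mentioned above); the file name and the Twitter field names (user$screen_name, created_at, text) are assumptions about your data and may need adjusting:

    # Stream the raw Twitter JSON in pages and keep only a few small fields,
    # so the full 10 GB file never has to fit in memory at once.
    # Assumes one JSON object per line, as the streaming API typically delivers.
    library(jsonlite)

    slim <- list()  # accumulator for the slimmed-down pages

    stream_in(
      file("tweets.json"),
      handler = function(chunk) {
        # 'chunk' is a data frame holding just this page of parsed tweets
        slim[[length(slim) + 1]] <<- data.frame(
          author  = chunk$user$screen_name,
          created = chunk$created_at,
          text    = chunk$text,
          stringsAsFactors = FALSE
        )
      },
      pagesize = 10000
    )

    tweets <- do.call(rbind, slim)  # small data frame: author, created, text

The resulting tweets data frame carries only a few columns, which should be a far better fit for in-memory text mining and plotting than the raw JSON.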

On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc -- you could consider using Hadoop to drive R.

Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
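To make that model a bit more concrete: one low-tech way to have Hadoop drive R is Hadoop Streaming, where the mapper and reducer are ordinary R scripts that read stdin and write tab-separated key/value pairs. The sketch below counts tweets per author; it is my own illustration of the pattern (not taken from any particular book or package), and the jar path, input/output paths, and field names are assumptions:

    ## Save the two sections below as separate files (mapper.R, reducer.R) and
    ## pass them to the streaming jar, for example:
    ##   hadoop jar hadoop-streaming.jar -input tweets/ -output counts/ \
    ##     -mapper "Rscript mapper.R" -reducer "Rscript reducer.R" -file mapper.R -file reducer.R

    ## ---- mapper.R: emit "author<TAB>1" for every tweet read from stdin ----
    library(jsonlite)  # assumes one JSON tweet per line
    con <- file("stdin", open = "r")
    while (length(line <- readLines(con, n = 1)) > 0) {
      tweet <- tryCatch(fromJSON(line), error = function(e) NULL)
      if (!is.null(tweet$user$screen_name))
        cat(tweet$user$screen_name, "\t1\n", sep = "")
    }
    close(con)

    ## ---- reducer.R: Hadoop sorts mapper output by key, so each author's 1s arrive together ----
    con <- file("stdin", open = "r")
    current <- NULL; total <- 0
    while (length(line <- readLines(con, n = 1)) > 0) {
      parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
      if (!identical(parts[1], current)) {
        if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
        current <- parts[1]
        total   <- 0
      }
      total <- total + as.numeric(parts[2])
    }
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    close(con)

Each node then runs plain R over its own slice of the data; packages such as RHIPE (see the pointers below) wrap this same map/reduce idea in a higher-level R interface.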

A couple of pointers:

  • an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.

  • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
