Big Data Process and Analysis in R
Problem Description
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance as I have no formal training in Computer Science and am entirely self taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
- Read and process the data into a data frame
- Basic descriptive analysis, including text mining (frequent terms, etc.)
- Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
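The sampling route mentioned here does not actually require a database: a single-pass reservoir sample can pull a uniform random subset from a file far too large for memory. A rough Python sketch (the file name, sample size, and helper name are placeholders, not part of any existing tool):

```python
import random


def sample_lines(path, k, seed=42):
    """Reservoir-sample k lines from a file too large to load whole.

    Makes a single pass over the file; at any point the reservoir
    holds a uniform random sample of the lines seen so far, so memory
    use stays flat at k lines regardless of file size.
    """
    random.seed(seed)
    reservoir = []
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i < k:
                reservoir.append(line)
            else:
                # Replace a random slot with probability k / (i + 1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return reservoir


# Hypothetical usage: write a 10k-line sample that R can then read,
# e.g. with read.csv() or a JSON reader, depending on the format.
# sample = sample_lines("tweets.json", 10_000)
# with open("tweets_sample.json", "w", encoding="utf-8") as out:
#     out.writelines(sample)
```

Since the streaming API emits one JSON object per line, sampling lines is the same as sampling tweets.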
Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a third-grade level, either.
Thanks in advance.
Answer

If you need to operate on the entire 10 GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of kilobytes. You might reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as author name and tweet text.)
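As a sketch of that preprocessing step, a short Python script can stream the raw file one JSON object per line and write only a few fields to a much smaller CSV that R can read with read.csv(). The field names here assume the classic tweet payload (user.screen_name, created_at, text); adjust them to whatever your capture actually contains:

```python
import csv
import json


def extract_fields(json_path, csv_path):
    """Stream one-JSON-object-per-line tweet data and keep only the
    fields needed downstream, writing a compact CSV.

    Skips blank keep-alive lines, truncated records, and non-tweet
    events such as delete notices (which have no "text" field).
    """
    with open(json_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["author", "created_at", "text"])
        for line in src:
            line = line.strip()
            if not line:
                continue  # keep-alive newline from the streaming API
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated or corrupt record
            if "text" not in tweet:
                continue  # delete notice or other non-tweet event
            writer.writerow([
                tweet.get("user", {}).get("screen_name", ""),
                tweet.get("created_at", ""),
                tweet["text"],
            ])
```

Because it processes one line at a time, this runs in constant memory no matter how large the input file grows.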
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc. -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
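The grouping idea maps naturally onto the MapReduce model. With Hadoop Streaming, the mapper and reducer are ordinary scripts reading stdin and writing tab-separated key/value pairs; a minimal Python sketch counting tweets per author (function and file names are illustrative, not an existing tool):

```python
import json
from itertools import groupby


def mapper(lines):
    """Emit (screen_name, 1) for each tweet.

    Under Hadoop Streaming this would print tab-separated pairs to
    stdout; here it yields tuples so the logic is easy to test.
    """
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip keep-alives and corrupt records
        name = tweet.get("user", {}).get("screen_name")
        if name:
            yield name, 1


def reducer(pairs):
    """Sum counts per key.

    Hadoop delivers mapper output sorted by key between the phases;
    locally we sort ourselves before grouping.
    """
    for name, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield name, sum(count for _, count in group)


# Hypothetical local smoke test of the same pipeline Hadoop would run:
#   counts = dict(reducer(mapper(open("tweets.json"))))
# In a real Streaming job, mapper and reducer live in separate scripts
# passed via hadoop jar hadoop-streaming.jar -mapper ... -reducer ...
```

The per-author output files are then small enough to pull into R for the descriptive statistics and plotting.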
A couple of pointers:
- An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
- You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.