Big Data Process and Analysis in R
Problem Description
I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance as I have no formal training in Computer Science and am entirely self taught.
Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:
- Read and process the data into a data frame
- Basic descriptive analysis, including text mining (frequent terms, etc.)
- Plotting
Is it possible to use R entirely for this, or will I have to write some Python to parse the data and throw it into a database in order to take random samples small enough to fit into R?
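The sampling route mentioned here does not actually require a database: a single-pass reservoir sample can pull a uniform random subset from a file far too large for memory. A rough Python sketch (the file name, sample size, and helper name are placeholders, not part of any existing tool):

```python
import random


def sample_lines(path, k, seed=42):
    """Reservoir-sample k lines from a file too large to load whole.

    Makes a single pass over the file; at any point the reservoir
    holds a uniform random sample of the lines seen so far, so memory
    use stays flat at k lines regardless of file size.
    """
    random.seed(seed)
    reservoir = []
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i < k:
                reservoir.append(line)
            else:
                # Replace a random slot with probability k / (i + 1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return reservoir


# Hypothetical usage: write a 10k-line sample that R can then read,
# e.g. with read.csv() or a JSON reader, depending on the format.
# sample = sample_lines("tweets.json", 10_000)
# with open("tweets_sample.json", "w", encoding="utf-8") as out:
#     out.writelines(sample)
```

Since the streaming API emits one JSON object per line, sampling lines is the same as sampling tweets.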
Simply put, any tips or pointers you can provide will be greatly appreciated. Again, I won't take offense if you describe solutions at a third-grade level, either.
Thanks in advance.
Answer

If you need to operate on the entire 10 GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
(The Twitter streaming API returns a pretty rich object: a single 140-character tweet can weigh a couple of kilobytes. You might reduce memory overhead by preprocessing the data outside of R to extract only the content you need, such as author name and tweet text.)
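As a sketch of that preprocessing step, a short Python script can stream the raw file one JSON object per line and write only a few fields to a much smaller CSV that R can read with read.csv(). The field names here assume the classic tweet payload (user.screen_name, created_at, text); adjust them to whatever your capture actually contains:

```python
import csv
import json


def extract_fields(json_path, csv_path):
    """Stream one-JSON-object-per-line tweet data and keep only the
    fields needed downstream, writing a compact CSV.

    Skips blank keep-alive lines, truncated records, and non-tweet
    events such as delete notices (which have no "text" field).
    """
    with open(json_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["author", "created_at", "text"])
        for line in src:
            line = line.strip()
            if not line:
                continue  # keep-alive newline from the streaming API
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated or corrupt record
            if "text" not in tweet:
                continue  # delete notice or other non-tweet event
            writer.writerow([
                tweet.get("user", {}).get("screen_name", ""),
                tweet.get("created_at", ""),
                tweet["text"],
            ])
```

Because it processes one line at a time, this runs in constant memory no matter how large the input file grows.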
On the other hand, if your analysis is amenable to segmenting the data -- for example, you want to first group the tweets by author, date/time, etc. -- you could consider using Hadoop to drive R.
Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
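The grouping idea maps naturally onto the MapReduce model. With Hadoop Streaming, the mapper and reducer are ordinary scripts reading stdin and writing tab-separated key/value pairs; a minimal Python sketch counting tweets per author (function and file names are illustrative, not an existing tool):

```python
import json
from itertools import groupby


def mapper(lines):
    """Emit (screen_name, 1) for each tweet.

    Under Hadoop Streaming this would print tab-separated pairs to
    stdout; here it yields tuples so the logic is easy to test.
    """
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip keep-alives and corrupt records
        name = tweet.get("user", {}).get("screen_name")
        if name:
            yield name, 1


def reducer(pairs):
    """Sum counts per key.

    Hadoop delivers mapper output sorted by key between the phases;
    locally we sort ourselves before grouping.
    """
    for name, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield name, sum(count for _, count in group)


# Hypothetical local smoke test of the same pipeline Hadoop would run:
#   counts = dict(reducer(mapper(open("tweets.json"))))
# In a real Streaming job, mapper and reducer live in separate scripts
# passed via hadoop jar hadoop-streaming.jar -mapper ... -reducer ...
```

The per-author output files are then small enough to pull into R for the descriptive statistics and plotting.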
A couple of pointers:
- An example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
- You can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.