Does anyone know how I can work with big data in R?

Problem description

Analyzing tweets in RStudio:

My CSV file contains 4,000,000 tweets with five columns: screen_name, text, created_at, favorite_count, and retweet_count.

I am trying to identify the frequency of hashtags using the following code, but it runs for several days and sometimes RStudio crashes.

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(stop_words, by = "word")

I have tried other approaches to handling big data in R, such as https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/ and the Spark library: https://spark.rstudio.com/guides/textmining/. None of them has worked well for me.

In Spark, I do the following, but RStudio is not able to copy my dataset to Spark. I have seen "Spark is Running" in RStudio for as long as a day without my dataset ever being copied to Spark.

Connect to your Spark cluster:

library(sparklyr)
spark_conn <- spark_connect("local")

Copy track_metadata to Spark:

track_metadata_tbl <- copy_to(spark_conn, my_database)

Do you have any suggestions, instructions, or links that could help me analyze my data?

My laptop is a Mac with a 2.9 GHz Dual-Core Intel Core i5 processor and 8 GB of 2133 MHz LPDDR3 memory.

Recommended answer

If I were in your situation, I would not try to parse that whole file at once but instead work with a chunk at a time.

I would use vroom to read in the data and work with it in chunks (starting with, say, 50k lines and then seeing how much you can scale up to do at once).
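As a rough sketch of what chunked reading could look like (the file name, column order, and 50k chunk size here are assumptions for illustration, not details from the answer), vroom's skip and n_max arguments let you pull in one slice of the file at a time:

library(vroom)

# Hypothetical file name and chunk size; adjust to your data.
path       <- "tweets.csv"
chunk_size <- 50000

# Read one chunk of the CSV: skip the header row plus all previously
# processed rows, then take the next `chunk_size` rows.
read_chunk <- function(path, chunk_index, chunk_size) {
  vroom(
    path,
    skip      = 1 + chunk_index * chunk_size,
    n_max     = chunk_size,
    col_names = c("screen_name", "text", "created_at",
                  "favorite_count", "retweet_count"),
    col_types = "cccdd"
  )
}

first_chunk <- read_chunk(path, 0, chunk_size)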

If you are interested in only counting hashtags, you can do something like:

library(dplyr)
library(stringr)
library(tidytext)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>%   # keep only tokens that start with "#"
  count(word, sort = TRUE)

Append this to a new CSV of aggregated results, then work through your whole dataset in chunks. At the end, you can parse your CSV of results and re-aggregate the counts to find the overall hashtag frequencies.
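A minimal sketch of that whole loop, again assuming a tweets.csv input, a hashtag_counts.csv file for partial results, and 50k-row chunks (none of these names or numbers come from the original answer):

library(vroom)
library(dplyr)
library(stringr)
library(tidytext)
library(readr)

path       <- "tweets.csv"            # hypothetical input file
out_path   <- "hashtag_counts.csv"    # hypothetical file of partial results
chunk_size <- 50000
total_rows <- 4000000                 # row count from the question

# Delete any existing out_path before a fresh run so old partial
# results are not double-counted.
for (i in seq_len(ceiling(total_rows / chunk_size)) - 1) {
  chunk <- vroom(
    path,
    skip      = 1 + i * chunk_size,   # skip the header plus earlier chunks
    n_max     = chunk_size,
    col_names = c("screen_name", "text", "created_at",
                  "favorite_count", "retweet_count"),
    col_types = "cccdd"
  )

  counts <- chunk %>%
    unnest_tokens(word, text, token = "tweets") %>%
    filter(str_detect(word, "^#")) %>%
    count(word)

  # Append this chunk's counts; no header rows, so the file stays uniform.
  write_csv(counts, out_path, append = TRUE)
}

# Re-aggregate the partial counts into overall hashtag frequencies.
hashtag_freq <- vroom(out_path, col_names = c("word", "n"), col_types = "ci") %>%
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n))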
