Does anyone know how I can work with big data in R?

Problem Description

Analyzing tweets in RStudio:

My csv file contains 4,000,000 tweets with five columns: screen_name, text, created_at, favorite_count, and retweet_count.

I am trying to identify the frequency of hashtags using the following code, but it has been running for several days without finishing, and sometimes RStudio crashes.

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(stop_words, by = "word")

I have tried other approaches to handling big data in R, such as https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/ and the Spark library (https://spark.rstudio.com/guides/textmining/). None of them works well for me.

In Spark, I do the following, but RStudio is not able to copy my dataset to Spark. I see "Spark is Running" in RStudio for as long as a day without my dataset ever being copied to Spark.

Connect to your Spark cluster:

library(sparklyr)
spark_conn <- spark_connect(master = "local")

Copy track_metadata to Spark:

track_metadata_tbl <- copy_to(spark_conn, my_database)
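
As an aside (not part of the original question): copy_to() pushes a data frame that is already loaded into R's memory over to Spark, which is expensive for 4,000,000 rows on 8 GB of RAM. A minimal sketch of letting Spark read the CSV directly instead; the file name "tweets.csv" and the table name "tweets" are illustrative assumptions:

library(sparklyr)

spark_conn <- spark_connect(master = "local")

# Read the CSV into Spark directly rather than copying an in-memory R object.
tweets_tbl <- spark_read_csv(spark_conn, name = "tweets", path = "tweets.csv")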

Do you have any suggestions/instructions/links that could help me analyze my data?

My laptop is a Mac. Processor: 2.9 GHz Dual-Core Intel Core i5; Memory: 8 GB 2133 MHz LPDDR3.

Recommended Answer

If I were in your situation, I would not try to parse the whole file at once, but instead work with one chunk at a time.

I would use vroom to read in the data and work with one chunk at a time (starting with, say, 50k lines and then seeing how far you can scale up the chunk size).
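
A minimal sketch of reading one 50k-row chunk with vroom; the file name "tweets.csv", the assumption of a single header row, and the chunk index i are illustrative, not from the original answer:

library(vroom)

chunk_size <- 50000
cols <- c("screen_name", "text", "created_at", "favorite_count", "retweet_count")

# Read chunk i (0-based): skip the header row plus the rows already processed.
i <- 0
chunk <- vroom("tweets.csv", col_names = cols,
               skip = 1 + i * chunk_size, n_max = chunk_size)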

If you are only interested in counting hashtags, you can do something like:

library(tidytext)
library(dplyr)
library(stringr)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>%
  count(word, sort = TRUE)

Then append the result to a new CSV of aggregated counts, and work through your whole dataset chunk by chunk. At the end, you can parse that results CSV and re-aggregate the counts (summing them up) to get the overall hashtag frequencies.
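
A rough sketch of that append-and-re-aggregate step; the object name counts (the per-chunk result of the pipeline above) and the file name "hashtag_counts.csv" are assumptions for illustration:

library(vroom)
library(dplyr)

# Append this chunk's hashtag counts to a running results file
# (the header is written only when the file does not exist yet).
vroom_write(counts, "hashtag_counts.csv", delim = ",",
            append = file.exists("hashtag_counts.csv"))

# After all chunks are done, re-aggregate into overall hashtag frequencies.
vroom("hashtag_counts.csv") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))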
