Does anyone know how I can work with big data in R?
Question
Analyzing tweets in RStudio:
My CSV file contains 4,000,000 tweets with five columns: screen_name, text, created_at, favorite_count, and retweet_count.
I am trying to identify the frequency of hashtags using the following code; however, it has been running for several days and sometimes RStudio crashes.
library(dplyr)
library(tidytext)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(stop_words, by = "word")
I have tried other approaches to handling big data in R, such as the strategies in https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/ and the Spark text-mining guide at https://spark.rstudio.com/guides/textmining/. None of them worked for me.
In Spark, I do the following, but RStudio is not able to copy my dataset to Spark. I see "Spark is Running" in RStudio for as long as a day without my dataset ever being copied to Spark.
Connect to your Spark cluster:
library(sparklyr)
spark_conn <- spark_connect(master = "local")
Copy track_metadata to Spark:
track_metadata_tbl <- copy_to(spark_conn, track_metadata)
Do you have any suggestions/instructions/links that could help me analyze my data?
My laptop is a Mac. Processor: 2.9 GHz Dual-Core Intel Core i5; Memory: 8 GB 2133 MHz LPDDR3.
Answer
If I were in your situation, I would not try to parse the whole file at once but would instead work with one chunk at a time.
I would use vroom to read in the data and process it in chunks (starting with, say, 50k lines, then seeing how far you can scale up at once).
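The chunked-reading loop can be sketched like this. The file path, column layout, and chunk size here are stand-ins (a tiny demo file replaces the real 4,000,000-row CSV so the sketch is self-contained), and base R's read.csv is used so it runs without extra packages; vroom::vroom takes the same skip/n_max arguments and will be much faster on the real file.

```r
# Self-contained sketch of chunked reading. The tiny demo file stands in
# for the real tweets CSV; chunk_size would be ~50000 in practice.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(text = sprintf("tweet %d #rstats", 1:12)),
          csv_path, row.names = FALSE)

col_names <- names(read.csv(csv_path, nrows = 1))  # read the header once
chunk_size <- 5
skip <- 0
rows_seen <- 0
repeat {
  chunk <- tryCatch(
    read.csv(csv_path, skip = skip + 1, nrows = chunk_size,
             header = FALSE, col.names = col_names),
    error = function(e) NULL)           # skip ran past the end of the file
  if (is.null(chunk) || nrow(chunk) == 0) break
  rows_seen <- rows_seen + nrow(chunk)  # ...process `chunk` here...
  if (nrow(chunk) < chunk_size) break   # last, partial chunk
  skip <- skip + chunk_size
}
rows_seen  # 12: every row was visited exactly once
```

Each pass re-opens the file and skips ahead, so memory stays bounded by the chunk size rather than the full dataset.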
If you are only interested in counting hashtags, you can do something like:
library(dplyr)
library(stringr)
library(tidytext)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>%
  count(word, sort = TRUE)
Append each chunk's result to a new CSV of aggregated counts, then work through the whole dataset chunk by chunk. At the end, parse the results CSV and re-aggregate: summing the per-chunk counts gives the overall hashtag frequencies.
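The final re-aggregation step can be sketched as follows. The `word`/`n` columns are an assumption about what the per-chunk counts look like (that is what `count()` emits), and a small in-memory data frame stands in for the appended results CSV so the sketch runs on its own. Base R's aggregate() is used here; with dplyr, `count(word, wt = n, sort = TRUE)` does the same thing.

```r
# Demo stand-in for the appended per-chunk counts: the same hashtag can
# appear in several chunks, each contributing a partial count `n`.
per_chunk <- data.frame(
  word = c("#rstats", "#bigdata", "#rstats", "#spark"),
  n    = c(10, 4, 7, 2)
)

# Sum the partial counts per hashtag, then sort by total, descending.
totals <- aggregate(n ~ word, data = per_chunk, FUN = sum)
totals <- totals[order(-totals$n), ]
totals  # "#rstats" comes out on top with n = 17
```

Because addition is associative, summing the partial counts gives exactly the same frequencies as counting the whole file in one pass.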