Does anyone know how I can work with big data in R?
Problem Description
Analyzing tweets in RStudio:
My CSV file contains 4,000,000 tweets with five columns: screen_name, text, created_at, favorite_count, and retweet_count.
I am trying to count the frequency of hashtags using the following code, but it has been running for several days and sometimes RStudio crashes.
library(dplyr)
library(tidytext)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(stop_words, by = "word")
I have tried other approaches to handling big data in R, such as https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/ and the Spark library: https://spark.rstudio.com/guides/textmining/. None of them worked well for me.
In Spark, I do the following, but RStudio is unable to copy my dataset to Spark. RStudio shows "Spark is Running" for as long as a day without ever copying my dataset to Spark.
Connect to your Spark cluster:
spark_conn <- spark_connect("local")
Copy track_metadata to Spark:
track_metadata_tbl <- copy_to(spark_conn, my_database)
Do you have any suggestions/instructions/links that would help me analyze my data?
My laptop is a Mac. Processor: 2.9 GHz Dual-Core Intel Core i5. Memory: 8 GB 2133 MHz LPDDR3.
Recommended Answer
If I were in your situation, I would not try to parse that whole file at once but instead work with a chunk at a time.
I would use vroom to read in the data, and work with chunks of the data at a time (starting with, say, 50k lines and then seeing how far you can scale up).
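A minimal sketch of such a chunked loop, assuming the input is a headered file named tweets.csv with the five columns described in the question (both file names here are hypothetical), using vroom's skip and n_max arguments to read one slice per pass:

```r
library(vroom)
library(dplyr)
library(tidytext)
library(stringr)

chunk_size <- 50000
cols <- c("screen_name", "text", "created_at",
          "favorite_count", "retweet_count")

i <- 0
repeat {
  # skip the header plus all previously processed rows
  chunk <- vroom("tweets.csv", delim = ",", col_names = cols,
                 skip = i * chunk_size + 1, n_max = chunk_size,
                 show_col_types = FALSE)
  if (nrow(chunk) == 0) break

  counts <- chunk %>%
    unnest_tokens(word, text, token = "tweets") %>%
    filter(str_detect(word, "^#")) %>%
    count(word)

  # append this chunk's counts to a running results file
  vroom_write(counts, "hashtag_counts.csv", delim = ",",
              append = (i > 0))
  i <- i + 1
}
```

Each pass keeps only one chunk in memory, which is the point on an 8 GB machine; the per-chunk counts are merged in a final aggregation step afterward.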
If you are interested in only counting hashtags, you can do something like:
library(dplyr)
library(tidytext)
library(stringr)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>%
  count(word, sort = TRUE)
Then append this to a new CSV of aggregated results, and work through your whole dataset in chunks. At the end, parse your CSV of results and re-aggregate the counts to find the overall hashtag frequencies.
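The final re-aggregation could look like this, assuming the per-chunk counts were appended to a file named hashtag_counts.csv (a hypothetical name) with columns word and n:

```r
library(vroom)
library(dplyr)

vroom("hashtag_counts.csv", delim = ",", show_col_types = FALSE) %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))
```

Because counting is additive, summing the per-chunk counts gives the same result as counting over the full dataset in one pass.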