Split a huge dataframe into many smaller dataframes to create a corpus in R
Question
I need to create a corpus from a huge dataframe (about 170,000 rows, but only two columns) to mine some text and group it by username according to the search terms. For example, I start from a dataframe like this:
username search_term
name_1 "some_text_1"
name_1 "some_text_2"
name_2 "some_text_3"
name_2 "some_text_4"
name_3 "some_text_5"
name_3 "some_text_6"
name_3 "some_text_1"
[...]
name_n "some_text_n-1"
And I want to obtain:
data frame 1
username search_term
name_1 "some_text_1"
name_1 "some_text_2"
data frame 2
username search_term
name_2 "some_text_3"
name_2 "some_text_4"
And so on...
Any ideas? I thought of a for loop, but it is too slow, since I need to create about 11,000 data frames...
To see how to transform a list into a corpus, see: How to transform a list into a corpus in R?
Answer
We can split the dataset ('df1') into a list:
lst <- split(df1, df1$username)
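As a minimal sketch (with a small toy data frame standing in for the real 170,000-row df1), split returns a named list with one data frame per username:

```r
# Toy stand-in for the real data frame
df1 <- data.frame(
  username    = c("name_1", "name_1", "name_2", "name_2", "name_3"),
  search_term = c("some_text_1", "some_text_2", "some_text_3",
                  "some_text_4", "some_text_5"),
  stringsAsFactors = FALSE
)

# One data frame per username, named after the grouping values
lst <- split(df1, df1$username)

names(lst)        # "name_1" "name_2" "name_3"
nrow(lst$name_1)  # 2
```

Each element can then be pulled out by name (e.g. lst$name_1) or iterated over with lapply.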
Usually, it is better to stop here and do all the calculations/analysis within the list itself. But, if we want to create 1000's of objects in the global environment, one way is using list2env after naming the list elements with the object names we desire.
list2env(setNames(lst, paste0('DataFrame',
    seq_along(lst))), envir = .GlobalEnv)
DataFrame1
DataFrame2
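Following the advice above to stay with the list, here is a hedged sketch (again with a toy df1) of doing the per-user work with vapply instead of materializing thousands of global objects, for example collapsing each user's search terms into one document string ready for a corpus constructor:

```r
df1 <- data.frame(
  username    = c("name_1", "name_1", "name_2"),
  search_term = c("some_text_1", "some_text_2", "some_text_3"),
  stringsAsFactors = FALSE
)
lst <- split(df1, df1$username)

# One document string per username, names carried over from the list
docs <- vapply(lst, function(d) paste(d$search_term, collapse = " "),
               character(1))

docs["name_1"]  # "some_text_1 some_text_2"
```

This keeps everything addressable by username and avoids cluttering the global environment.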
Another option for keeping the data together is to nest it:
library(dplyr)
library(tidyr)
df1 %>%
    nest(-username)   # in tidyr >= 1.0, this is spelled nest(data = -username)
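To show what nest produces, a minimal sketch with a toy df1 (using the tidyr >= 1.0 spelling, nest(data = -username)): each username keeps its rows as a data frame inside a list-column.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  username    = c("name_1", "name_1", "name_2"),
  search_term = c("some_text_1", "some_text_2", "some_text_3"),
  stringsAsFactors = FALSE
)

# One row per username; the matching rows live in the 'data' list-column
nested <- df1 %>% nest(data = -username)

nested$username   # "name_1" "name_2"
nested$data[[1]]  # the two search_term rows belonging to name_1
```

Unlike list2env, this keeps everything in a single object that still works with the usual dplyr verbs.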