Extract multiple data.frames from one with selection criteria

Problem description

This is my dataset:

df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000), 
             split = sample( c('SPLITMEHERE', 'OBS'), 1000, replace=TRUE, prob=c(0.04, 0.96) ))

So, I have some variables (in my case, 15), and a criterion by which I want to split the data.frame into multiple data.frames.

My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the values (that is, all the 'OBS' rows) below it and build a data.frame from just these observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.

I know it sounds confusing and as if it doesn't make much sense, but this is the result of extracting the raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' marks a new table in this .txt file, but each county is divided into two tables, and I want one table (data.frame) for each county.

In the hope of making this clearer, here is an example of exactly what I need. Let's say the first 20 observations are:

             x1          x2           x3       split
1    0.307379064 0.400526799 0.2898194543         SPLITMEHERE
2    0.465236674 0.915204924 0.5168274657         OBS
3    0.063814420 0.110380201 0.9564822116         OBS
4    0.401881416 0.581895095 0.9443995396         OBS
5    0.495227871 0.054014926 0.9059893533         SPLITMEHERE
6    0.091463620 0.945452614 0.9677482590         OBS
7    0.876123151 0.702328031 0.9739113525         OBS
8    0.413120761 0.441159673 0.4725571219         OBS
9    0.117764512 0.390644966 0.3511555807         OBS
10   0.576699384 0.416279417 0.8961428872         OBS
11   0.854786077 0.164332814 0.1609375612         OBS
12   0.336853841 0.794020157 0.0647337821         SPLITMEHERE
13   0.122690541 0.700047133 0.9701538396         OBS
14   0.733926139 0.785366852 0.8938749305         OBS
15   0.520766503 0.616765349 0.5136788010         OBS
16   0.628549288 0.027319848 0.4509875809         OBS
17   0.944188977 0.913900539 0.3767973795         OBS
18   0.723421337 0.446724318 0.0925365961         OBS
19   0.758001243 0.530991725 0.3916394396         SPLITMEHERE
20   0.888036748 0.862066601 0.6501050976         OBS

What I would like to get is:

data.frame1:

1    0.465236674 0.915204924 0.5168274657         OBS
2    0.063814420 0.110380201 0.9564822116         OBS
3    0.401881416 0.581895095 0.9443995396         OBS
4    0.091463620 0.945452614 0.9677482590         OBS
5    0.876123151 0.702328031 0.9739113525         OBS
6    0.413120761 0.441159673 0.4725571219         OBS
7    0.117764512 0.390644966 0.3511555807         OBS
8    0.576699384 0.416279417 0.8961428872         OBS
9    0.854786077 0.164332814 0.1609375612         OBS

data.frame2:
1    0.122690541 0.700047133 0.9701538396         OBS
2    0.733926139 0.785366852 0.8938749305         OBS
3    0.520766503 0.616765349 0.5136788010         OBS
4    0.628549288 0.027319848 0.4509875809         OBS
5    0.944188977 0.913900539 0.3767973795         OBS
6    0.723421337 0.446724318 0.0925365961         OBS
7    0.888036748 0.862066601 0.6501050976         OBS

Therefore, the split column only shows me where to split; the data in rows where 'SPLITMEHERE' is written is meaningless. But this is no bother, as I can delete these rows later. The point is to separate multiple data.frames based on this criterion.

Obviously, just the split() function and filter() from dplyr wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, as the example above shows. Sometimes there is a gap of 3 lines, and other times it could be 10 or 15 lines.

Is there any way to extract this efficiently in R?

Recommended answer

The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use split() to get your result.

With that said, you can use cumsum() to build the groups. Here I divide the cumsum by 2 and take the ceiling, so that each pair of consecutive SPLITMEHERE markers is collapsed into one group. I also use ifelse() to exclude the rows containing SPLITMEHERE:

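# Number each pair of SPLITMEHERE markers 1, 2, ...; the marker rows themselves get group 0
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split == "SPLITMEHERE") / 2), 0)
# One data.frame per group, returned as a named list
res <- split(df, df$group)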
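To see why dividing the running count by 2 pairs up the markers, it may help to trace the intermediate values on the 20-row example from the question. The following is only an illustrative sketch: the names `split_col`, `is_marker`, `marker_count`, and `pair_id` are introduced here for readability and are not part of the original answer.

```r
# Marker column copied from the 20 example rows shown in the question
split_col <- c("SPLITMEHERE", rep("OBS", 3),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", rep("OBS", 6),
               "SPLITMEHERE", "OBS")

is_marker    <- split_col == "SPLITMEHERE"
marker_count <- cumsum(is_marker)           # how many markers seen so far
pair_id      <- ceiling(marker_count / 2)   # markers 1 and 2 -> 1, markers 3 and 4 -> 2
group        <- ifelse(is_marker, 0, pair_id)  # marker rows go to group 0

# Show every step side by side, then the size of each group
print(data.frame(row = seq_along(split_col), split_col,
                 marker_count, pair_id, group))
print(table(group))
```

Group 1 ends up with the 9 'OBS' rows of the first two tables and group 2 with the 7 rows of the next two, which matches the data.frame1 and data.frame2 sizes requested in the question.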

The result is a list with a data frame for each group. Group 0 collects the SPLITMEHERE rows themselves and is the one you want to throw out.
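If you would rather not create group 0 at all, one possible variation is to filter the rows before splitting. This is a sketch under some assumptions, not the answer's own code: the `set.seed()` call is added only to make the run reproducible, `keep` is a helper name introduced here, and resetting the row names is an optional choice. Note that any 'OBS' rows appearing before the first marker also have a running count of 0, so they are dropped as well.

```r
set.seed(1)  # assumption: seed added only so this sketch is reproducible
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
                 split = sample(c("SPLITMEHERE", "OBS"), 1000, replace = TRUE,
                                prob = c(0.04, 0.96)))

is_marker <- df$split == "SPLITMEHERE"
group     <- ceiling(cumsum(is_marker) / 2)

# Keep observation rows only, and only those that follow at least one marker
keep <- !is_marker & group > 0

# Split the data columns by group; the split column is no longer needed
res <- split(df[keep, c("x1", "x2", "x3")], group[keep])

# Optional: renumber the rows of each piece from 1
res <- lapply(res, function(d) { rownames(d) <- NULL; d })

length(res)             # number of per-county data.frames
sapply(res, nrow)       # rows in each one
```

Individual pieces can then be accessed by name, for example `res[["1"]]` for the first pair of tables.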
