Create multiple columns from a single column and clean up results

Problem description

I have a data frame like this:

foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
               Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))

I want to add three columns by splitting the first column on the separator _ and keeping only the numeric values obtained after the split. For rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:

> library(tidyr) 
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
                            Point.Type    rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start Zero Start    <NA>     <NA>          NA
2                           Zero Start Zero Start    <NA>     <NA>          NA
3                           Zero Start Zero Start    <NA>     <NA>          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG   -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    0.270126
7                            Zero Stop  Zero Stop    <NA>     <NA>          NA
8                           Zero Start Zero Start    <NA>     <NA>          NA

I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to accept both, so I can live with that. However, I have two problems:

  1. The new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
  2. When Point.Type can't be split, rpm_nom should be NA, but instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns gets the Zero Start/Zero Stop. Instead I want all three of them to be NA. How can I do that?

NOTE: I'm using tidyr, but you don't have to if you think there's a better way to do this.

Recommended answer

You can post-process the columns with dplyr:

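library(tidyr)
library(dplyr)
# mutate_each() and funs() are deprecated; across() needs dplyr >= 1.0.0
foo <- foo %>%
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep = "_", remove = FALSE, extra = "drop", fill = "right") %>%
  # strip every non-digit character, then convert; "" becomes NA
  mutate(across(c(rpm_nom, GVF_nom, p0in_nom),
                ~ as.numeric(gsub("[^0-9]","",.))))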

The gsub("[^0-9]","",.) part removes all non-digit characters. If you want to keep decimal points, you can use [^0-9.] instead of [^0-9] (as @PierreLafortune does in his answer), but be aware that this also keeps dots that are not meant to be decimal points. Wrapping the call in as.numeric converts the results to numeric values and, at the same time, turns the empty strings into NA. This gives the following result:

> foo
                            Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start      NA      NA       NA          NA
2                           Zero Start      NA      NA       NA          NA
3                           Zero Start      NA      NA       NA          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   0.9685555
7                            Zero Stop      NA      NA       NA          NA
8                           Zero Start      NA      NA       NA          NA
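
To see what the two patterns do in isolation, here is a minimal base R sketch. The sample vector x is made up for illustration (it is not part of the original data), and the entry "p.13barG" is a hypothetical value chosen to show how a stray dot can be misread as a decimal point.

```r
# Hypothetical sample values, including a string with no digits and an NA
x <- c("3000rpm", "10%", "13barG", "Zero Start", "1.0", "p.13barG", NA)

# Drop everything that is not a digit; "" converts to NA without a warning
digits_only <- as.numeric(gsub("[^0-9]", "", x))
print(digits_only)
# 3000 10 13 NA 10 13 NA   ("1.0" loses its dot and becomes 10)

# Keep dots as well; "1.0" stays 1, but the stray dot in "p.13barG" yields 0.13
with_dots <- as.numeric(gsub("[^0-9.]", "", x))
print(with_dots)
# 3000 10 13 NA 1 0.13 NA
```

If integer columns are wanted, as the question suggests, as.integer could be used in place of as.numeric when only digits are kept.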

Or using data.table (as contributed by @DavidArenburg in the comments):

library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") := 
             lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
                    function(x) as.numeric(gsub("[^0-9]","",x)))
           ]

which gives a similar result:

> foo
                             Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1:                           Zero Start          NA      NA      NA       NA
2:                           Zero Start          NA      NA      NA       NA
3:                           Zero Start          NA      NA      NA       NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445    3000      10       13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  1.18581340    3000      10       13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  2.14475950    3000      10       13
7:                            Zero Stop          NA      NA      NA       NA
8:                           Zero Start          NA      NA      NA       NA

The advantage of this is that foo is updated by reference. Because this is faster and more memory-efficient, it is especially valuable for large datasets.
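
As a rough illustration of what "updated by reference" means, the sketch below uses a small made-up table. The names dt, alias, snapshot, doubled, and tripled are hypothetical and not from the original answer. It assumes the data.table package is installed.

```r
library(data.table)

dt <- data.table(id = 1:3)
alias <- dt                  # plain assignment: no copy, both names refer to the same table

dt[, doubled := id * 2L]     # := adds the column in place
print("doubled" %in% names(alias))    # TRUE: the alias sees the new column

snapshot <- copy(dt)         # copy() makes an independent table
dt[, tripled := id * 3L]
print("tripled" %in% names(snapshot)) # FALSE: the copy is unaffected
```

One consequence is that an explicit copy() is needed whenever the original table must be kept unchanged.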
