Create multiple columns from a single column and clean up results
Problem description
I have a data frame like this:
foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))
I want to add three columns by splitting the first column on the separator _, retaining only the numeric values obtained after the split. For rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:
> library(tidyr)
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
                            Point.Type    rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start Zero Start    <NA>     <NA>          NA
2                           Zero Start Zero Start    <NA>     <NA>          NA
3                           Zero Start Zero Start    <NA>     <NA>          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG   -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000rpm     10%   13barG    0.270126
7                            Zero Stop  Zero Stop    <NA>     <NA>          NA
8                           Zero Start Zero Start    <NA>     <NA>          NA
I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to accept them both, so I can live with that. However, I have two problems:
- The new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
- When Point.Type can't be split, rpm_nom should be NA; instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns gets the Zero Start/Zero Stop. Instead I want all three of them to be NA. How can I do that?
NOTE: I'm using tidyr, but of course you don't need to if you think there's a better way to do this.
Recommended answer
You can post-process the columns with dplyr:
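library(dplyr)
library(tidyr)  # provides separate()
foo <- foo %>%
  # split on "_" and keep only the first three pieces
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep="_", remove = FALSE, extra="drop", fill="right") %>%
  # strip every non-digit character, then convert to numeric
  # (mutate_each() and funs() are deprecated in recent dplyr releases)
  mutate_each(funs(as.numeric(gsub("[^0-9]","",.))), rpm_nom, GVF_nom, p0in_nom)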
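In recent dplyr releases, mutate_each() and funs() are deprecated. A minimal sketch of the same post-processing step with mutate() and across(), assuming dplyr 1.0.0 or later; the reduced data frame foo_small and the result name res are hypothetical and only keep the example self-contained:

```r
library(dplyr)
library(tidyr)

# Hypothetical reduced version of the question's data frame
foo_small <- data.frame(
  Point.Type  = c("Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww", "Zero Stop"),
  Point.Value = c(NA, 0.5, NA)
)

res <- foo_small %>%
  # split on "_" and keep only the first three pieces
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep = "_", remove = FALSE, extra = "drop", fill = "right") %>%
  # apply the same digit-only cleanup to the three new columns
  mutate(across(c(rpm_nom, GVF_nom, p0in_nom),
                ~ as.numeric(gsub("[^0-9]", "", .x))))

print(res)
```

The cleanup expression itself is unchanged; only the way it is applied to the three columns differs.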
The gsub("[^0-9]","",.) part removes all non-numeric characters. If you want to prevent the removal of decimal points, you can use [^0-9.] instead of [^0-9] (as @PierreLafortune did in his answer), but be aware that this also keeps dots that are not meant to be decimal points. By wrapping it in as.numeric, you convert the results to numeric values while at the same time turning the empty strings (left over from rows without digits) into NA. This gives the following result:
> foo
                            Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1                           Zero Start      NA      NA       NA          NA
2                           Zero Start      NA      NA       NA          NA
3                           Zero Start      NA      NA       NA          NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13  -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww    3000      10       13   0.9685555
7                            Zero Stop      NA      NA       NA          NA
8                           Zero Start      NA      NA       NA          NA
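The difference between the two patterns [^0-9] and [^0-9.] can be checked on a few strings. A small sketch in base R; the vector x, including the value v1.2.3, is made up for illustration:

```r
x <- c("3000rpm", "10%", "13barG", "1.0", "Zero Start", "v1.2.3")

# Keep digits only: decimal points are stripped along with the units
digits_only <- as.numeric(gsub("[^0-9]", "", x))

# Keep digits and dots: decimals survive, but a string that ends up with
# several dots no longer parses and becomes NA (as.numeric() warns about it)
digits_dots <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))

print(digits_only)
print(digits_dots)
```

With the first pattern "1.0" turns into 10, because the dot is removed before conversion; with the second it stays 1, while "v1.2.3" can no longer be converted. In both cases "Zero Start" is reduced to an empty string and ends up as NA.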
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") :=
lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
function(x) as.numeric(gsub("[^0-9]","",x)))
]
This gives a similar result:
> foo
                             Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1:                           Zero Start          NA      NA      NA       NA
2:                           Zero Start          NA      NA      NA       NA
3:                           Zero Start          NA      NA      NA       NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445    3000      10       13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  1.18581340    3000      10       13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww  2.14475950    3000      10       13
7:                            Zero Stop          NA      NA      NA       NA
8:                           Zero Start          NA      NA      NA       NA
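To see why rows without _ end up as NA in all three columns, it can help to look at what tstrsplit() returns before the cleanup. A small sketch; the vector x and the names parts and cleaned are made up for illustration:

```r
library(data.table)

x <- c("Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww")

# tstrsplit() splits each string and transposes the result into one vector
# per position, padding short rows with NA; [1:3] keeps the first three
parts <- tstrsplit(x, "_", fixed = TRUE)[1:3]
print(parts)

# The unsplit row holds "Zero Start" in position 1 and NA in positions 2 and 3;
# stripping non-digits leaves an empty string, which as.numeric() maps to NA
cleaned <- lapply(parts, function(p) as.numeric(gsub("[^0-9]", "", p)))
print(cleaned)
```

So the first new column becomes NA through the digit-stripping step, while the second and third are already NA after the split.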
The advantage of this is that foo is updated by reference. As this is faster and more memory efficient, it is especially valuable when working with large datasets.
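What "updated by reference" means in practice can be illustrated with a short sketch. The table dt and the names alias and snapshot are hypothetical; the sketch assumes the usual data.table behaviour that plain assignment does not copy a table, whereas copy() does:

```r
library(data.table)

dt <- data.table(Point.Type = c("Zero Start", "3000rpm_10%_13barG"))

alias    <- dt        # plain assignment: a second name for the same table
snapshot <- copy(dt)  # explicit copy: an independent table

# := adds the column in place, without assigning the result back to dt
dt[, rpm_nom := as.numeric(gsub("[^0-9]", "", tstrsplit(Point.Type, "_", fixed = TRUE)[[1]]))]

print(names(alias))     # includes rpm_nom, because alias and dt are the same table
print(names(snapshot))  # still only Point.Type
```

Because no copy of the table is made, the update avoids duplicating the existing columns, which is where the memory savings come from. Use copy() when an untouched version of the data is needed.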