Ordering a complex string vector in order to obtain an ordered factor
I'm working with a string vector with a structure corresponding to the one below:
messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")
I'd like to convert this vector to a factor whose levels are ordered according to the leading digit(s) of each element. The code:
messy_vec_fac <- as.factor(messy_vec)
would produce
> messy_vec_fac
[1] 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
whereas I'm interested in obtaining a factor with the following characteristics:
[1] 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150
As indicated, the order of levels corresponds to the order:
0 21 50 70 100
which are the first digits derived from the elements of the messy vector.
Side points
This is not crucial to the solution, but it would be good if the proposed solution did not assume a maximum number of digits in the first part of the vector elements. The following values may occur:
- 8787abc - 89898 deff - in this case the value 8787 should be used to assert the order
- 001 def - 1111 OHMG - in this case the value 1 should be used to assert the order
- It can be safely assumed that all vector elements contain the string [[:space:]]-[[:space:]]
- Duplicate values occur
Edits
Following the very useful suggestion by CathG, I'm trying to fit this solution into a larger dplyr pipeline:
# ... %>%
mutate(very_needed_factor= factor(messy_vec,
levels = messy_vec[
order(
as.numeric(
sub("(\\d+)[^\\d]* - .*", "\\1",
messy_vec)))]))
# %>% ...
But I keep getting the following warnings:
Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14", :
NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
If I understood correctly what you want to do, you can capture the first digits appearing in each string with sub and convert them to numeric, which can then be used to order the levels in the factor call.
num_vec <- as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
messy_vec_fac <- factor(messy_vec, levels=messy_vec[order(num_vec)])
messy_vec_fac
#[1] 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
#Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150
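The side-point cases from the question can be checked with a slightly different pattern. The following is a minimal sketch rather than the pattern used above: it assumes every element starts with the number that decides its position, and the helper name `leading_number` is hypothetical. It uses the explicit class `[0-9]` and discards everything after the leading digits, so it does not depend on the text between the number and the separator.

```r
# Hypothetical helper: extract the leading run of digits as a number.
# Assumes each element begins with at least one digit.
leading_number <- function(x) {
  as.numeric(sub("^([0-9]+).*$", "\\1", x))
}

side_cases <- c("8787abc - 89898 deff", "001 def - 1111 OHMG", "0 - 9")
leading_number(side_cases)
# values: 8787, 1, 0 ("001" is converted to 1)
```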
NB: in case of duplicated values, you can use levels=unique(messy_vec[order(num_vec)]) in the factor call.
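The note on duplicates and the two warnings quoted in the question can be put together in one sketch. It rests on an assumption: the "NAs introduced by coercion" warning comes from elements such as "12-14" that have no spaces around the hyphen. In that case the pattern does not match, `sub` returns the string unchanged, and `as.numeric` yields NA. The sample data and the helper name `order_by_leading_number` are made up for illustration.

```r
# Hypothetical helper: build a factor whose levels follow the leading number.
order_by_leading_number <- function(x) {
  # Take the leading digits only; no assumption about spaces around the hyphen.
  num <- as.numeric(sub("^([0-9]+).*$", "\\1", x))
  # unique() keeps each level once, which avoids duplicated factor levels.
  factor(x, levels = unique(x[order(num)]))
}

messy_dup <- c("12-14", "100 - 150", "0 - 9", "12-14", "21 - abc")
levels(order_by_leading_number(messy_dup))
# levels: "0 - 9", "12-14", "21 - abc", "100 - 150"
```

Inside a dplyr pipeline, a helper like this could presumably be called as `mutate(very_needed_factor = order_by_leading_number(messy_vec))`, which keeps the regular expression out of the `mutate` call.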