Ordering a complex string vector in order to obtain an ordered factor


Problem description

I'm working with a string vector with a structure corresponding to the one below:

messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")

I would like to convert this vector to a factor whose levels are ordered according to the leading digit(s). The code:

messy_vec_fac <- as.factor(messy_vec)

would produce

> messy_vec_fac
[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80

whereas I would like to obtain a factor with the following levels:

[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

As indicated, the levels follow the numeric order

0 21 50 70 100

of the leading numbers extracted from the elements of the messy vector.

Side points

This is not crucial to the solution, but ideally it would not assume a maximum number of digits in the first part of each element. Values such as the following may occur:

  • 8787abc - 89898 deff: in this case the value 8787 should be used to determine the order
  • 001 def - 1111 OHMG: in this case the value 1 should be used to determine the order
  • It can be safely assumed that every vector element contains the separator " - ", i.e. [[:space:]]-[[:space:]]
  • Duplicate values occur

Edits

Following a very useful suggestion by CathG, I'm trying to fit this solution into a longer dplyr pipeline:

# ... %>%
  mutate(very_needed_factor= factor(messy_vec,
                                      levels = messy_vec[
                                        order(
                                          as.numeric(
                                            sub("(\\d+)[^\\d]* - .*", "\\1",
                                                messy_vec)))]))
# %>% ...

But I keep getting the following warnings:

Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14",  :
  NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

Solution

If I understood correctly what you want to do, you can capture the first run of digits in each string with sub, convert the result to numeric, and then use it to order the levels in the factor call.

num_vec <- as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
messy_vec_fac <- factor(messy_vec, levels=messy_vec[order(num_vec)])

messy_vec_fac
#[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
#Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150
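
The pattern above depends on the " - " separator and on [^\\d] being read as "non-digit", which R's default regex engine may not do inside a bracket expression. An element such as "001 def - 1111 OHMG" could therefore fail to match. As a minimal sketch, assuming "the first run of digits anywhere in the string" is the intended sort key, the same idea can be written with plain [0-9]. The helper name leading_number is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): return the first run
# of digits found in each string as a number, or NA if there is none.
leading_number <- function(x) {
  # regexpr() locates the first match of [0-9]+ in each element (-1 if none).
  m <- regexpr("[0-9]+", x)
  out <- rep(NA_real_, length(x))
  # regmatches() returns only the matched substrings, in order.
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

messy_vec <- c("0 - 9", "100 - 150", "21 - abc", "50 - 56", "70abc - 80",
               "8787abc - 89898 deff", "001 def - 1111 OHMG")

num_vec <- leading_number(messy_vec)
# values: 0, 100, 21, 50, 70, 8787, 1

messy_vec_fac <- factor(messy_vec, levels = messy_vec[order(num_vec)])
levels(messy_vec_fac)
```

Because as.numeric("001") is 1, the leading zeros do not affect the ordering, and no maximum number of digits is assumed.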

NB: if the vector contains duplicated values, use levels=unique(messy_vec[order(num_vec)]) in the factor call, since factor levels must be unique.
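
The two warnings quoted in the question appear to have separate causes. The first warning message shows an element "12-14" with no spaces around the hyphen, so the " - " pattern does not match, sub returns the string unchanged, and as.numeric yields NA. The second comes from duplicated values. A minimal sketch of the dplyr step that handles both follows. The data frame df, its contents, and the intermediate column num are made up for illustration.

```r
library(dplyr)

# Illustrative data (assumed): one duplicate and one element without spaces
# around the hyphen, like the "12-14" visible in the warning message.
df <- data.frame(
  messy_vec = c("100 - 150", "0 - 9", "12-14", "0 - 9", "70abc - 80"),
  stringsAsFactors = FALSE
)

df <- df %>%
  mutate(
    # First run of digits in each string, independent of the separator.
    num = as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", messy_vec)),
    # unique() drops repeated levels while keeping the numeric order.
    very_needed_factor = factor(messy_vec,
                                levels = unique(messy_vec[order(num)]))
  )

levels(df$very_needed_factor)
# "0 - 9" "12-14" "70abc - 80" "100 - 150"
```

A string containing no digits at all would still produce NA here, so it may be worth checking anyNA(df$num) before building the factor.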
