Dplyr mutate duplicates list values when trying to index

Problem description

Let's say I start with a dataset like this (it's from Gallup). I want to pull the year and month out of the date column and into new columns. So I try to split the date string...

index   date         R  D
1   2018 Jan 2-7    35  50  
2   2017 Dec 4-11   41  45  
3   2017 Nov 2-8    39  46  
4   2017 Oct 5-11   39  46  
5   2017 Sep 6-10   45  47  
6   2017 Aug 2-6    43  46

...using mutate:

dataset <- data %>% 
      mutate(Y = strsplit(date, split = " ")[[1]][1]) %>%
      mutate(M = strsplit(date, split = " ")[[1]][2])

But strsplit, rather than operating on the date of each row, seems to operate on the whole date column at once and return a list with one entry per row.

So I end up with the [[1]] subset accessor just grabbing the values from the first row, rather than the list entry relevant to each row.

index   date         R  D    Y        M
1   2018 Jan 2-7    35  50  2018    Jan
2   2017 Dec 4-11   41  45  2018    Jan
3   2017 Nov 2-8    39  46  2018    Jan
4   2017 Oct 5-11   39  46  2018    Jan
5   2017 Sep 6-10   45  47  2018    Jan
6   2017 Aug 2-6    43  46  2018    Jan
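The behaviour can be reproduced outside of mutate, which may make it easier to see. The following is a minimal base R sketch; the three-element `date` vector is a hypothetical stand-in for the real column, not the full Gallup data.

```r
# Hypothetical stand-in for the date column (first three rows only)
date <- c("2018 Jan 2-7", "2017 Dec 4-11", "2017 Nov 2-8")

# strsplit is vectorised: it splits every string and returns a list
# holding one character vector per input string
parts <- strsplit(date, split = " ")
n_entries <- length(parts)

# [[1]] selects the list entry for the FIRST string only,
# so [[1]][1] is a single value rather than one value per row
first_year <- parts[[1]][1]
first_month <- parts[[1]][2]

print(n_entries)
print(first_year)
print(first_month)
```

Inside mutate, a single value like `first_year` is recycled to every row, which would explain why Y and M are identical in all six rows.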

How can I split the string so as to extract the value from the list for each row? Using index as a subset accessor doesn't work.

Recommended answer

I would recommend using the package stringr, which is part of the tidyverse, and thus works seamlessly with dplyr.

library(dplyr)
library(stringr)
data %>% mutate(Y = str_extract(date, "^\\d{4}"),
                M = str_extract(date, "[A-Za-z]{3}"))

#   index          date  R  D    Y   M
# 1     1  2018 Jan 2-7 35 50 2018 Jan
# 2     2 2017 Dec 4-11 41 45 2017 Dec
# 3     3  2017 Nov 2-8 39 46 2017 Nov
# 4     4 2017 Oct 5-11 39 46 2017 Oct
# 5     5 2017 Sep 6-10 45 47 2017 Sep
# 6     6  2017 Aug 2-6 43 46 2017 Aug

str_extract allows you to extract substrings based on a pattern. Here, we use two different regular expressions. The first matches 4 consecutive digits (\\d{4}) at the start of the string (^). The second simply takes the first run of 3 consecutive letters ([A-Za-z]{3}), which is safe given the structure of your dates.
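To see what each pattern does on its own, here is a small sketch that applies them to a plain character vector. The third string is a made-up, badly structured value, included only to show why the second pattern relies on the date format.

```r
library(stringr)

# The first two values follow the question's format;
# the third is a hypothetical malformed entry
dates <- c("2018 Jan 2-7", "2017 Dec 4-11", "Week of 2017")

# Four digits anchored at the start of the string;
# NA when the string does not begin with four digits
years <- str_extract(dates, "^\\d{4}")

# First run of three consecutive letters anywhere in the string
months <- str_extract(dates, "[A-Za-z]{3}")

print(years)
print(months)
```

For the malformed entry, the year comes back as NA and the "month" is just the first three letters of "Week", so the letter pattern is only reliable when every date starts with the year followed by a three-letter month.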

If you'd still like to use strsplit with mutate, however, you can add a call to rowwise:

data %>% rowwise() %>% mutate(Y = strsplit(date, split = " ")[[1]][1],
                              M = strsplit(date, split = " ")[[1]][2])
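With rowwise, each call to strsplit sees a single date, so [[1]] refers to the current row. If you would rather keep strsplit without going row by row, one possible alternative (not part of the original answer) is to split the whole column once and pick the same position out of every list entry. A sketch, assuming a small hypothetical `data` frame shaped like the one in the question:

```r
library(dplyr)

# Hypothetical reconstruction of the first rows of the question's data
data <- data.frame(
  date = c("2018 Jan 2-7", "2017 Dec 4-11", "2017 Nov 2-8"),
  R = c(35, 41, 39),
  D = c(50, 45, 46),
  stringsAsFactors = FALSE
)

# Split every date once; the result is a list with one entry per row
parts <- strsplit(data$date, split = " ")

# vapply pulls the same position out of each entry,
# returning one value per row instead of a single recycled value
result <- data %>%
  mutate(
    Y = vapply(parts, function(p) p[1], character(1)),
    M = vapply(parts, function(p) p[2], character(1))
  )

print(result)
```

Note that a data frame returned from rowwise() stays grouped by row until you call ungroup(), whereas this version never changes the grouping.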
