How to check in how many columns a character string can be found

Question

I have a dataset with 4 columns containing names, where the number of names and the order of names differ between columns. A column can also contain the same name two or more times. It looks as follows:

df<- data.frame(x1=c("Ben","Alex","Tim", "Lisa", "MJ","NA", "NA","NA","NA"), 
x2=c("Ben","Paul","Tim", "Linda", "Alex", "MJ", "Lisa", "Ken","NA"), 
x3=c("Tomas","Alex","Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"), 
x4=c("Ben","Alex","Tim", "Lisa", "MJ", "Ben", "Barbara","NA", "NA"))

First, I have to extract the unique names within the dataset. I did that using the following code:

u<- as.vector(unique(unlist(df)))

Second, I need to find the names that can be found in all 4 columns (class A names), in 3 out of 4 columns (class B names) and in 2 out of 4 columns (class C names).

Here is where I get stuck. I can only extract the names that are contained in all 4 columns using:

n <- ifelse(u %in% df$x1 & u %in% df$x2 & u %in% df$x3 &
              u %in% df$x4, "A", "B")

So, e.g., Ben would be an A class name because it can be found in all 4 columns, and Lisa would be a B class name because it can be found in only 3 out of 4 columns.

Name Class
Ben    A
Lisa   B

Is there a nicer way to classify the unique names according to the number of columns they can be found in, and how can it be done for B and C class names?

Thanks in advance!

Recommended answer

You can reshape the data to long format and, for each name, count the number of distinct columns it occurs in:

library(dplyr)

df %>%
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  group_by(value) %>%
  summarise(count = n_distinct(name))

#   value   count
#   <chr>   <int>
# 1 Alex        4
# 2 Barbara     1
# 3 Ben         4
# 4 Ken         1
# 5 Linda       2
# 6 Lisa        3
# 7 MJ          4
# 8 NA          3
# 9 Paul        2
#10 Tim         4
#11 Tomas       1
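
The question also asks for class labels. The counts can be mapped to classes in one more step. The following is only a sketch: it assumes the df from the question, the column name class is a hypothetical choice, and filtering out the "NA" string assumes it is a placeholder rather than a real name.

```r
library(dplyr)

# Same data as in the question; stringsAsFactors = FALSE keeps the
# names as character vectors on R versions before 4.0 as well
df <- data.frame(
  x1 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "NA", "NA", "NA", "NA"),
  x2 = c("Ben", "Paul", "Tim", "Linda", "Alex", "MJ", "Lisa", "Ken", "NA"),
  x3 = c("Tomas", "Alex", "Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"),
  x4 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "Ben", "Barbara", "NA", "NA"),
  stringsAsFactors = FALSE
)

classes <- df %>%
  tidyr::pivot_longer(cols = everything()) %>%
  filter(value != "NA") %>%                 # assumption: "NA" is a placeholder string
  group_by(value) %>%
  summarise(count = n_distinct(name)) %>%   # number of distinct columns per name
  mutate(class = case_when(
    count == 4 ~ "A",                       # found in all 4 columns
    count == 3 ~ "B",                       # found in 3 of 4 columns
    count == 2 ~ "C",                       # found in 2 of 4 columns
    TRUE ~ NA_character_                    # found in a single column: no class
  ))
```

With this data, Alex, Ben, MJ and Tim fall into class A, Lisa into class B, and Linda and Paul into class C. Names found in a single column get a missing class.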

Here you get "NA" in the output because it is a string. If your data has real NA values, they will be dropped because of values_drop_na = TRUE.
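
If the "NA" strings are only placeholders, one option is to convert them to real missing values before counting. The following base R sketch does this without extra packages. It is not part of the original answer, and the names counts, class_labels and result are hypothetical.

```r
# Same data as in the question
df <- data.frame(
  x1 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "NA", "NA", "NA", "NA"),
  x2 = c("Ben", "Paul", "Tim", "Linda", "Alex", "MJ", "Lisa", "Ken", "NA"),
  x3 = c("Tomas", "Alex", "Ben", "Paul", "MJ", "Tim", "Ben", "Alex", "Linda"),
  x4 = c("Ben", "Alex", "Tim", "Lisa", "MJ", "Ben", "Barbara", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Turn the "NA" strings into real missing values
df[df == "NA"] <- NA

# Keep each column's unique non-missing names, then count how many
# columns each name appears in
counts <- table(unlist(lapply(df, function(col) unique(col[!is.na(col)]))))
n <- as.integer(counts)

# Map column counts to class labels; a count of 1 has no label and yields NA
class_labels <- c("2" = "C", "3" = "B", "4" = "A")

result <- data.frame(
  Name  = names(counts),
  Count = n,
  Class = unname(class_labels[as.character(n)]),
  stringsAsFactors = FALSE
)
```

Because each column is reduced to its unique names first, a name that is repeated within one column (such as Ben in x3) is still counted once for that column.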
