Selecting rows with same result in different columns in R


Problem description


I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.

For example, row #74807 and #74809 in this case would be selected, but not row #74823 because the elasmo.name is "skate" and the tspp.name is "Northern shrimp".

I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.

> catch[4:6,]
      gear tripID obsID sortie setID       date     time NAFO    lat   long      dur depth bodymesh
74807 GRL2 G00001     A      1    13 2000-01-04 13:40:00   2H 562550 594350 2.000000   377       80
74809 GRL2 G00001     A      1    14 2000-01-04 23:30:00   2H 562550 594350 2.166667   370       80
74823 GRL2 G00001     A      1    16 2000-01-05 07:45:00   2H 561450 593050 3.000000   408       80
      codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp       tspp.name elasmo
74807         45       NA       NA      NA                   NA    NA 3.3 2211 Northern shrimp   2211
74809         45       NA       NA      NA                   NA    NA 3.2 2211 Northern shrimp   2211
74823         45       NA       NA      NA                   NA    NA 3.3 2211 Northern shrimp    211
          elasmo.name kept discard Tcatch     date.1 latitude longitude       EID
74807 Northern shrimp 2747      50   2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919     100   5019 2000-01-04 56.91667 -60.21667 G00001-14
74823          Skates    0      50     50 2000-01-05 56.73333 -60.00000 G00001-16
                                 fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid

Solution

I know what the problem is: your strings have been stored as factors. You need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, R versions before 4.0.0 store the strings as factors, and because the two columns end up with different level sets, all methods suggested above will fail (as you've discovered!)

Once you've read in the data correctly, you can use either

catch[which(catch$tspp.name == catch$elasmo.name),]

or

subset(catch, tspp.name == elasmo.name)

to obtain the matching rows. Do not omit the which in the first one; otherwise every comparison involving an NA comes back as an extra row filled with NAs.
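As a rough sketch of the two calls on something shaped like your data: the toy data frame below keeps only the two name columns, and the fourth row (99999, with a missing tspp.name) is a made-up addition so the NA case shows up. None of this is your real catch object.

```r
# Toy stand-in for `catch`; only the two name columns are kept.
# Row "99999" is hypothetical, added to show how NA is handled.
catch <- data.frame(
  tspp.name   = c("Northern shrimp", "Northern shrimp", "Northern shrimp", NA),
  elasmo.name = c("Northern shrimp", "Northern shrimp", "Skates", "Skates"),
  row.names   = c("74807", "74809", "74823", "99999"),
  stringsAsFactors = FALSE
)

# which() keeps only the positions where the comparison is TRUE
by_which <- catch[which(catch$tspp.name == catch$elasmo.name), ]

# subset() also drops rows where the condition is NA
by_subset <- subset(catch, tspp.name == elasmo.name)

print(by_which)
print(by_subset)
```

Both calls should return rows 74807 and 74809 only.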

Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.

First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...

col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
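If you would rather not create the file by hand, something like the following should write the same contents from within R. It uses a temporary file instead of "F:/test.dat"; the variable name path is just my choice for this sketch.

```r
# Write the six data lines plus header to a temporary file
# (an assumption: the original example used "F:/test.dat").
path <- tempfile(fileext = ".dat")
writeLines(
  c("col1~col2",
    "a~b",
    "a~a",
    "b~b",
    "c~NA",
    "NA~d",
    "NA~NA"),
  path
)

# Check what ended up on disk
file_lines <- readLines(path)
print(file_lines)
```

You can then pass path to read.csv in place of the file name. On R 4.0.0 or later, add stringsAsFactors=TRUE to the read.csv call if you want to reproduce the factor errors.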

Let's load it in without converting factors to strings, just to see the methods proposed above fall over:

> dat=read.csv("F:/test.dat",sep="~")  # don't forget to check the filename

> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different

> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different

> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different

This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
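To see the level mismatch without touching the disk, here is a small sketch that builds the two factors directly. The as.character() workaround at the end is my own suggestion for data that is already loaded as factors, not something from the question.

```r
# Two factors with the same values as the columns in the test file
col1 <- factor(c("a", "a", "b", "c", NA, NA))
col2 <- factor(c("b", "a", "b", NA, "d", NA))

lev1 <- levels(col1)   # "a" "b" "c"
lev2 <- levels(col2)   # "a" "b" "d"

# Comparing the factors directly raises the error shown above
res <- try(col1 == col2, silent = TRUE)
print(inherits(res, "try-error"))

# Comparing the underlying strings works
same <- as.character(col1) == as.character(col2)
matches <- which(same)
print(matches)
```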

Now let's do the same, but this time reading in the data "as is":

> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE)  # note the as.is=TRUE

> dat[which(dat$col1==dat$col2),]
  col1 col2
2    a    a
3    b    b

> dat[dat$col1==dat$col2,]
     col1 col2
2       a    a
3       b    b
NA   <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>

> subset(dat,col1==col2)
  col1 col2
2    a    a
3    b    b

As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
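If it helps to see why the second method misbehaves, the sketch below looks at the logical vector itself. The data frame is rebuilt inline so the snippet runs on its own, and the is.na() variant is just one alternative to which(), not the only option.

```r
dat <- data.frame(
  col1 = c("a", "a", "b", "c", NA, NA),
  col2 = c("b", "a", "b", NA, "d", NA),
  stringsAsFactors = FALSE
)

# Any comparison involving NA is itself NA
same <- dat$col1 == dat$col2
print(same)

# Indexing with NA produces an all-NA row for each NA in the index
raw_index <- dat[same, ]

# which() returns only the positions that are TRUE
with_which <- dat[which(same), ]

# Equivalent: explicitly treat NA as "no match"
with_isna <- dat[!is.na(same) & same, ]

print(nrow(raw_index))
print(with_which)
```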

A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:

dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
                         stringsAsFactors=FALSE)

Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly as character vectors. If you try it again with stringsAsFactors=TRUE (or, in R versions before 4.0.0, with the argument omitted), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
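If the data frame already exists and re-reading it is inconvenient, one possible approach is to convert the factor columns in place. This is a sketch of my own rather than part of the steps above; is_fac is a name made up for the example.

```r
# Deliberately create factor columns to mimic the problem
dat <- data.frame(
  col1 = c("a", "a", "b", "c", NA, NA),
  col2 = c("b", "a", "b", NA, "d", NA),
  stringsAsFactors = TRUE
)

# Find the factor columns and turn each one back into character
is_fac <- vapply(dat, is.factor, logical(1))
dat[is_fac] <- lapply(dat[is_fac], as.character)

# The comparison now works as it does for data read in "as is"
matched <- subset(dat, col1 == col2)
print(matched)
```

Non-factor columns (numbers, dates) are left untouched because only the columns flagged in is_fac are replaced.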

In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!

Hope this helps :)
