使用 R 比较 2 个不同文件中的特定列 [英] Comparing specific columns in 2 different files using R

查看:20
本文介绍了使用 R 比较 2 个不同文件中的特定列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我发现自己不得不经常这样做——比较来自 2 个不同文件的特定列.列、格式相同,但需要比较的列具有浮点/指数格式数据,例如0.0058104642437413175-3.459017050577087E-4

I find myself having to do this very often -- compare specific columns from 2 different files. The columns, formats are the same, but the columns that need comparison have floating point/exponential format data, e.g. 0.0058104642437413175, -3.459017050577087E-4, etc.

我目前正在使用以下 R 代码:

I'm currently using the below R code:

test <- read.csv("C:/VBG_TEST/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote="\"", dec=".")
prod <- read.csv("C:/VBG_PROD/testing/FILE_2010-06-16.txt", header = FALSE, sep = "|", quote="\"", dec=".")
sqldf("select sum(V10), sum(V15) from test")
sqldf("select sum(V10), sum(V15) from prod")

我读入文件,并对特定列进行求和——V10V15,然后观察这些值.这样我就可以忽略每行浮点数据中非常小的差异.

I read in the files, and sum the specific columns -- V10, V15 and then observe the values. This way I can ignore very small differences in floating point data per row.

但是,今后,我想设置一个容差百分比,即.if abs( (prod.V10 - test.V10)/prod.V10 ) >0.01%,并且只打印那些超出此容差限制的行号.

However, going forward, I would like to set a tolerance percent, ie. if abs( (prod.V10 - test.V10)/prod.V10 ) > 0.01%, and only print those row numbers that exceed this tolerance limit.

此外,如果数据的顺序不合理,我如何通过指定充当复合主键的列来进行比较?

Also, if the data is not in the sane order, how can I do a comparison by specifying columns that will act like a composite primary key?

例如,如果我在 Sybase 中这样做,我会写如下:

For e.g., if I did this in Sybase, I'd have written something like:

select A.*, B.* 
from tableA A, tableB B
where abs( (A.Col15-B.Col15)/A.Col15) ) > 0.01%
  and A.Col1 = B.Col1
  and A.Col4 = B.Col4
  and A.Col6 = B.Col6

如果我尝试在 R 中使用 sqldf 做同样的事情,它不起作用,因为文件包含 500K+ 行数据.

If I try doing the same thing using sqldf in R, it does NOT work as the files contain 500K+ rows of data.

谁能指出我如何在 R 中执行上述操作?

Can anyone point me to how I can do the above in R?

非常感谢,恰帕克斯.

推荐答案

Au,这个 sqldf 伤了我的心 -- 最好使用普通的 R 功能而不是用 SQL 折磨自己:

Au, this sqldf hurts my mind -- better use plain R capabilities than torture yourself with SQL:

which(abs(prod$V10-test$V10)/prod$V10>0.0001)

在更一般的版本中:

which(abs(prod[,colTest]-test[,colTest])/prod[,colTest]>tolerance)

其中 colTest 是要测试的列的索引,tolerance 是容差.

where colTest is an index of column that you want to test and tolerance is tolerance.

这篇关于使用 R 比较 2 个不同文件中的特定列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆