Using hash to determine whether 2 dataframes are identical (PART 01)

Views: 106 Published: 2018/6/1 19:11:04 Tags: r, hash


Question

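A few months ago I created a dataset from the WHO ATC/DDD Index. I want to check whether the online database is still unchanged today, so I downloaded it again and tried to compare the two versions with the digest package in R.

The two datasets (in txt format) can be downloaded here. (I am aware that you may consider the files unsafe, but I don't know how to generate a dummy dataset that reproduces the issue, so in the end I uploaded the real data.)

I wrote a short script, shown below: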

 library(digest)

 # Read both snapshots; keep strings as character rather than factor
 ddd.old <- read.table("ddd.table.old.txt", header = TRUE, stringsAsFactors = FALSE)
 ddd.new <- read.table("ddd.table.new.txt", header = TRUE, stringsAsFactors = FALSE)

 # Force the ddd column to character in both data frames
 ddd.old[, "ddd"] <- as.character(ddd.old[, "ddd"])
 ddd.new[, "ddd"] <- as.character(ddd.new[, "ddd"])

 # Add a per-row hash column
 ddd.old <- data.frame(ddd.old, hash = apply(ddd.old, 1, digest), stringsAsFactors = FALSE)
 ddd.new <- data.frame(ddd.new, hash = apply(ddd.new, 1, digest), stringsAsFactors = FALSE)

 # Sort each data frame by its hash column
 ddd.old <- ddd.old[order(ddd.old[, "hash"]), ]
 ddd.new <- ddd.new[order(ddd.new[, "hash"]), ]

Something really interesting happens when I do the checking:

 > table(ddd.old[, "hash"] %in% ddd.new[, "hash"]) #line01

 TRUE
  506
 > table(ddd.new[, "hash"] %in% ddd.old[, "hash"]) #line02

 TRUE
  506
 > digest(ddd.old[, "hash"]) == digest(ddd.new[, "hash"]) #line03
 [1] TRUE
 > digest(ddd.old) == digest(ddd.new) #line04
 [1] FALSE

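line01 and line02 show that every row in ddd.old can be found in ddd.new, and vice versa.
line03 shows that the hash column is the same in both data frames.

line04 shows that the two data frames are different.

What is happening? The two data frames have identical rows (from line01 and line02) in the same order (from line03), yet they are different (from line04)?

Or have I misunderstood digest? Thanks.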
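
One thing that may be worth checking for the line04 result: digest() serializes the whole object by default, so attributes such as row names are hashed along with the cell values. The sketch below uses hypothetical toy data (a, b, and the column x are not from the WHO files) to show that two data frames sorted into the same row order can still hash differently because each row keeps its original row name. The source does not confirm that this is what happened here; the answer below traces the difference to string mismatches in two columns.

```r
library(digest)

# Hypothetical toy data: same rows, stored in a different order
a <- data.frame(x = c("b", "a"), stringsAsFactors = FALSE)
b <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)

# Sorting aligns the values but keeps each row's original row name
a <- a[order(a$x), , drop = FALSE]
b <- b[order(b$x), , drop = FALSE]

same_values <- digest(a$x) == digest(b$x)  # column contents only
same_object <- digest(a) == digest(b)      # values plus attributes
row_names <- list(a = rownames(a), b = rownames(b))

print(list(same_values = same_values,
           same_object = same_object,
           row_names = row_names))
```

If row names turn out to be the only difference, resetting them with rownames(a) <- NULL on both data frames before hashing is one way to take them out of the comparison.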
Solution
Read in the data as before.
 ddd.old <- read.table("ddd.table.old.txt", header = TRUE, stringsAsFactors = FALSE)
 ddd.new <- read.table("ddd.table.new.txt", header = TRUE, stringsAsFactors = FALSE)
 ddd.old[, "ddd"] <- as.character(ddd.old[, "ddd"])
 ddd.new[, "ddd"] <- as.character(ddd.new[, "ddd"])
As Marek said, start by checking for differences with all.equal.
 all.equal(ddd.old, ddd.new)
 [1] "Component 6: 4 string mismatches"
 [2] "Component 8: 24 string mismatches"
So we just need to look at columns 6 and 8.
 different.old <- ddd.old[, c(6, 8)]
 different.new <- ddd.new[, c(6, 8)]
Hash these columns.
 hash.old <- apply(different.old, 1, digest)
 hash.new <- apply(different.new, 1, digest)
And find the rows where they don't match.
 different_rows <- which(hash.old != hash.new) # which() is optional
Finally, combine the datasets.
 cbind(different.old[different_rows, ], different.new[different_rows, ])
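
The steps above can be tried end to end without the original files. This is a minimal sketch on hypothetical toy data (old, new, and the columns code and name are made up for illustration). It also swaps one step: instead of reading the column numbers off the all.equal() messages, it finds the differing columns with mapply(identical, ...), which is an alternative and not the answer's own method.

```r
library(digest)

# Hypothetical toy data standing in for the two downloads (not the WHO files)
old <- data.frame(code = c("A01", "A02", "A03"),
                  name = c("alpha", "beta", "gamma"),
                  stringsAsFactors = FALSE)
new <- old
new$name[2] <- "BETA"  # introduce one changed cell

# Step 1: find the columns that differ, column by column
changed_cols <- names(old)[!mapply(identical, old, new)]

# Step 2: hash only those columns row by row and compare by position
hash.old <- apply(old[, changed_cols, drop = FALSE], 1, digest)
hash.new <- apply(new[, changed_cols, drop = FALSE], 1, digest)
different_rows <- which(hash.old != hash.new)

# Step 3: show old and new values side by side for the mismatching rows
side_by_side <- cbind(old[different_rows, changed_cols, drop = FALSE],
                      new[different_rows, changed_cols, drop = FALSE])
print(side_by_side)
```

Note that this comparison is positional: it assumes both data frames hold their rows in the same order. drop = FALSE keeps a single selected column as a data frame so that apply() still sees rows.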


                    