R: How to locate and compare a particular element between two CSV text files
Problem description
I found some similar questions, such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric), and this one (about getting the number of columns that differ between two files), but nothing that addresses this particular problem.
I have two CSV text files in which many, but not all, rows are equal. The files have the same number of columns, with the same data types in those columns, but they do not have the same number of rows. Each file has around 120K rows, and each contains some rows that are not in the other.
Simplified versions of these files would look as shown below.
File 1:
PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451
File 2:
PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012
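To experiment with the simplified data without creating files on disk, the two tables can be read from inline text. This is only a minimal sketch: the object names file1, file2, and common are assumptions chosen to match the code used elsewhere in this post, and the real files would be read from their paths instead.

```r
# Sample data from the question, read from inline text instead of files.
# colClasses = "character" keeps IDs such as "R719" and "2265" as strings.
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# USERID values that appear in both files (hypothetical helper variable)
common <- intersect(file1$USERID, file2$USERID)
```

Reading every column as character avoids mixed numeric and alphanumeric IDs being converted inconsistently between the two files.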
In the actual files, many of the USERID values in the first file can also be found in the second file (though some may be absent). Also, while the USERID values are unchanged for all users, their PROFILE.ID may have changed.
The problem is that I need to locate the rows where the PROFILE.ID has changed.
I am thinking that I would have to use the following sequence of steps to analyze it in R:
- Load both files into RStudio as data frames
- Loop through the USERID column of the first file (which has more rows)
- Search the second file for each USERID found in the first file
- Return the corresponding PROFILE.ID from the second file
- Compare the returned value with the value in the first file
- Output the rows where the PROFILE.ID values differ
I was thinking of writing something like the code shown below, but I am not sure whether there are better ways to accomplish this.
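library(tidyverse)

# Open both files and read every column as character so IDs such as "R719" stay intact
con1 <- file("file1.csv", open = "r")
con2 <- file("file2.csv", open = "r")
file1 <- read.csv(con1, fill = FALSE, colClasses = "character")
file2 <- read.csv(con2, fill = FALSE, colClasses = "character")
close(con1)
close(con2)

# Start with an empty data frame that has the same columns as file2
output <- file2[0, ]
for (i in seq_len(nrow(file1))) {
  profIDFile1 <- file1$PROFILE.ID[i]
  userIDFile1 <- file1$USERID[i]
  profIDRowFile2 <- filter(file2, USERID == userIDFile1)
  # Skip users that are absent from file2
  if (nrow(profIDRowFile2) == 0) next
  # Keep the file2 rows whose PROFILE.ID differs from the one in file1
  changed <- profIDRowFile2$PROFILE.ID != profIDFile1
  output <- rbind(output, profIDRowFile2[changed, ])
}

write.csv(output, file = "result.csv", row.names = FALSE, quote = FALSE)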
Question: Is there a package in R that can do this kind of comparison, or what would be a good way to accomplish this in an R script?
Recommended answer
I think you can do this with a simple join:
library(dplyr)
full_join(file1, file2, by = "USERID") %>%
  filter(PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
This shows that those three USERID rows have differing PROFILE.ID fields. (The .x columns are from file1, the .y columns from file2.)
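If the default .x and .y suffixes are hard to read, dplyr joins accept a suffix argument. The sketch below is an illustration under assumptions rather than part of the original answer: it swaps in inner_join instead of full_join so that only users present in both files are compared, and the suffixes ".file1" and ".file2" and the name changed are arbitrary choices. The sample data is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Sample data from the question, all columns read as character
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# Join on USERID, label each file's columns explicitly, and keep only
# the users whose PROFILE.ID differs between the two files
changed <- inner_join(file1, file2, by = "USERID",
                      suffix = c(".file1", ".file2")) %>%
  filter(PROFILE.ID.file1 != PROFILE.ID.file2) %>%
  select(USERID, PROFILE.ID.file1, PROFILE.ID.file2)

print(changed)
```

The resulting data frame could then be passed to write.csv(), as in the question, to save the changed rows.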
That test does not deal very well with IDs that are missing from one of the files: comparing against NA yields NA, and filter() drops those rows. So you might add logic such as:
full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
           PROFILE.ID.x != PROFILE.ID.y)
# PROFILE.ID.x CITY.x STATE.x USERID PROFILE.ID.y CITY.y STATE.y
# 1 4350 Nashville Tennessee GW7420 1173 Nashville Tennessee
# 2 5486 Durango Colorado BH9012 T216 Durango Colorado
# 3 R719 Flagstaff Arizona YT7460 R540 Flagstaff Arizona
# 4 Z551 Flagstaff Arizona ML1451 <NA> <NA> <NA>
The fourth row indicates an ID missing in file2. Here this is likely an artifact of the small sample dataset (which is good on SO :-), and I'm not certain whether it is interesting or meaningful to you.
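If the missing users do matter, one way to list them separately is dplyr's anti_join, which returns the rows of its first argument that have no match in the second. This is a hedged sketch, not part of the original answer; the names only_in_file1 and only_in_file2 are hypothetical, and the sample data is rebuilt inline so the block runs on its own.

```r
library(dplyr)

# Sample data from the question, all columns read as character
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# Rows whose USERID appears in one file but not in the other
only_in_file1 <- anti_join(file1, file2, by = "USERID")
only_in_file2 <- anti_join(file2, file1, by = "USERID")

print(only_in_file1)
print(only_in_file2)
```

Keeping the missing users in their own data frames leaves the changed-ID comparison free of NA handling.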