R: How to locate and compare a particular element between two CSV text files

Problem description

I found some similar questions such as this one (about comparing attributes in XML files), this one (about a case where the compared values are numeric) and this one (about getting a number of columns that differ between two files) but nothing about this particular problem.

I have two CSV text files in which many, but not all, rows are equal. The files have the same number of columns, with the same data types, but not the same number of rows. Each file has around 120K rows, and each contains some rows that are not in the other.

Simplified versions of these files would look as shown below.

File 1:

PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451

File 2:

PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012
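
To experiment with these samples without creating files on disk, the two listings can be read straight from strings. This is only a sketch: the object names file1 and file2 are chosen to match the code further down, and colClasses = "character" is assumed so that IDs such as 2265 and R719 end up with the same type.

```r
# Sample rows copied from the File 1 listing above
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

# Sample rows copied from the File 2 listing above
file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# Both data frames share the same four character columns
str(file1)
str(file2)
```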

In the actual files many of the USERID values in the first file can also be found in the second file (some may not be present however). Also while the USERID values are unchanged for all users, their PROFILE.ID may have changed.

The problem is that I would have to locate the rows where the PROFILE.ID has changed.

I am thinking that I would have to use the following sequence of steps to analyze it in R:

  1. Load both files to R Studio as data frames
  2. Loop through the USERID column on the first file (which has more rows)
  3. Search the second file for each USERID found in the first file
  4. Return the corresponding PROFILE.ID from second file
  5. Compare the returned value with what is in the first file
  6. Output the rows where the PROFILE.ID values differ
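
The steps above can also be sketched without an explicit loop, using base R's match(). This is an illustration only: it assumes each USERID appears at most once in the second file (match() returns the first hit), the toy data frames keep only the two columns the steps use, and the column name NEW.PROFILE.ID is made up for the example.

```r
# Toy data: PROFILE.ID and USERID values taken from the sample files
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  USERID     = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  USERID     = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# Steps 2-4: for every USERID in file1, look up its row in file2 (NA if absent)
idx <- match(file1$USERID, file2$USERID)
profile_in_file2 <- file2$PROFILE.ID[idx]

# Step 5: TRUE only where the USERID exists in file2 and the PROFILE.ID differs
changed <- !is.na(profile_in_file2) & file1$PROFILE.ID != profile_in_file2

# Step 6: rows from file1 whose PROFILE.ID changed, with the new value alongside
result <- file1[changed, ]
result$NEW.PROFILE.ID <- profile_in_file2[changed]
print(result)
```

Users that exist in only one of the files are deliberately left out here; they would need a separate check.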

I was thinking of writing something like the code shown below but am not sure if there are better ways to accomplish this.

library(tidyverse)

con1  <- file("file1.csv", open = "r")
con2  <- file("file2.csv", open = "r")

file1 <- read.csv(con1, fill = F, colClasses = "character")
file2 <- read.csv(con2, fill = F, colClasses = "character")

for (i in seq(nrow(file1))) {
   profIDFile1 <- file1$PROFILE.ID[i]
   userIDFile1 <- file1$USERID[i]

   profIDRowFile2 <- filter(file2, USERID == userIDFile1)
   profIDFile2 <- profIDRowFile2$PROFILE.ID

   if (profIDFile1 != profIDFile2) {
     output <- profIDRowFile2
   }

}

write.csv(output, file='result.csv', row.names=FALSE, quote=FALSE)

close(con1)
close(con2)

Question: Is there a package in R that can do this kind of comparison or what would be a good way to accomplish this in R script?

Recommended answer

I think you can do this with a simple join:

library(dplyr)
full_join(file1, file2, by = "USERID") %>%
  filter(PROFILE.ID.x != PROFILE.ID.y)
#   PROFILE.ID.x    CITY.x   STATE.x USERID PROFILE.ID.y    CITY.y   STATE.y
# 1         4350 Nashville Tennessee GW7420         1173 Nashville Tennessee
# 2         5486   Durango  Colorado BH9012         T216   Durango  Colorado
# 3         R719 Flagstaff   Arizona YT7460         R540 Flagstaff   Arizona

This shows that those three USERID rows have differing PROFILE.ID fields. (The .x columns are from file1, the .y columns from file2.)
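
If dplyr is not available, roughly the same comparison can be sketched with base R's merge(). This is an alternative to the join above, not the same code: all = TRUE requests a full join, the .x/.y suffixes are merge()'s defaults, and the result comes back sorted by USERID rather than in file order. The sample data is rebuilt inline so the block runs on its own.

```r
file1 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
2265,Miami,Florida,EL4950
4350,Nashville,Tennessee,GW7420
5486,Durango,Colorado,BH9012
R719,Flagstaff,Arizona,YT7460
Z551,Flagstaff,Arizona,ML1451", colClasses = "character")

file2 <- read.csv(text = "PROFILE.ID,CITY,STATE,USERID
1173,Nashville,Tennessee,GW7420
2265,Miami,Florida,EL4950
R540,Flagstaff,Arizona,YT7460
T216,Durango,Colorado,BH9012", colClasses = "character")

# Full join on USERID; non-key columns get the .x (file1) and .y (file2) suffixes
joined <- merge(file1, file2, by = "USERID", all = TRUE)

# which() drops rows where the comparison is NA (a USERID present in one file only)
differing <- joined[which(joined$PROFILE.ID.x != joined$PROFILE.ID.y), ]
print(differing)
```

The which() call matters in base R: indexing with a logical vector that contains NA would otherwise produce rows filled with NA. dplyr's filter() drops such rows by itself, which is why the unmatched user does not appear in the output above.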

That test does not deal well with USERIDs that are missing from one of the files, so you might add logic such as:

full_join(file1, file2, by = "USERID") %>%
  filter(is.na(PROFILE.ID.x) | is.na(PROFILE.ID.y) |
           PROFILE.ID.x != PROFILE.ID.y)
#   PROFILE.ID.x    CITY.x   STATE.x USERID PROFILE.ID.y    CITY.y   STATE.y
# 1         4350 Nashville Tennessee GW7420         1173 Nashville Tennessee
# 2         5486   Durango  Colorado BH9012         T216   Durango  Colorado
# 3         R719 Flagstaff   Arizona YT7460         R540 Flagstaff   Arizona
# 4         Z551 Flagstaff   Arizona ML1451         <NA>      <NA>      <NA>

The fourth row indicates a USERID that is missing from file2. Here this is likely an artifact of the small sample dataset (which is good on SO :-); I'm not certain whether it is interesting or meaningful to you.
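
If users that exist in only one file should be reported separately from users whose PROFILE.ID changed, one possible sketch in base R uses %in%. The names only_in_file1 and only_in_file2 are made up for this example, and the data frames are cut down to the two relevant columns.

```r
# Toy data: PROFILE.ID and USERID values taken from the sample files
file1 <- data.frame(
  PROFILE.ID = c("2265", "4350", "5486", "R719", "Z551"),
  USERID     = c("EL4950", "GW7420", "BH9012", "YT7460", "ML1451"),
  stringsAsFactors = FALSE
)
file2 <- data.frame(
  PROFILE.ID = c("1173", "2265", "R540", "T216"),
  USERID     = c("GW7420", "EL4950", "YT7460", "BH9012"),
  stringsAsFactors = FALSE
)

# Rows whose USERID has no counterpart in the other file
only_in_file1 <- file1[!(file1$USERID %in% file2$USERID), ]
only_in_file2 <- file2[!(file2$USERID %in% file1$USERID), ]

print(only_in_file1)  # the ML1451 row from the sample data
print(only_in_file2)  # empty for the sample data
```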
