Linux Bash commands to remove duplicates from a CSV file


Problem Description

I will be combining a number of CSV files. What I am trying to do is to:

1) Remove duplicate rows from the file; however, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?

2) It would be nice to then create a 2nd output file to see what was removed in case something was removed that was not supposed to be removed.

3) Create a list of items as an input file to run against: if a row contains a given word in a particular column, then remove the entire row.

If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.

Here is a sample of what the data looks like (here is an example as suggested):

I have a csv file like this:

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc..

I want to remove duplicates from this file, to get (for question 1) only:

column1    column2

john       kerry
adam       stephenson
ashley     hudson

For question 3, I want to take the 2nd list...meaning the output of the 1st list, and scrub this further. I want a file like input.txt that contains:

adam

Then, the final output would be:

column1    column2

john       kerry
ashley     hudson

So, the input.txt file in the example contains the word adam (this way I can make a long list of words to check in the input.txt file). For #3, I need a code snippet that will check column 1 of every line of the CSV against all the words in the input file, then remove any matching rows from the csv.

Recommended Answer

You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work.

awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

And with some whitespace added for clarity:

awk 'seen[$0]++ {               # true when this exact line has been seen before
  print $0 > "dups.csv"; next   # duplicate: log it and skip the block below
}
{
  print $0 > "new.csv"          # first occurrence: keep it
}' mycsv

This will not print anything to STDOUT but will create two files. dups.csv will contain all the duplicates that were removed (that is, if there are 5 entries of the same line, this file will contain the 4 entries that were removed as dups) and new.csv will contain all the unique rows.
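
For example, assuming mycsv holds just the sample rows from the question (the header line plus the four whitespace-separated data rows), a run would leave the two files looking like this:

$ awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
$ cat new.csv
column1    column2
john       kerry
adam       stephenson
ashley     hudson
$ cat dups.csv
john       kerry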

seen[$0]++ is a test we do for each line. If the line is present in our array it will be written to the dups.csv file and we will move to the next line using next. If the line is not present we will add that line to the array and write it to the new.csv file.
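
A minimal sketch of why the post-increment does the right thing: the first time a line appears, seen[$0] is 0 (false) when tested and only incremented afterwards, so the row falls through to the second block; every later occurrence tests as 1 or more (true):

$ printf 'a\na\nb\n' | awk 'seen[$0]++{print "dup:", $0; next}{print "new:", $0}'
new: a
dup: a
new: b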

Use of $0 means the entire line. If you want to key on fewer columns, you can do so; you just need to set the input field separator based on the delimiter. You have mentioned csv but I don't see any comma delimiters, so I am using the default separator, which is [[:space:]]+.
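
For the whitespace-separated sample above, keying on both columns rather than the whole line would look like the sketch below; awk's comma inside an array subscript joins $1 and $2 with SUBSEP, so the pair of fields acts as a single key (this also ignores any differences in spacing between the columns):

awk 'seen[$1,$2]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv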

Follow-up from the asker: Also, it is comma delimited; I was just providing sample data. So if I wanted to use the example above, but only wanted to test columns 3 and 4 (using the seen command), how would I do that in a file that is comma delimited?

For true csv, just set the field separator to ,. seen is not a command; it is a hash that retains the columns as keys. So you would modify the above command to:

awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

UPDATE:

Once you have a list without dups using the commands stated above, we are left with:

$ cat new.csv 
john,kerry
adam,stephenson
ashley,hudson

$ cat remove.txt 
adam

$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv 
john,kerry
ashley,hudson
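
The final one-liner is terse, so here is the same logic spelled out with comments; a sketch of the standard awk two-file idiom, where NR==FNR is true only while the first file is being read:

awk -F, '
NR == FNR {        # only true for the first file, remove.txt
  remove[$1]++     # remember each word that should be filtered out
  next             # skip the filter below while reading remove.txt
}
!($1 in remove)    # for new.csv: print rows whose column 1 is not listed
' remove.txt new.csv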

