Linux Bash commands to remove duplicates from a CSV file
Question
I will be combining a number of CSV files. What I am trying to do is to:
1) Remove duplicate rows from the file; however, I need to check multiple columns as the criteria for what constitutes a duplicate. How do I do that?
2) It would be nice to then create a 2nd output file to see what was removed, in case something was removed that was not supposed to be.
3) Create a list of items as an input file to run against: if a row contains one of these words in a particular column, remove the entire row.
If someone could help me with the commands to do this, that would be great! Please let me know if I need to clarify.
Here is a sample of what the data looks like (example added as suggested):
I have a csv file like this:
column1 column2
john kerry
adam stephenson
ashley hudson
john kerry
etc.
I want to remove duplicates from this file; for question 1, the desired output is:
column1 column2
john kerry
adam stephenson
ashley hudson
For question 3, I want to take the 2nd list, meaning the output of the 1st step, and scrub it further. I want a file like input.txt that contains:
adam
Then the final output would be:
column1 column2
john kerry
ashley hudson
So, the input.txt file in the example contains the word adam (this way I can make a long list of words to check in the input.txt file). For #3, I need a code snippet that will check column 1 of every line of the CSV against all the words in the input file, then remove any matches from the csv.
Answer
You need to provide more details for question 3, but for questions 1 and 2 the following awk one-liner will work.
awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
And with some whitespace added for clarity:
awk 'seen[$0]++ {
print $0 > "dups.csv"; next
}
{
print $0 > "new.csv"
}' mycsv
This will not print anything to STDOUT but will create two files: dups.csv will contain all the duplicates that were removed (that is, if there are 5 entries of the same line, this file will contain the 4 entries that were removed as dups) and new.csv will contain all unique rows.
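For example, running the one-liner on the sample data from the question (a minimal sketch; the file name mycsv is illustrative):

```shell
# Build the sample file from the question (whitespace-delimited here).
printf 'column1 column2\njohn kerry\nadam stephenson\nashley hudson\njohn kerry\n' > mycsv

# Split unique rows and duplicate rows into two files.
awk 'seen[$0]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv

cat new.csv    # header plus the three unique rows
cat dups.csv   # the one repeated "john kerry" row
```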
seen[$0]++ is a test we run for each line. If the line is already present in our array, the line is written to the dups.csv file and we move to the next line using next. If the line is not present, we add it to the array and write it to the new.csv file.
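The idiom can be seen in isolation: seen[$0]++ evaluates to the number of times the line has already occurred, so it is false (0) on the first occurrence and true on every repeat:

```shell
# Prints each line only from its second occurrence onward: a, a
printf 'a\nb\na\na\n' | awk 'seen[$0]++'

# Negated, it prints only first occurrences (the classic dedup idiom): a, b
printf 'a\nb\na\na\n' | awk '!seen[$0]++'
```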
Use of $0 means the entire line. If you want to test fewer columns, you can do so; you just need to set the input field separator based on the delimiter. You have mentioned csv, but I don't see any comma delimiters, so I am using the default separator, which is [[:space:]]+.
Also, it is comma delimited; I was just providing sample data. So, if I wanted to use the example above but only wanted to test columns 3 and 4 (with the seen command), how would I do that in a comma-delimited file?
For a true csv, just set the field separator to ,. seen is not a command; it is a hash that retains the columns as keys. So you would modify the above command to:
awk -F, 'seen[$3,$4]++{print $0 > "dups.csv"; next}{print $0 > "new.csv"}' mycsv
Update:
Once you have a list without dups using the commands stated above, we are left with:
$ cat new.csv
john,kerry
adam,stephenson
ashley,hudson
$ cat remove.txt
adam
$ awk -F, 'NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt new.csv
john,kerry
ashley,hudson
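To spell out how that last command works: NR==FNR is true only while awk is reading the first file (remove.txt), so that block loads each word into the remove hash and skips to the next line; while reading the second file (new.csv), only rows whose first column is not a key in the hash are printed. The same command, commented (a sketch reusing the example files):

```shell
# Recreate the example inputs.
printf 'john,kerry\nadam,stephenson\nashley,hudson\n' > new.csv
printf 'adam\n' > remove.txt

awk -F, '
    NR==FNR { remove[$1]++; next }  # first file: store each word as a hash key
    !($1 in remove)                 # second file: print rows whose column 1 is not a key
' remove.txt new.csv
# Output: john,kerry and ashley,hudson
```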