Script to find duplicates in a csv file


Question

I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields. [Item#, UPC, Desc, etc.]

How do I:

a) Find and print duplicate rows. [This file is a large appended file, so it contains multiple headers which I need to remove, so I wanted to know which exact rows are duplicates first.]

b) Find and print duplicate rows based on a column. [See if a UPC is assigned to multiple products.]

I need to run the command or script on the server, and I have Perl and Python installed. Even a bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks

See: Remove duplicate rows from a large file in Python (http://stackoverflow.com/questions/3452832/remove-duplicate-rows-from-a-large-file-in-python) over on Stack Overflow

Answer

Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that are adjacent to each other, which is why the file has to be sorted before it is piped to uniq.
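Once the duplicate lines have been identified, running sort -u largefile.csv keeps only a single copy of each line, which also collapses the repeated header rows; the question says row order does not need to be preserved, so re-sorting is not a problem.

For part b), finding rows that share a value in one column such as the UPC, a short Python sketch along these lines could work. The file name largefile.csv and the UPC column position are assumptions here; adjust them to match the real layout.

import csv
from collections import defaultdict

# Assumed file name and column position; adjust to the actual layout.
FILENAME = "largefile.csv"
UPC_COLUMN = 1  # e.g. [Item#, UPC, Desc, ...] -> UPC is the second field

rows_by_upc = defaultdict(list)

with open(FILENAME, newline="") as f:
    for row in csv.reader(f):
        # Skip blank or short rows that have no UPC field.
        if len(row) <= UPC_COLUMN:
            continue
        rows_by_upc[row[UPC_COLUMN]].append(row)

# Print every UPC value that appears on more than one row, with the rows involved.
for upc, rows in rows_by_upc.items():
    if len(rows) > 1:
        print("UPC %s appears %d times:" % (upc, len(rows)))
        for row in rows:
            print("  ", row)

Repeated header lines will group together as well, since they share the same text in the UPC column, so this also helps locate the extra headers mentioned in part a).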
