Need to compare very large files (around 1.5GB) in Python


Problem description

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"

Above is the sample data. The data is sorted by email address, and the file is very large, around 1.5GB.

I want the output in another CSV file, something like this:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days


That is, if an entry occurs for the first time I need to append 1; if it occurs a second time I need to append 2, and so on. In other words, I need to count the number of occurrences of each email address in the file, and when an email appears twice or more I also want the difference between the dates. The dates are not sorted, so they must be sorted per email address as well. I am looking for a Python solution using numpy, pandas, or any other library that can handle data this size without running out of memory. I have a dual-core processor running CentOS 6.3 with 4GB of RAM.
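Since the question asks for a pandas-based approach, here is a minimal sketch of the desired transformation on a small in-memory sample. The column names are invented for illustration, and a real 1.5GB file would need chunked or out-of-core processing rather than a single `read_csv`:

```python
import io
import pandas as pd

# Three rows for one email address, dates deliberately out of order
sample = (
    '"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"\n'
    '"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"\n'
    '"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"\n'
)

cols = ["mode", "email", "ref", "date", "channel", "amount"]  # invented names
df = pd.read_csv(io.StringIO(sample), header=None, names=cols)
df["date"] = pd.to_datetime(df["date"], format="%d%b%Y")
df = df.sort_values(["email", "date"])  # sort dates within each email address

df["count"] = df.groupby("email").cumcount() + 1  # occurrence number per email
df["days"] = df.groupby("email")["date"].diff().dt.days.fillna(0).astype(int)
```

For the HARISH rows above this yields counts 1, 2, 3 and gaps 0, 3, 10 days, matching the expected output.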

Recommended answer

Another possibility:

Update 20/04: added more code and a simplified approach:


  1. Convert the timestamp to seconds since the Epoch and sort with UNIX sort, using the email and this new field (that is: sort -k2 -k7 -n -t, < converted_input_file > output_file)
  2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT
  3. Iterate over each line; when a new email is encountered, append "1,0 days" and update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email
  4. Next line: 3 possible scenarios
    • a) Same email, different timestamp: calculate the days, increment COUNT, update PREV_TIME, append "COUNT,Difference_in_days"
    • b) Same email, same timestamp: increment COUNT, append "COUNT,0 days"
    • c) New email: start again from step 3.

An alternative to step 1 is to add a new TIMESTAMP field and remove it when printing out the line.
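Step 1 can also be sketched in Python instead of gawk. This is a minimal sketch; it assumes the date sits in column 4 in DDMONYYYY form and interprets it as UTC midnight, which matches the epoch values in the sample output:

```python
import csv
import io
from datetime import datetime, timezone

def add_epoch(infile, outfile):
    """Copy CSV rows, appending seconds since the Epoch for the date in column 4."""
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in csv.reader(infile):
        # "26JUL2010" -> datetime; %b matches month abbreviations case-insensitively
        dt = datetime.strptime(row[3], "%d%b%Y").replace(tzinfo=timezone.utc)
        writer.writerow(row + [str(int(dt.timestamp()))])

src = io.StringIO('"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"\n')
dst = io.StringIO()
add_epoch(src, dst)
```

Streaming row by row like this keeps memory flat regardless of file size.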

Note: if 1.5GB is too large to sort in one go, split it into smaller chunks, using the email as the split point. You can run these chunks in parallel on different machines.
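One hedged way to realize the split (file names are assumed): a plain line-count split works even without cutting exactly at email boundaries, because sort -m merges the sorted chunks back into one globally sorted stream:

```shell
# Tiny stand-in for converted_input_file (epoch field unquoted, as gawk emits it)
printf '%s\n' \
  '"DF","B@X.COM","R2","22DEC2010","B2C","9",1292976000' \
  '"DF","A@X.COM","R1","26JUL2010","B2C","5",1280102400' \
  '"DF","B@X.COM","R3","09DEC2010","B2C","7",1291852800' > converted_input_file

split -l 2 converted_input_file chunk_           # split into fixed-size chunks
for f in chunk_??; do                            # sort each chunk separately
  sort -t, -k2,2 -k7,7n "$f" > "$f.sorted"       # (these can run in parallel)
done
sort -t, -k2,2 -k7,7n -m chunk_*.sorted > output_file.txt   # merge sorted chunks
```

The per-chunk sorts are the part worth farming out to other machines; the final merge is cheap and sequential.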

/usr/bin/gawk -F'","' '{
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]] = i;
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -k2 -k7 -n -t, > output_file.txt

output_file.txt:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...

Pipe the output to a Perl, Python, or AWK script to process steps 2 through 4.
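As a sketch of that script in Python (assuming the field layout produced above: email in column 2, epoch seconds in column 7), steps 2 through 4 might look like this:

```python
import csv
import io

def annotate(infile, outfile):
    """Stream sorted rows, appending occurrence count and day gap per email."""
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    prev_email, prev_ts, count = None, None, 0
    for row in csv.reader(infile):
        email, ts = row[1], int(row[6])
        if email != prev_email:                      # step 3: new email
            count, days = 1, 0
        elif ts == prev_ts:                          # step 4b: same timestamp
            count, days = count + 1, 0
        else:                                        # step 4a: new timestamp
            count, days = count + 1, (ts - prev_ts) // 86400
        prev_email, prev_ts = email, ts
        # Drop the helper timestamp field, append count and day difference
        writer.writerow(row[:6] + [str(count), "%d days" % days])

rows = (
    '"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800\n'
    '"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000\n'
    '"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000\n'
)
dst = io.StringIO()
annotate(io.StringIO(rows), dst)
```

Because it only ever holds one row plus three state variables, it handles a 1.5GB input in constant memory.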

