Want to print the maximum entry of every email against it in another column
Problem description
I have a huge file, around 2 GB, with more than 20 million rows.
What I want is this.
The input file will look like this:
07.SHEKHAR@GMAIL.COM,1
07SHIBAJI@GMAIL.COM,1
07.SHINDE@GMAIL.COM,1
07.SHINDE@GMAIL.COM,2
07.SHINDE@GMAIL.COM,3
07.SHINDE@GMAIL.COM,4
07.SHINDE@GMAIL.COM,5
07.SHINDE@GMAIL.COM,6
07.SHINDE@GMAIL.COM,7
07.SHOBHIT@GMAIL.COM,1
07SKERCH@RUSKIN.AC.UK,1
07SONIA@GMAIL.COM,1
07SONIA@GMAIL.COM,2
07SONIA@GMAIL.COM,3
07SRAM@GMAIL.COM,1
07SRAM@GMAIL.COM,2
07.SUMANTA@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,2
07SUPRIYO@GMAIL.COM,3
07.SUSHMA@GMAIL.COM,1
07.SWETA@GMAIL.COM,1
07.SWETA@GMAIL.COM,2
07.SWETA@GMAIL.COM,3
07.TEENA@GMAIL.COM,1
07.TEENA@GMAIL.COM,2
07.UDAY@GMAIL.COM,1
07.UMESH@GMAIL.COM,1
07VAISHALISINGH@GMAIL.COM,1
07.VISHAL@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,2
07.VISHAL@GMAIL.COM,3
07.VISHAL@GMAIL.COM,4
07.VISHAL@GMAIL.COM,5
07.VISHAL@GMAIL.COM,6
07.VISHAL@GMAIL.COM,7
07.YASH@GMAIL.COM,1
07.YASH@GMAIL.COM,2
07.YASH@GMAIL.COM,3
07.YASH@GMAIL.COM,4
Output file needed:
07.SHEKHAR@GMAIL.COM,1,1
07SHIBAJI@GMAIL.COM,1,1
07.SHINDE@GMAIL.COM,1,7
07.SHINDE@GMAIL.COM,2,7
07.SHINDE@GMAIL.COM,3,7
07.SHINDE@GMAIL.COM,4,7
07.SHINDE@GMAIL.COM,5,7
07.SHINDE@GMAIL.COM,6,7
07.SHINDE@GMAIL.COM,7,7
07.SHOBHIT@GMAIL.COM,1,1
07SKERCH@RUSKIN.AC.UK,1,1
07SONIA@GMAIL.COM,1,3
07SONIA@GMAIL.COM,2,3
07SONIA@GMAIL.COM,3,3
07SRAM@GMAIL.COM,1,2
07SRAM@GMAIL.COM,2,2
07.SUMANTA@GMAIL.COM,1,1
07SUPRIYO@GMAIL.COM,1,3
07SUPRIYO@GMAIL.COM,2,3
07SUPRIYO@GMAIL.COM,3,3
07.SUSHMA@GMAIL.COM,1,1
07.SWETA@GMAIL.COM,1,3
07.SWETA@GMAIL.COM,2,3
07.SWETA@GMAIL.COM,3,3
07.TEENA@GMAIL.COM,1,2
07.TEENA@GMAIL.COM,2,2
07.UDAY@GMAIL.COM,1,1
07.UMESH@GMAIL.COM,1,1
07VAISHALISINGH@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,1,7
07.VISHAL@GMAIL.COM,2,7
07.VISHAL@GMAIL.COM,3,7
07.VISHAL@GMAIL.COM,4,7
07.VISHAL@GMAIL.COM,5,7
07.VISHAL@GMAIL.COM,6,7
07.VISHAL@GMAIL.COM,7,7
07.YASH@GMAIL.COM,1,4
07.YASH@GMAIL.COM,2,4
07.YASH@GMAIL.COM,3,4
07.YASH@GMAIL.COM,4,4
That is, one more column containing the maximum number of entries for the email in that row, so that every row carries the maximum occurrence count of its email.
I am looking for a feasible solution for such a large file, preferably in Python or a shell script, with a complexity of O(n) or O(n log n). O(n**2) won't do in this case.
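Solution
Let's try a Python script, since you may be more familiar with that language. It doesn't require huge memory or hard disk space, because it only keeps the current email and its count. It does rely on all rows for the same email being adjacent, as they are in the sample. Tested on Python 2.7 and 3.2.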
#!/usr/bin/python
import fileinput

email = ""   # Initialize the email
count = 0    # and counter

for line in fileinput.input("word.txt"):   # Iterator: process a line at a time
    myArr = line.split(",")
    if email != myArr[0]:   # New email; print and reset count, email
        for n in range(0, count):
            # print(...) with one argument works on both Python 2.7 and 3.x
            print(email + "," + str(n + 1) + "," + str(count))
        email = myArr[0]
        count = 1
    else:   # Same email, increment count
        count = count + 1

# Print the final email
for n in range(0, count):
    print(email + "," + str(n + 1) + "," + str(count))
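The same single-pass idea can also be sketched with `itertools.groupby` from the standard library. This is only an illustration, not the answer's tested code: the function name `annotate` is made up for the example, and it assumes, like the script above, that the rows for each email are adjacent in the input.

```python
from itertools import groupby


def annotate(lines):
    """Yield 'email,n,count' rows (hypothetical helper).

    Assumes all rows for one email are adjacent in `lines`.
    """
    # Drop blank lines and trailing newlines before grouping.
    rows = (line.rstrip("\n") for line in lines if line.strip())
    # groupby only merges consecutive rows that share the same first field.
    for email, group in groupby(rows, key=lambda row: row.split(",")[0]):
        count = sum(1 for _ in group)   # consume the group to count its rows
        for n in range(1, count + 1):
            yield "%s,%d,%d" % (email, n, count)


if __name__ == "__main__":
    sample = ["a@x.com,1\n", "a@x.com,2\n", "b@x.com,1\n"]
    for row in annotate(sample):
        print(row)
```

Since `annotate` is a generator, an open file object can be passed in directly, so the 2 GB file is still read one line at a time.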
Anyone want to try an awk script?
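If the rows for an email are not guaranteed to be adjacent, one possible Python variant is to read the file twice: once to count rows per email, and once to write the output. The sketch below is an assumption-laden illustration, not part of the original answer. The name `annotate_unsorted` and both path parameters are hypothetical, and memory use grows with the number of distinct emails rather than staying constant.

```python
from collections import Counter


def annotate_unsorted(path_in, path_out):
    """Two passes over path_in (hypothetical helper, O(n) overall)."""
    totals = Counter()
    with open(path_in) as src:            # pass 1: total rows per email
        for line in src:
            if line.strip():
                totals[line.strip().split(",")[0]] += 1

    seen = Counter()                      # running index per email
    with open(path_in) as src, open(path_out, "w") as dst:
        for line in src:                  # pass 2: write email,n,total
            if not line.strip():
                continue
            email = line.strip().split(",")[0]
            seen[email] += 1
            dst.write("%s,%d,%d\n" % (email, seen[email], totals[email]))
```

Rows keep their input order, and the second column is regenerated as a running index per email, matching the output format of the script above.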