Want to print the maximum entry of every email against it in another column


Problem description

I have a huge file, around 2 GB, with more than 20 million rows, and here is what I want to do with it.

The input file looks like this:

07.SHEKHAR@GMAIL.COM,1
07SHIBAJI@GMAIL.COM,1
07.SHINDE@GMAIL.COM,1
07.SHINDE@GMAIL.COM,2
07.SHINDE@GMAIL.COM,3
07.SHINDE@GMAIL.COM,4
07.SHINDE@GMAIL.COM,5
07.SHINDE@GMAIL.COM,6
07.SHINDE@GMAIL.COM,7
07.SHOBHIT@GMAIL.COM,1
07SKERCH@RUSKIN.AC.UK,1
07SONIA@GMAIL.COM,1
07SONIA@GMAIL.COM,2
07SONIA@GMAIL.COM,3
07SRAM@GMAIL.COM,1
07SRAM@GMAIL.COM,2
07.SUMANTA@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,1
07SUPRIYO@GMAIL.COM,2
07SUPRIYO@GMAIL.COM,3
07.SUSHMA@GMAIL.COM,1
07.SWETA@GMAIL.COM,1
07.SWETA@GMAIL.COM,2
07.SWETA@GMAIL.COM,3
07.TEENA@GMAIL.COM,1
07.TEENA@GMAIL.COM,2
07.UDAY@GMAIL.COM,1
07.UMESH@GMAIL.COM,1
07VAISHALISINGH@GMAIL.COM,1
07.VISHAL@GMAIL.COM,1
07.VISHAL@GMAIL.COM,2
07.VISHAL@GMAIL.COM,3
07.VISHAL@GMAIL.COM,4
07.VISHAL@GMAIL.COM,5
07.VISHAL@GMAIL.COM,6
07.VISHAL@GMAIL.COM,7
07.YASH@GMAIL.COM,1
07.YASH@GMAIL.COM,2
07.YASH@GMAIL.COM,3
07.YASH@GMAIL.COM,4

Required output file:

07.SHEKHAR@GMAIL.COM,1,1
07SHIBAJI@GMAIL.COM,1,1
07.SHINDE@GMAIL.COM,1,7
07.SHINDE@GMAIL.COM,2,7
07.SHINDE@GMAIL.COM,3,7
07.SHINDE@GMAIL.COM,4,7
07.SHINDE@GMAIL.COM,5,7
07.SHINDE@GMAIL.COM,6,7
07.SHINDE@GMAIL.COM,7,7
07.SHOBHIT@GMAIL.COM,1,1
07SKERCH@RUSKIN.AC.UK,1,1
07SONIA@GMAIL.COM,1,3
07SONIA@GMAIL.COM,2,3
07SONIA@GMAIL.COM,3,3
07SRAM@GMAIL.COM,1,2
07SRAM@GMAIL.COM,2,2
07.SUMANTA@GMAIL.COM,1,1
07SUPRIYO@GMAIL.COM,1,3
07SUPRIYO@GMAIL.COM,2,3
07SUPRIYO@GMAIL.COM,3,3
07.SUSHMA@GMAIL.COM,1,1
07.SWETA@GMAIL.COM,1,3
07.SWETA@GMAIL.COM,2,3
07.SWETA@GMAIL.COM,3,3
07.TEENA@GMAIL.COM,1,2
07.TEENA@GMAIL.COM,2,2
07.UDAY@GMAIL.COM,1,1
07.UMESH@GMAIL.COM,1,1
07VAISHALISINGH@GMAIL.COM,1,1
07.VISHAL@GMAIL.COM,1,7
07.VISHAL@GMAIL.COM,2,7
07.VISHAL@GMAIL.COM,3,7
07.VISHAL@GMAIL.COM,4,7
07.VISHAL@GMAIL.COM,5,7
07.VISHAL@GMAIL.COM,6,7
07.VISHAL@GMAIL.COM,7,7
07.YASH@GMAIL.COM,1,4
07.YASH@GMAIL.COM,2,4
07.YASH@GMAIL.COM,3,4
07.YASH@GMAIL.COM,4,4

That is, one more column holding the maximum number of entries for the email on that row, so that every row also carries the total number of occurrences of its email. I am looking for a feasible solution for such a large file, preferably in Python or a shell script, with a complexity of O(n) or O(n log n); O(n**2) won't do in this case.

Solution

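Let's try a Python script, since you may be more familiar with that language. It reads one line at a time, so it doesn't require huge memory or hard disk space, and it makes a single pass over the file. It relies on all rows for a given email being adjacent, as they are in the sample. Tested on Python 2.7 and 3.2.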

#!/usr/bin/python
import fileinput

email = ""  # Email of the group currently being counted
count = 0   # Number of rows seen so far for that email

# Assumes rows for the same email are adjacent (e.g. the file is sorted by email)
for line in fileinput.input("word.txt"):  # Iterator: process a line at a time
  myArr = line.split(",")
  if email != myArr[0]:  # New email: print the finished group, then reset
    for n in range(0, count):
      # print(...) with a single string works on both Python 2.7 and 3.x
      print(email + "," + str(n+1) + "," + str(count))
    email = myArr[0]
    count = 1
  else:  # Same email, increment count
    count = count + 1

# Print the final email's group
for n in range(0, count):
  print(email + "," + str(n+1) + "," + str(count))
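
For comparison, the same single-pass idea can be sketched with `itertools.groupby` from the standard library. This is only an illustration, not the answer's own code: the function name `append_group_totals` and the sample rows are made up here, and it makes the same assumption that rows for one email are adjacent. Like the script above, it regenerates the second column as 1..total rather than reading it from the input.

```python
import io
from itertools import groupby


def append_group_totals(lines):
    """Yield 'email,n,total' per input row; rows for an email must be adjacent."""
    # Strip newlines and skip blank lines before grouping.
    rows = (line.strip() for line in lines)
    rows = (row for row in rows if row)
    # Group consecutive rows that share the same first field (the email).
    for email, group in groupby(rows, key=lambda row: row.split(",")[0]):
        total = sum(1 for _ in group)  # count the group's rows without storing them
        for n in range(1, total + 1):
            yield "{},{},{}".format(email, n, total)


# Hypothetical sample data standing in for the real file.
sample = io.StringIO("A@X.COM,1\nB@X.COM,1\nB@X.COM,2\n")
result = list(append_group_totals(sample))
print(result)
```

With a real file, the open file object can be passed directly, for example `for out in append_group_totals(open("word.txt")): print(out)`, so only one line is held in memory at a time.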

Anyone want to try an awk script?
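
If the rows for an email are not guaranteed to be adjacent, one possible alternative is a two-pass approach: count every email first, then append the total to each row. The sketch below is an assumption-laden illustration rather than part of the original answer; the helper names `email_of`, `count_emails`, and `annotate` are hypothetical. It is still O(n) in time, but memory grows with the number of distinct emails, which may or may not be acceptable for a 2 GB file.

```python
from collections import Counter


def email_of(line):
    """Return the first comma-separated field (the email) of a row."""
    return line.split(",")[0].strip()


def count_emails(lines):
    """First pass: count how many rows each email has."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[email_of(line)] += 1
    return counts


def annotate(lines, counts):
    """Second pass: append each email's total to its original row."""
    for line in lines:
        row = line.strip()
        if row:
            yield row + "," + str(counts[email_of(row)])


# Hypothetical, deliberately unsorted sample rows.
rows = ["B@X.COM,1\n", "A@X.COM,1\n", "B@X.COM,2\n"]
counts = count_emails(rows)
annotated = list(annotate(rows, counts))
print(annotated)
```

For a file on disk, it would be opened once for `count_emails` and a second time for `annotate`, since a file iterator is exhausted after the first pass.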
