Extract email sub-strings from large document

Problem description

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<name@domain.com>...

What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, then grab the entire address within the <...> and add it to a list? The trouble I have is with the variable length of the different addresses.

Solution

This code extracts an email address from a string. Use it while reading the file line by line:

>>> import re
>>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol"
>>> match = re.search(r'[\w\.-]+@[\w\.-]+', line)
>>> match.group(0)
'321dsasdsa@dasdsa.com.lol'

If a line contains several email addresses, use findall:

>>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol or dadaads@dsdds.com"
>>> match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
>>> match
['321dsasdsa@dasdsa.com.lol', 'dadaads@dsdds.com']
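
Applied to the large file from the question, the line-by-line approach could look roughly like the sketch below. The function name `extract_emails` and its `path` argument are hypothetical and not part of the original answer; the pattern is the one used in the examples above.

```python
import re

# Same pattern as in the examples above.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def extract_emails(path):
    """Collect every address-like substring from a text file, line by line."""
    found = []
    # Iterating over the file object yields one line at a time,
    # so the whole file never has to be loaded into memory at once.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            found.extend(EMAIL_RE.findall(line))
    return found
```

Because `findall` returns an empty list for lines without a match, lines that contain no address need no special handling.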


The regex above will probably find the most common, well-formed email addresses. If you want to be fully aligned with RFC 5322, you should check which address forms the specification actually allows, to avoid bugs when matching email addresses.


Edit: as suggested in a comment by @kostek: in the string "Contact us at support@example.com." my regex returns support@example.com. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+
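
The difference between the two patterns can be checked with a short script. This is only an illustrative sketch; the variable names `loose` and `strict` are made up for the comparison.

```python
import re

text = "Contact us at support@example.com."

# Original pattern: the trailing sentence dot is swallowed into the match.
loose = re.findall(r'[\w\.-]+@[\w\.-]+', text)

# Pattern from the edit: the match must end with a dot followed by word characters.
strict = re.findall(r'[\w\.,]+@[\w\.,]+\.\w+', text)

print(loose)   # ['support@example.com.']
print(strict)  # ['support@example.com']
```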

Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture example@do-main.com as well.
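
For the specific case in the question (addresses wrapped in <...> and restricted to one domain), a variation along the following lines might work. It is a sketch under assumptions, not the answer's own method: the helper name `emails_for_domain`, the sample lines, and the choice to anchor on the angle brackets are all illustrative.

```python
import re

def emails_for_domain(lines, domain):
    """Return addresses of the form <local@domain> found in an iterable of lines."""
    # re.escape keeps the dots in the domain literal;
    # [^<>@\s]+ accepts a local part of any length, so variable-length addresses are fine.
    pattern = re.compile(r'<([^<>@\s]+@' + re.escape(domain) + r')>')
    matches = []
    for line in lines:
        # With a single capture group, findall returns just the captured address.
        matches.extend(pattern.findall(line))
    return matches

sample = [
    "...<name@domain.com>...<other@elsewhere.org>...",
    "...<a.much.longer-name@domain.com>...",
]
print(emails_for_domain(sample, "domain.com"))
# ['name@domain.com', 'a.much.longer-name@domain.com']
```

An open file object can be passed as `lines`, since iterating over it yields one line at a time.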
