What is an efficient way to find a pattern in a large text?
Question
I want to extract email addresses from a large text file. What is the best way to do it?
My idea is to find each '@' in the text and then run a regex over a substring around it, for example starting 256 characters before that position and spanning 512 characters.
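The windowed idea described above could be sketched roughly as follows. This is only an illustration: the function name `emails_near_at` and the simplified `EMAIL_RE` pattern are assumptions made for the example, and the pattern is far looser than the full address grammar.

```python
import re

# Simplified, assumed pattern: local part, '@', then at least two dot-separated labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def emails_near_at(text, before=256, length=512):
    """For each '@', run the regex only on a window around it."""
    found = []
    pos = text.find("@")
    while pos != -1:
        start = max(0, pos - before)
        window = text[start:start + length]
        for m in EMAIL_RE.finditer(window):
            # Keep only the match that covers this particular '@'.
            if m.start() <= pos - start < m.end():
                found.append(m.group())
                break
        pos = text.find("@", pos + 1)
    return found
```

Note that windows around nearby '@' characters overlap, so the same stretch of text may be scanned by the regex more than once.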
P.S.: Put plainly, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.
Recommended answer
If you absolutely need the most efficient way, I don't think regular expressions should be used.
Assuming almost all instances of @ in your text are email addresses and you are working in a language with fast forward and backward string traversal, this method will probably be close to the fastest:
- Search for @
- Manually compare each character after the @ to make sure it is within the allowed ASCII ranges
- Keep track of whether a valid domain was found before the first space or other valid terminating character
- Search again from the @ symbol backwards, comparing each character to make sure it falls within the valid character ranges for the local component
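The steps above can be sketched as a hand-written scanner. This is a minimal sketch, not a definitive implementation: the name `scan_emails` and the character sets `LOCAL_CHARS` and `DOMAIN_CHARS` are assumptions chosen for illustration and are much narrower than what the address standards allow. Python is used only for readability; the approach pays off most in a language where per-character traversal is cheap.

```python
import string

# Assumed, simplified character sets for the two halves of an address.
LOCAL_CHARS = set(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + ".-")

def scan_emails(text):
    """Find each '@', then walk forward for the domain and backward for the local part."""
    results = []
    at = text.find("@")
    while at != -1:
        # Forward: consume characters allowed in a domain.
        end = at + 1
        while end < len(text) and text[end] in DOMAIN_CHARS:
            end += 1
        # Drop trailing '.' or '-', which is usually sentence punctuation.
        while end > at + 1 and text[end - 1] in ".-":
            end -= 1
        domain = text[at + 1:end]

        # Backward: consume characters allowed in the local part.
        start = at
        while start > 0 and text[start - 1] in LOCAL_CHARS:
            start -= 1

        # Accept only a non-empty local part and a domain with an inner dot.
        domain_ok = (
            "." in domain
            and ".." not in domain
            and not domain.startswith((".", "-"))
        )
        if start < at and domain_ok:
            results.append(text[start:end])
        at = text.find("@", at + 1)
    return results
```

Each '@' costs only a short walk in each direction, so the total work stays close to a single pass over the text plus the cost of locating the '@' characters.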