What is an efficient way to find a pattern in a large text?

Question

I want to extract email addresses from a large text file. What is the best way to do it?

My idea is to find each '@' in the text and then run a regex over a substring around it, for example one that starts 256 characters before that position and is 512 characters long.
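A minimal Python sketch of this windowed idea follows. The function name `extract_emails_windowed`, the parameters `before` and `span`, and the deliberately simplified pattern `EMAIL_RE` are all assumptions made for illustration; real address syntax is broader than this pattern allows.

```python
import re

# Simplified, assumed pattern: it does not cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails_windowed(text, before=256, span=512):
    """Find each '@', then run the regex only on a window around it."""
    found = []
    pos = text.find("@")
    while pos != -1:
        # Window starts `before` characters ahead of the '@' and is `span` long.
        start = max(0, pos - before)
        end = min(len(text), start + span)
        # Pattern.finditer accepts pos/endpos, so no substring copy is needed.
        for match in EMAIL_RE.finditer(text, start, end):
            # Keep only the match that actually covers this '@'.
            if match.start() <= pos < match.end():
                found.append(match.group())
                break
        pos = text.find("@", pos + 1)
    return found
```

Note that when several addresses sit close together, each window rescans its neighbours, so this is not necessarily faster than a single regex pass over the whole text.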

P.S.: More generally, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.

Recommended answer

If you absolutely need the most efficient way, I don't think regular expressions should be used.

Assuming almost all instances of @ in your text belong to email addresses, and you are working in a language with fast forward and backward string traversal, this method will probably be close to the fastest:

  1. Search for @
  2. Manually compare each character after the @ to make sure it is within the allowed ASCII ranges
  3. Keep track of whether a valid domain was found before the first space or other valid terminating character
  4. Search again from the @ symbol backwards, comparing each character to make sure it falls within the valid character ranges for the local component
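
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the answer does not list the exact allowed character ranges, so `LOCAL_CHARS`, `DOMAIN_CHARS`, and the helper `is_plausible_domain` are simplified stand-ins, and the function names are hypothetical.

```python
import string

# Assumed, simplified character sets; the answer does not specify exact ranges.
LOCAL_CHARS = frozenset(string.ascii_letters + string.digits + "._%+-")
DOMAIN_CHARS = frozenset(string.ascii_letters + string.digits + ".-")


def is_plausible_domain(domain):
    """Assumed rule: at least two non-empty labels, no leading hyphen."""
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels) and not domain.startswith("-")


def scan_emails(text):
    """Extract email-like tokens by scanning outward from each '@'."""
    results = []
    n = len(text)
    at = text.find("@")  # step 1: search for '@'
    while at != -1:
        # Steps 2-3: walk forward until a character outside the domain set.
        end = at + 1
        while end < n and text[end] in DOMAIN_CHARS:
            end += 1
        # Trailing '.' or '-' is treated as sentence punctuation, not domain.
        domain = text[at + 1:end].rstrip(".-")
        # Step 4: walk backward over characters allowed in the local part.
        start = at
        while start > 0 and text[start - 1] in LOCAL_CHARS:
            start -= 1
        local = text[start:at].lstrip(".")
        if local and is_plausible_domain(domain):
            results.append(local + "@" + domain)
        # Resume after the characters already consumed as domain.
        at = text.find("@", end)
    return results
```

For example, `scan_emails("Write to alice@example.com, or bob_smith@mail.example.org.")` returns `['alice@example.com', 'bob_smith@mail.example.org']`. In an interpreted language such as Python, a character-by-character loop may well be slower than a compiled regex, so the efficiency claim applies mainly to languages where this kind of traversal is cheap.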
