Regular expression on a stream instead of a string?
Question
Suppose you want to do a regular expression search and extraction over a pipe, but the pattern may span multiple lines. How would you do it? Is there a regular expression library that works on a stream?
I would prefer to do this with a Python library, but any solution is fine, as long as it is a library and not a command-line tool.
BTW, I know how to solve my current problem; I am just looking for a general solution.
If no such library exists, why can't regular expression libraries work with streams, given that the regular matching algorithm never needs to scan backward?
Recommended answer
If you are after a general solution, your algorithm would need to look something like:
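1. Read a chunk of the stream into a buffer.
2. Search for the regexp in the buffer.
3. If the pattern matches, do whatever you want with the match, discard the start of the buffer up to match.end(), and go to step 2.
4. If the pattern does not match, extend the buffer with more data from the stream and go to step 2.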
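The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a library API: the function name stream_finditer and the chunk_size parameter are made up for the example. It assumes a text stream with a read(n) method and a pattern that does not match the empty string. Note that a match found at the end of the buffer might have been longer had more data been available (for example \d+ split across two chunks); this sketch does not try to handle that case.

```python
import io
import re


def stream_finditer(pattern, stream, chunk_size=4096):
    """Yield match objects for `pattern` found in a text stream.

    Hypothetical helper. Match positions are relative to the internal
    buffer at the time of the match, not to the start of the stream.
    """
    regex = re.compile(pattern)
    buffer = ""
    while True:
        match = regex.search(buffer)            # step 2: search the buffer
        if match and match.end() > 0:
            yield match                         # step 3: use the match...
            buffer = buffer[match.end():]       # ...and discard up to match.end()
            continue
        chunk = stream.read(chunk_size)         # steps 1 and 4: read more data
        if not chunk:
            return                              # stream exhausted, no more matches
        buffer += chunk


if __name__ == "__main__":
    text = "foo START a\nb END bar START c END"
    block = re.compile(r"START.*?END", re.DOTALL)  # pattern that spans lines
    for m in stream_finditer(block, io.StringIO(text), chunk_size=5):
        print(repr(m.group()))
```

The tiny chunk_size in the usage example is only there to force matches to straddle chunk boundaries; a real caller would use a much larger value.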
This could end up using a lot of memory if no matches are found, but it is difficult to do better in the general case (consider trying to match .*x as a multi-line regexp in a large file where the only x is the last character).
If you know more about the regexp, there may be other cases where you can discard part of the buffer.
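One such case, offered here as an illustration rather than something the answer spells out: if you know that no match can be longer than some number of characters, then after a failed search only the last max_match_len - 1 characters of the buffer can still belong to a future match, so everything before them can be dropped. The sketch below assumes this bound is supplied by the caller, and that the pattern uses no anchors or lookaround (whose behavior could change when the front of the buffer is cut off) and does not match the empty string. The names bounded_stream_finditer and max_match_len are hypothetical.

```python
import io
import re


def bounded_stream_finditer(pattern, stream, max_match_len, chunk_size=4096):
    """Yield matches from a text stream while keeping the buffer small.

    Hypothetical helper: relies on the caller's promise that no match
    is longer than `max_match_len` characters.
    """
    regex = re.compile(pattern)
    keep = max(max_match_len - 1, 0)
    buffer = ""
    while True:
        match = regex.search(buffer)
        if match and match.end() > 0:
            yield match
            buffer = buffer[match.end():]       # discard everything up to the match end
            continue
        # No match: a future match must end in data not yet read, so it can
        # start at most `keep` characters before the end of the buffer.
        buffer = buffer[max(len(buffer) - keep, 0):]
        chunk = stream.read(chunk_size)
        if not chunk:
            return                              # stream exhausted
        buffer += chunk


if __name__ == "__main__":
    text = "x" * 100 + "2024-05" + "y" * 100 + "1999-12"
    # Every match of this pattern is exactly 7 characters long.
    for m in bounded_stream_finditer(r"\d{4}-\d{2}", io.StringIO(text), 7, chunk_size=8):
        print(m.group())
```

With this bound the buffer never grows beyond roughly chunk_size + max_match_len characters between matches, no matter how long the stream runs without a match.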