解析文本的最快方法是什么? [英] What is the fastest way to parse text?

查看:73
本文介绍了解析文本的最快方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我想提取在某个文本文件中找到的给定字符串后面的第一个单词(或浮点数)(参见 如何提取字符串后面的第一个单词?).我知道你可以用 perl 或 sed 来完成,也许还有很多其他的方法.我正在寻找性能.最快的解析方式是什么?

Say I want to extract the first word (or floating point number) that follows a given string found in some text file (see How to extract the first word that follows a string?). I know you can do it with perl, or sed, and probably many other ways. I am looking for performance. What is the fastest way to parse?

推荐答案

如果您正在寻找固定字符串,您可能想使用类似 Boyer-Moore 或 Boyer-Moore-Horspool(对于后者,我会推荐 Ray Gardner 的实现).请注意,B-M 和 B-M-H 都是次线性的.相比之下,正则表达式最多是线性的1,并且许多实现(使用回溯的实现)是二次的.

If you're looking for a fixed string, you probably want to search for it using something like Boyer-Moore or Boyer-Moore-Horspool (for the latter, I'd recommend Ray Gardner's implementation). Note that B-M and B-M-H are both sublinear. Regular expressions, by contrast, are linear at best1, and many implementations (those that use backtracking) are quadratic.

下一步是确保您尽快将数据读入内存.实际上,这通常会成为瓶颈.不幸的是,为了很好地处理瓶颈,您通常必须使用一些不可移植的代码.在 Linux 下,mmap 往往是您最好的选择,而在 Windows 下,您通常最好一次读取大块并调用 CreateFilecode> 带有 FILE_FLAG_NO_BUFFERING 标志.还值得使用I/O完成端口(IOCP)进行读取,这样您就可以并行进行搜索和读取.

The next step is to ensure you're reading the data into memory as quickly as possible. In reality, this is typically going to be the bottleneck. Unfortunately, to deal well with the bottleneck, you typically have to use some non-portable code. Under Linux, mmap tends to be about your best bet, while under Windows you're usually better off reading large chunks at a time, and invoking CreateFile with the FILE_FLAG_NO_BUFFERING flag. It's also worth using I/O completion ports (IOCP) to carry out the reading, so you can carry out the searching and the reading in parallel.

1理论上,可以编写一个 RE 引擎,对正确类型的模式进行次线性搜索——但如果有任何实际这样做,我不知道.

1In theory it would be possible to write an RE engine that did sublinear searching for the right kinds of patterns -- but if there's any that actually does, I'm not aware of it.

这篇关于解析文本的最快方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆