Performance of Jsoup vs regexes vs XPath for extracting content from HTML?


Question

I know that, in general, regular expressions should not be used to parse HTML.

But I want to run a performance test against a web application, and I know for sure what the HTML will look like. So I can use regexes to extract some data from the page source.

Since this is a performance test (using JMeter), I want the extraction to consume as few resources as possible on the master machine.

Which option is the least resource intensive: XPath, regexes (Jakarta ORO) or Jsoup?

Recommended answer

As of JMeter 2.8, the answer is Regexp, although it of course depends on the regular expressions you use. The Regexp implementation in JMeter is fairly well optimized, and it is the main post-processing method for correlation.
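To make the regex approach concrete, here is a minimal sketch of pulling one value out of a page whose markup is known in advance. It uses `java.util.regex` as a stand-in (JMeter's own extractor is based on Jakarta ORO, as the question notes), and the `token` hidden field and the class name `RegexExtractSketch` are hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractSketch {
    // Compiled once and reused, so no compilation cost is paid per sample.
    // The "token" hidden field is a made-up example of markup known in advance.
    private static final Pattern TOKEN =
            Pattern.compile("<input type=\"hidden\" name=\"token\" value=\"([^\"]*)\"");

    /** Returns the first captured value, or null when the page does not match. */
    static String extractToken(String html) {
        Matcher m = TOKEN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<form><input type=\"hidden\" name=\"token\" value=\"abc123\"/></form>";
        System.out.println(extractToken(html)); // prints abc123
    }
}
```

The scan works directly on the response text and builds no document tree. A narrow pattern such as `[^"]*` also tends to be cheaper than a broad `.*` on large pages, which is part of why the result depends on the expressions you use.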

Regarding JSoup, it would need custom coding, for example in a JSR223 post processor.

JMeter 2.9 will introduce a new CSS/JQuery selector-based Extractor with two possible underlying implementations:

Jodd Lagarto (see:

Its performance will be lower than Regexp because it builds a DOM document, but it makes the syntax much simpler in test plans that do not need to be ultra-optimised.

Finally, regarding XPath, as it builds a DOM tree:

Its memory and CPU cost is higher than that of regex, particularly if you want to extract many elements. An enhancement has been created:
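Separately from that enhancement, the DOM cost described above can be illustrated with a minimal sketch that uses only the JDK's built-in XML and XPath APIs. It assumes the response is well-formed XHTML (real-world HTML usually needs tidying first), and the `token` field and the class name `XPathExtractSketch` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractSketch {
    /** Returns the attribute value, or an empty string when nothing matches. */
    static String extractToken(String xhtml) {
        try {
            // The whole response is parsed into an in-memory DOM tree first;
            // this step is where the extra memory and CPU are spent.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Only after the tree exists can the expression be evaluated.
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//input[@name='token']/@value", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<form><input type=\"hidden\" name=\"token\" value=\"abc123\"/></form>";
        System.out.println(extractToken(xhtml)); // prints abc123
    }
}
```

Every sample pays for a full parse of the response even when a single attribute is wanted, whereas a regex can stop at the first match.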

