How do I build a NodeTraversor/NodeVisitor with JSoup?
Problem description
I'm pretty much a beginner in programming and am currently trying to build my first web scraper using JSoup. So far I can get the data I want from a single page of my target site, but naturally I would like to somehow iterate over the entire site.
JSoup seems to offer some kind of traverser/visitor (what's the difference?) for that, yet I have absolutely no idea how to make it work. I know what trees and nodes are, and I know the structure of my target site, but I don't know how to create (?) a traverser/visitor object (?) and let it run over my site. Could there be some advanced Java/OO magic at work that I don't know of?
Unfortunately, neither the Jsoup cookbook nor other threads seem to really cover the details, so if someone could nudge me in the right direction I'd be very thankful.
Recommended answer
"JSoup seems to offer some kind of traverser/visitor (what's the difference?)"
The NodeTraversor efficiently iterates through all nodes under and including a specified root node. It doesn't use recursion, so a large DOM won't cause a stack overflow.
The NodeVisitor (NV) is the companion of the NodeTraversor (NT). Each time NT enters a node, it calls the head method of the NV. Each time NT leaves a node, it calls the tail method of the NV.
NT is ready-made and provided to you by the Jsoup API. All you have to do is give NT an NV implementation.
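To make the head/tail order concrete, here is a minimal sketch of a loop-based (non-recursive) depth-first walk. It deliberately does not use jsoup: `SimpleNode`, `SimpleVisitor`, `traverse` and `trace` are hypothetical stand-ins invented for illustration, and jsoup's real implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-ins only; these are NOT jsoup's classes. */
class TraversalSketch {

    /** Hypothetical minimal tree node that knows its parent and children. */
    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();
        SimpleNode parent;

        SimpleNode(String name) { this.name = name; }

        /** Appends a child and returns this node so calls can be chained. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** Returns the next sibling, or null if there is none. */
        SimpleNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    /** Hypothetical visitor with the same head/tail callback pair. */
    interface SimpleVisitor {
        void head(SimpleNode node, int depth);
        void tail(SimpleNode node, int depth);
    }

    /** Depth-first walk driven by a loop, so deep trees cannot overflow the call stack. */
    static void traverse(SimpleVisitor visitor, SimpleNode root) {
        SimpleNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);                 // entering the node
            if (!node.children.isEmpty()) {
                node = node.children.get(0);           // descend to the first child
                depth++;
            } else {
                // Leaf: climb back up, leaving every node that has no next sibling.
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);
                    node = node.parent;
                    depth--;
                }
                visitor.tail(node, depth);             // leaving the current node
                if (node == root) break;               // never walk past the root
                node = node.nextSibling();
            }
        }
    }

    /** Records every callback as "head|tail <name> <depth>". */
    static List<String> trace(SimpleNode root) {
        final List<String> events = new ArrayList<>();
        traverse(new SimpleVisitor() {
            @Override public void head(SimpleNode n, int depth) {
                events.add("head " + n.name + " " + depth);
            }
            @Override public void tail(SimpleNode n, int depth) {
                events.add("tail " + n.name + " " + depth);
            }
        }, root);
        return events;
    }

    public static void main(String[] args) {
        // Tree: a -> (b -> c), d
        SimpleNode root = new SimpleNode("a")
                .add(new SimpleNode("b").add(new SimpleNode("c")))
                .add(new SimpleNode("d"));
        for (String event : trace(root)) {
            System.out.println(event);
        }
    }
}
```

Running `main` prints, one per line: head a 0, head b 1, head c 2, tail c 2, tail b 1, head d 1, tail d 1, tail a 0. Each node gets exactly one head call before its children are visited and one tail call after them, which is the contract a visitor implementation can rely on.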
Here is a real-life implementation of NodeVisitor, taken from the ElasticSearch source code:
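    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;
    import org.jsoup.select.NodeTraversor;
    import org.jsoup.select.NodeVisitor;

    // Collects the text of all given elements into one space-separated string.
    protected static String convertElementsToText(Elements elements) {
        if (elements == null || elements.isEmpty())
            return "";
        StringBuilder buffer = new StringBuilder();
        // Constructor-based usage, as in the quoted source; newer jsoup releases
        // offer a static NodeTraversor.traverse(visitor, root) instead.
        NodeTraversor nt = new NodeTraversor(new ToTextNodeVisitor(buffer));
        for (Element element : elements) {
            nt.traverse(element);
        }
        return buffer.toString().trim();
    }

    private static final class ToTextNodeVisitor implements NodeVisitor {
        final StringBuilder buffer;

        ToTextNodeVisitor(StringBuilder buffer) {
            this.buffer = buffer;
        }

        // Called when the traversor enters a node: append the text of text nodes.
        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim(); // non breaking space
                if (!text.isEmpty()) {
                    buffer.append(text);
                    if (!text.endsWith(" ")) {
                        buffer.append(" ");
                    }
                }
            }
        }

        // Called when the traversor leaves a node: nothing to do here.
        @Override
        public void tail(Node node, int depth) {
        }
    }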