Jsoup node hash code collision when traversing DOM tree


Problem description


I'm using jsoup (Java) to build HTML DOM trees, and I rely on Node.hashCode(). However, I see a lot of hash code collisions when traversing the DOM tree with the following code:

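import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

// doc is an org.jsoup.nodes.Document that has already been parsed
doc.traverse(new NodeVisitor() {

    // Called when a node is first visited, before its children
    @Override
    public void head(Node node, int depth) {
        System.out.println("node hash: " + node.hashCode());

        /* some other operations */
    }

    // Called after all of the node's children have been visited
    @Override
    public void tail(Node node, int depth) {
        /* some codes */
    }
});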

So when this is run, I see many identical hash codes even in the first several outputs.

The hash codes are pretty large, so I didn't expect this kind of behavior. I'm using jsoup 1.8.1. Any input would be greatly appreciated, thanks.

Solution

Note: This bug has been fixed in jsoup 1.8.2, so my answer is no longer relevant.

This might be a bug in the jsoup source. From the source:

@Override
public int hashCode() {
   int result = parentNode != null ? parentNode.hashCode() : 0;
   // not children, or will block stack as they go back up to parent)
   result = 31 * result + (attributes != null ? attributes.hashCode() : 0);
   return result;
}

I'm not a Java expert, but this looks like it can return the same value for different nodes if they have the same attributes (and the same parent, thanks to @alkis for the comment).
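To make that reasoning concrete, here is a minimal sketch that copies the hashCode() formula quoted above into a stand-in class. FakeNode, its text field and the node(...) helper are hypothetical names made up for this illustration; they are not jsoup classes, and the attributes are modelled as a plain Map rather than jsoup's own attribute type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HashCollisionSketch {

    // Hypothetical stand-in for a node; NOT the real jsoup Node class.
    static final class FakeNode {
        final FakeNode parentNode;
        final Map<String, String> attributes;
        final String text; // never consulted by hashCode() below

        FakeNode(FakeNode parentNode, Map<String, String> attributes, String text) {
            this.parentNode = parentNode;
            this.attributes = attributes;
            this.text = text;
        }

        // Same formula as the hashCode() quoted from the jsoup source above.
        @Override
        public int hashCode() {
            int result = parentNode != null ? parentNode.hashCode() : 0;
            result = 31 * result + (attributes != null ? attributes.hashCode() : 0);
            return result;
        }
    }

    // Hypothetical helper: builds a node with an optional style attribute.
    static FakeNode node(FakeNode parent, String style, String text) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (style != null) {
            attrs.put("style", style);
        }
        return new FakeNode(parent, attrs, text);
    }

    public static void main(String[] args) {
        FakeNode body = node(null, null, "");
        FakeNode first = node(body, "blah", "TODO: write content");
        FakeNode second = node(body, "blah", "Nothing here");
        FakeNode third = node(body, "test", "Empty");

        // Same parent and same attributes: the text differs, the hash does not.
        System.out.println(first.hashCode() == second.hashCode()); // true
        // Same parent, different attribute value: the hashes differ.
        System.out.println(first.hashCode() == third.hashCode());  // false
    }
}
```

Since the text never enters the calculation, any two siblings with identical attributes end up with the same hash code under this formula.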


Edit: I can reproduce this. Using the following HTML:

<html>
    <head>
    </head>
    <body>
        <div style="blah">TODO: write content</div>
        <div style="blah">Nothing here</div>
        <p style="test">Empty</p>
        <p style="nothing">Empty</p>
    </body>
</html>

And the following code:

String html = "..."; // the HTML posted above

Document doc = Jsoup.parse(html);

Elements elements = doc.select("[style]");
for (Element e : elements) {
   System.out.println(e.hashCode());
}

It gives:

-148184373
-148184373
-1050420242
2013043377

The hash calculation seems to ignore the content text entirely; only the attributes (and the parent) matter.


You should probably implement your own workaround.
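One possible workaround, sketched here under my own assumptions and not taken from jsoup, is to stop relying on hashCode() and hand out IDs by object identity instead. java.util.IdentityHashMap compares keys with == and ignores any overridden hashCode(), so two distinct node objects always get distinct IDs. The class names IdentityIds and CollidingNode and the method idOf are hypothetical; CollidingNode only exists to imitate nodes whose hash codes collide.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityIds {

    // Hypothetical stand-in whose hash codes always collide; not a jsoup class.
    static final class CollidingNode {
        final String text;

        CollidingNode(String text) {
            this.text = text;
        }

        @Override
        public int hashCode() {
            return 42; // every instance collides on purpose
        }
    }

    // Keys are compared by reference (==), not by equals()/hashCode().
    private final Map<Object, Integer> ids = new IdentityHashMap<>();

    // Returns a stable ID for this exact object, assigning the next free one
    // the first time the object is seen.
    int idOf(Object node) {
        Integer id = ids.get(node);
        if (id == null) {
            id = ids.size();
            ids.put(node, id);
        }
        return id;
    }

    public static void main(String[] args) {
        IdentityIds registry = new IdentityIds();
        CollidingNode a = new CollidingNode("TODO: write content");
        CollidingNode b = new CollidingNode("Nothing here");

        System.out.println(a.hashCode() == b.hashCode()); // true: hashes collide
        System.out.println(registry.idOf(a));             // 0
        System.out.println(registry.idOf(b));             // 1
        System.out.println(registry.idOf(a));             // 0 again: same object
    }
}
```

Inside the NodeVisitor from the question, head(...) could call something like idOf(node) where it currently calls node.hashCode(). Note that System.identityHashCode(node) on its own is not guaranteed to be unique either, which is why the sketch keeps an explicit map.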


Bug reported here.
