JSOUP adding extra encoded stuff for an html


Question

Actually, jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to take care of it with:

String url = "http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html";
System.out.println("Fetching " + url + "...");

Document doc = Jsoup.connect(url).get();
//System.out.println(doc.html());

Document.OutputSettings settings = doc.outputSettings();

settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.base);
settings.charset("ASCII");
String html = doc.html();
System.out.println(html);

But the Entities class is not found for some reason, and I get an error. My imports are:

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

The original HTML is:

<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>

</head>
<body>


<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">                      

<div style="height:2058px; padding-left:0px; padding-top:36px;">


<iframe style="height:90px; width:728px;" />



</div>
</div>

</body>
</html>

The doc.html() output from jsoup is:

<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
 <head> 
  <style>

</style> 
 </head> 
 <body> 
  <div style="background-image: url(aol.jpeg); background-repeat: no-repeat;-webkit-background-size:90720;height:720; width:90; text-align: center; margin: 0 auto;"> 
   <div style="height:450; width:100; padding-left:681px; padding-top:200px;"> 
    <iframe style="height:1050px; width:300px;"></iframe> &lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt;
   </div>
  </div>
 </body>
</html>

Some encoded content has been added after the iframe element.

Please help.

Thanks, Swaraj

Recommended answer

Actually, jsoup is not adding the encoded stuff. It just adds the closing tags that seem to be missing. Let me explain.

First of all, jsoup tries to normalize your HTML. In your case that means it adds the closing tags that are missing. For example:

Document doc = Jsoup.parse("<div>test<span>test");
System.out.println(doc.html());

Output:

<html>
 <head></head>
 <body>
  <div>
   test
   <span>test</span>
  </div>
 </body>
</html>

If you check the encoded stuff, you will see that it consists of escaped closing tags:

&lt;/div&gt;  = </div> 
&lt;/div&gt;  = </div>
&lt;/body&gt; = </body>

If you open the page and press Ctrl+U (in Chrome), you will see the source that jsoup parses. Chrome colors the HTML tags it recognizes as valid. For some reason it does not recognize the tags at the bottom (the same ones that show up escaped in your output). jsoup has the same problem with those closing tags: it treats them as text rather than as closing tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as I explained earlier.

EDIT: I managed to replicate the behavior.

Document doc = Jsoup.parse("<iframe /><span>test</span>");
System.out.println(doc.html());

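You can see the exact same behavior. The problem is the self-closing iframe: iframe is not a void element in HTML, so the parser ignores the trailing slash, treats the tag as an opening tag, and reads what follows as the iframe's text content. Writing it with an explicit closing tag fixes the problem: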

Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
System.out.println(doc.html());

EDIT 2: If you just want to receive the raw HTML without building a Document object, you can do this:

Connection.Response html = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
System.out.println(html.body());

With the raw HTML in hand, you can find the self-closing iframe and replace it with the valid form (or remove it completely), then parse the resulting string with Jsoup.parse(). This fixes the problem of the closing tags after the iframe not being recognized, because the markup will then be valid.
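
The replacement step could look roughly like the sketch below. It is only an illustration, not part of the original answer: the class name IframeFix, the method expandSelfClosingIframes, and the regular expression are all assumptions, and a regex like this will not cope with every possible attribute value (for example one that itself contains "/>"). It uses only the Java standard library, so it covers the string rewrite and nothing else.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not from the original answer): rewrites self-closing
// <iframe ... /> tags into the explicit <iframe ...></iframe> form.
final class IframeFix {

    // Matches "<iframe", lazily captures any attributes (group 1), then
    // optional whitespace and the "/>" that ends a self-closing tag.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    public static String expandSelfClosingIframes(String html) {
        // $1 re-inserts the captured attributes unchanged.
        return SELF_CLOSING_IFRAME.matcher(html).replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String raw = "<div><iframe style=\"height:90px; width:728px;\" /></div>";
        // prints: <div><iframe style="height:90px; width:728px;"></iframe></div>
        System.out.println(expandSelfClosingIframes(raw));
    }
}
```

In the scenario above, the string returned by html.body() would be passed through expandSelfClosingIframes first, and the result handed to Jsoup.parse().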
