Jsoup parsing an HTML file with a tbody tag

Question

I recently noticed inconsistent Jsoup behavior when it comes to tbody tags. When I parse a remote page on the web with an HTML structure like:

<table>
   <tbody>
     <tr><td>... text
   </tbody>
</table>

Jsoup does not include the tbody element in the elements returned by the select() method.

I use connect().get() to load the remote page into a Document variable, like this:

Document doc = Jsoup.connect(url).get();
String expr = "table>tr>td";
String parsedTxt = doc.select(expr).text();

But when I parse the same page from my local disk (after downloading it), Jsoup includes the tbody tag. My expression no longer works, because it does not account for the tbody element.

For the local file I use:

File input = new File(locationOfFile);
Document doc = Jsoup.parse(input, "UTF-8", "");

My Jsoup expression works only in the first case.

Is there a way to force Jsoup to recognize the tbody element (or to remove it) so that the same expression can be used in both cases?

Is this normal behavior for Jsoup?

Should I be using the connect method to parse the local page as well?

Recommended answer

It sounds like the browser you used to save the file included or created the tbody tags when it saved the file. Which browser did you use to save the file to your desktop?

I would try downloading the file manually using curl or wget and then parsing from that file.
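
Whichever way the file was saved, one way to make the same expression work in both cases may be to use descendant combinators (spaces) instead of child combinators (>), so the selector matches whether or not a tbody sits between table and tr. A minimal sketch, assuming jsoup is on the classpath; the class name TbodySelectorSketch and the helper cellText are hypothetical and used only for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TbodySelectorSketch {

    // Hypothetical helper: returns the combined text of every td that is
    // a descendant of a tr that is a descendant of a table.
    // "table tr td" does not require tr to be a direct child of table,
    // so an intermediate tbody does not affect the match.
    static String cellText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("table tr td").text();
    }

    public static void main(String[] args) {
        String withTbody = "<table><tbody><tr><td>text</td></tr></tbody></table>";
        String withoutTbody = "<table><tr><td>text</td></tr></table>";

        // Both inputs are selected by the same expression.
        System.out.println(cellText(withTbody));
        System.out.println(cellText(withoutTbody));
    }
}
```

The trade-off is that a descendant selector also matches cells in nested tables, so if the page nests tables the expression may need to be narrowed, for example with an id or class on the target table.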
