Jsoup parsing an HTML file with a tbody tag
Question
I recently noticed inconsistent Jsoup behavior when it comes to tbody tags. When I parse a remote page on the web with an HTML structure like:
<table>
<tbody>
<tr><td>... text
</tbody>
</table>
Jsoup does not include the tbody element in the elements returned by the select() method.
I use connect().get() to load the remote page into a Document variable, like this:
Document doc = Jsoup.connect(url).get();
String expr = "table>tr>td";
String parsedTxt = doc.select(expr).text();
But when I parse the same page from my local disk (after downloading it), Jsoup includes the tbody tag. My expression no longer works because it is missing the tbody element.
For the local file I use:
File input = new File(locationOfFile);
Document doc = Jsoup.parse(input, "UTF-8", "");
My Jsoup expression works only in the first case.
Is there a way to force Jsoup to recognize the tbody element (or to remove it) so that the same expression can be used in both cases?
Is this normal behavior for Jsoup?
Should I use the connect method when parsing the local page as well?
Recommended answer
It sounds like the browser you used to save the file included or created the tbody tags when it saved the file. Which browser did you use to save the file to your desktop?
I would try downloading the file manually using curl or wget, and then parsing from that file.
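If curl or wget is not at hand, the same raw download can be sketched in plain Java. This is only a minimal sketch, assuming Java 11 or later; the class name RawPageSaver and its two helper methods are hypothetical names chosen for illustration, not part of Jsoup. The point is that the response bytes are written to disk unchanged, so no browser gets a chance to rewrite the markup before Jsoup reads it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves a page exactly as the server sends it.
class RawPageSaver {

    // Copies the bytes as received, overwriting any existing file.
    static void save(InputStream body, Path target) throws IOException {
        Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Fetches the URL and stores the unmodified response body on disk.
    static void download(String url, Path target)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream body = response.body()) {
            save(body, target);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed usage: java RawPageSaver.java <url> <output-file>
        download(args[0], Path.of(args[1]));
    }
}
```

The saved file can then be passed to Jsoup.parse(input, "UTF-8", "") as in the question, which makes it easier to tell whether the tbody difference comes from the saved file or from the parser.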