JSoup not showing all the HTML in Java (td and tr tags missing)

Question

I'm having trouble getting all the HTML under the tags. Here is my current code:

Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").get();
Elements desc = document.select("tr");

System.out.println(desc.toString());

It's for that question, and I'm trying to get the text from the question's description. But I'm not getting certain tr or td tags, like the ones for the question itself. Here is the td tag I'm trying to get:

<td class="postcell">

Under that tag is the actual post. When I print out what I'm actually getting, I see a ton of empty td tags and some comments, but not the actual post.

 <tr id="comment-37956942" class="comment ">
 <td>
 <table>
 <tbody>
 <tr>
  <td class=" comment-score"> &nbsp;&nbsp; </td>
  <td> &nbsp; </td>
  </tr>
</tbody>
</table> </td>
 <td class="comment-text">
<div style="display: block;" class="comment-body">
 <span class="comment-copy">You shouldn't parse HTML with regexes: <a   href="http://blog.codinghorror.com/parsing-html-the-cthulhu-way/" rel="nofollow">blog.codinghorror.com/parsing-html-the-cthulhu-way</a></span> –&nbsp;
 <a href="/users/25612/motob%c3%b3i" title="469 reputation" class="comment-user">motobói</a>

And it keeps on going with empty td and tr tags. I can't find the actual question. Does anyone know why this is happening?

Essentially, I just want the text from the question's post, and I don't know how to get it, so it would be nice if someone could show me how.

Recommended answer

Jsoup is a parser. It does not execute JavaScript, so any HTML that a page generates with JavaScript never reaches it. When you hit this problem, the usual way to retrieve that content is a headless browser, which includes a JavaScript engine. A popular library for this is Selenium WebDriver.

To determine whether the content you are trying to parse is generated on the server (static content) or on the client (dynamic content generated by JavaScript), you can do the following:

  1. Visit the page you want to parse
  2. Press Ctrl + U

The steps above open a new tab showing the page source, which is the HTML the server sends and therefore roughly what jsoup receives. If the content you need is not there, it is generated by JavaScript.
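The same check can be done from code. The following is a minimal sketch, assuming Java 11 or later and using only the JDK's own HTTP client rather than jsoup; the class name RawHtmlCheck and both method names are hypothetical. It downloads the raw HTML and reports whether a marker string, such as class="postcell", appears in it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: checks whether a marker appears in the HTML the server
// actually sends, before any JavaScript has run.
final class RawHtmlCheck {

    // Fetches the raw response body, roughly what "view source" shows.
    static String fetchRawHtml(String url, String userAgent)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // True if the marker (for example a class attribute) is in the raw HTML.
    static boolean isServerRendered(String rawHtml, String marker) {
        return rawHtml.contains(marker);
    }

    // Usage: java RawHtmlCheck.java <url> <user-agent> <marker>
    public static void main(String[] args) throws Exception {
        String html = fetchRawHtml(args[0], args[1]);
        System.out.println(isServerRendered(html, args[2])
                ? "Marker found: the content is in the static HTML."
                : "Marker missing: the content is probably generated by JavaScript.");
    }
}
```

If the marker is found here but not through jsoup, the difference is more likely in the request (for example the user agent) than in JavaScript.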

Follow the steps and search for the content. If it is there but jsoup still can't find it, the site most likely treats your request as coming from a bot or a mobile device. Try setting the userAgent of a desktop browser and see what happens.

Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").userAgent("USER_AGENT_HERE").get();

Most importantly, when a site exposes an API that lets users extract information programmatically, it's better to just use that. Stack Overflow has an API available.
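As a rough illustration of the API route, here is a sketch that requests the question through the Stack Exchange API using only the JDK (Java 11 or later). The class name QuestionViaApi is hypothetical. The API version 2.3, the built-in withbody filter, and the gzip-compressed response are assumptions based on the public API documentation and should be checked against it before use.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: fetches one Stack Overflow question as JSON.
final class QuestionViaApi {

    // Builds the request URL. Version "2.3" and filter "withbody" are assumptions.
    static URI questionUri(long questionId) {
        return URI.create("https://api.stackexchange.com/2.3/questions/" + questionId
                + "?site=stackoverflow&filter=withbody");
    }

    // Downloads the response and decompresses it, assuming it is gzip-encoded.
    static String fetchJson(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept-Encoding", "gzip")
                .GET()
                .build();
        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());
        try (InputStream in = new GZIPInputStream(response.body())) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // 2971155 is the question id from the URL used in the question above.
        System.out.println(fetchJson(questionUri(2971155L)));
    }
}
```

With the withbody filter, the returned JSON is expected to carry the post's HTML in a body field, which could then be handed to jsoup to strip the markup.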
