Why is the HTML code different when parsing a site using Jsoup than when using a browser?

Problem description

I am on the website http://www.flashscore.com/nhl/ and I am trying to extract the links of the 'Today's Matches' table.

I am trying it with the following code, but it does not work. Can you point out where the mistake is?

final Document page = Jsoup
        .connect("http://d.flashscore.com/x/feed/t_4_200_G2Op923t_1_en_1")
        .cookie("_ga", "GA1.2.47011772.1485726144")
        .referrer("http://d.flashscore.com/x/feed/proxy-local")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")
        .header("X-Fsign", "SW9D1eZo")
        .header("X-GeoIP", "1")
        .header("X-Requested-With", "XMLHttpRequest")
        .header("Accept", "*/*")
        .get();

for (Element game : page.select("table.hockey tr")) {
    Elements links = game.getElementsByClass("tr-first stage-finished");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
}

To try to fix it I started debugging. It shows that we do get the page (although the HTML we receive is rather strange). After that, the debugging showed that the for loop does not even start. I tried changing the page.select("") part to different calls (like getElementByAttribute etc.), but I have just started to learn web scraping, so I still need to get familiar with the methods for navigating a document. How am I supposed to extract this data?

Recommended answer

As mentioned in the comments, this website needs to execute some JavaScript in order to build those linkable elements. Jsoup only parses HTML; it does not run any JS, so the HTML source you get through Jsoup will not match what you see in a browser.
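
To see this difference for yourself, a minimal diagnostic sketch along these lines can help (the class name JsoupDiagnostic and the shortened user-agent string are illustrative; the URL and the table.hockey tr selector come from the question). It prints the raw HTML that Jsoup actually receives and how many rows the selector matches; since the table rows are built by JavaScript after the page loads, the count is expected to be 0 in the static source.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDiagnostic {
    public static void main(String[] args) throws Exception {
        // Fetch the static HTML only; Jsoup executes no JavaScript.
        Document page = Jsoup.connect("http://www.flashscore.com/nhl/")
                .userAgent("Mozilla/5.0")
                .get();

        // Print what Jsoup actually received (this will differ from the browser's DOM).
        System.out.println(page.html());

        // The rows of the 'Today's Matches' table are created by JavaScript,
        // so this count is expected to be 0 in the static source.
        System.out.println("rows matched: " + page.select("table.hockey tr").size());
    }
}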

You need to get the website as if you were running it in a real browser. You can do that programmatically using WebDriver and Firefox.

I've tried it with your example site and it works:

pom.xml

<project>

<modelVersion>4.0.0</modelVersion>
<groupId>com.test</groupId>
<artifactId>test</artifactId>
<version>1.0-SNAPSHOT</version>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
<packaging>jar</packaging>

<name>test</name>
<url>http://maven.apache.org</url>

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-firefox-driver</artifactId>
    <version>2.43.0</version>
  </dependency>
</dependencies>

</project>

App.java

package com.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class App {

    public static void main(String[] args) {
        App app = new App();
        List<String> links = app.parseLinks();
        links.forEach(System.out::println);
    }

    public List<String> parseLinks() {
        try {
            // Download geckodriver from https://github.com/mozilla/geckodriver/releases
            // and point this property at your local copy of the executable.
            System.setProperty("webdriver.firefox.marionette", "C:\\apps\\geckodriver.exe");
            WebDriver driver = new FirefoxDriver();
            String baseUrl = "http://www.flashscore.com/nhl/";

            driver.get(baseUrl);

            return driver.findElement(By.className("hockey"))
                    .findElements(By.tagName("tr"))
                    .stream()
                    .distinct()
                    .filter(we -> !we.getAttribute("id").isEmpty())
                    .map(we -> createLink(we.getAttribute("id")))
                    .collect(Collectors.toList());

        } catch (Exception e) {
            e.printStackTrace();
            return Collections.emptyList();
        }
    }

    private String createLink(String id) {
        return String.format("http://www.flashscore.com/match/%s/#match-summary", extractId(id));
    }

    private String extractId(String id) {
        // Strip the "x_4_" or "g_4_" prefix from the row id, leaving the match id used in the URL.
        if (id.contains("x_4_")) {
            id = id.replace("x_4_", "");
        } else if (id.contains("g_4_")) {
            id = id.replace("g_4_", "");
        }

        return id;
    }
}

Output:

http://www.flashscore.com/match/f9MJJI69/#match-summary
http://www.flashscore.com/match/zZCyd0dC/#match-summary
http://www.flashscore.com/match/drEXdts6/#match-summary
http://www.flashscore.com/match/EJOScMRa/#match-summary
http://www.flashscore.com/match/0GKOb2Cg/#match-summary
http://www.flashscore.com/match/6gLKarcm/#match-summary
...
...

PS: This works with Firefox 32.0 and Selenium 2.43.0. Using an unsupported combination of Selenium and Firefox versions is a common mistake.
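
If you end up on a newer Selenium release (3.x or later, paired with a current Firefox and geckodriver), note that the driver path is supplied through a different system property; a minimal sketch of that setup, under those version assumptions:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class NewerSeleniumSetup {
    public static void main(String[] args) {
        // With Selenium 3.x+ the geckodriver path goes in "webdriver.gecko.driver"
        // (the path below is illustrative, not part of the original answer).
        System.setProperty("webdriver.gecko.driver", "C:\\apps\\geckodriver.exe");

        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.flashscore.com/nhl/");
        // ...scrape the rows as in App.parseLinks() above...
        driver.quit();
    }
}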
