Web scraping with jsoup and selenium


Question

I want to extract some information from this dynamic website with Selenium and jsoup. To get at the information, I have to click the "Details öffnen" button. The first picture shows the website before clicking the button, and the second shows it after clicking. The information marked in red is what I want to extract.

I first tried to extract the information with jsoup alone, but I was told that jsoup cannot handle dynamic content, so I am now trying to extract it with Selenium and jsoup, as you can see in the source code. However, I am not sure whether Selenium is the right tool for this, so there may be simpler ways to extract the information I need. It is important, though, that this can be done in Java.

The next two pictures show the HTML code before and after clicking the button.

public static void main(String[] args) {
    
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("http://www.seminarbewertung.de/seminar-bewertungen?id=3448");
    //driver.findElement(By.cssSelector("input[type='button'][value='Details öffnen']")).click();
    WebElement webElement = driver.findElement(By.cssSelector("input[type='submit'][value='Details öffnen'][rating_id='2318']"));
    JavascriptExecutor executor = (JavascriptExecutor)driver;
    executor.executeScript("arguments[0].click();", webElement);
    String html_content = driver.getPageSource();
    //driver.close();
    
    
    Document doc1 = Jsoup.parse(html_content);
    System.out.println("Hallo");
    
    Elements elements = doc1.getAllElements();
    for (Element element : elements) {
        System.out.println(element);
    }

}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

With this source code I cannot find the div element that contains the information I want to extract.

It would be really great if somebody could help me with this.

Recommended answer

It is true that jsoup cannot handle dynamic content that is generated by JavaScript, but in your case the button makes an Ajax request, and jsoup can issue that request itself quite well.

I'd suggest making one call to retrieve the buttons and their ids, and then making successive calls (Ajax POSTs) to retrieve the details (comments or whatever else).

The code could be:

    // Load the page and retrieve the buttons
    Document document = Jsoup.connect("http://www.seminarbewertung.de/seminar-bewertungen?id=3448").get();
    Elements select = document.select("input.rating_expand");
    // Take the first button and read its id
    Element element = select.get(0);
    String ratingId = element.attr("rating_id");

    // Replay the Ajax call that the button would make
    Document document2 = Jsoup.connect("http://www.seminarbewertung.de/bewertungs-details-abfragen")
            .header("Accept", "*/*")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("rating_id", ratingId)
            .post();

    // Find the comment; this selector is only a demo, adjust it to your needs
    Elements select2 = document2.select("div.ratingbox div.panel-body.text-center");
    System.out.println(select2.text());

This code prints the desired text:

Das Eingehen auf individuelle Bedürfnisse eines jeden einzelnen Teilnehmer scheint mir ein Markenzeichen von Fromm zu sein. Bei einem früheren Seminar habe ich dies auch schon so erlebt!
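
To make explicit what the Ajax call amounts to at the HTTP level, here is a minimal sketch that builds the same POST with only the JDK's `java.net.http` API (Java 11+). It assumes, as the jsoup code above does, that the endpoint accepts a form-encoded `rating_id` field; the `Content-Type` header is my assumption about what the site expects. The class name `RatingDetailsRequest` and both helper methods are hypothetical, and the id `2318` is simply the one that appears in the question's selector. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class RatingDetailsRequest {

    // Endpoint taken from the jsoup code in the answer.
    static final String DETAILS_URL =
            "http://www.seminarbewertung.de/bewertungs-details-abfragen";

    // Encodes key/value pairs as application/x-www-form-urlencoded.
    public static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Builds (but does not send) the POST that the button's Ajax call is assumed to make.
    public static HttpRequest buildDetailsRequest(String ratingId) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("rating_id", ratingId);
        return HttpRequest.newBuilder(URI.create(DETAILS_URL))
                .header("Accept", "*/*")
                .header("X-Requested-With", "XMLHttpRequest")
                // Assumption: the server expects a standard form-encoded body.
                .header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildDetailsRequest("2318");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formEncode(Map.of("rating_id", "2318")));
    }
}
```

To actually fetch the details, the request could be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and the response body handed to `Jsoup.parse(...)`, after which the same selector as above applies. For most cases jsoup's own `connect(...).post()` remains the shorter route.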

I hope it helps.

Happy New Year!
