Web scraping with jsoup and selenium


Question

I want to extract some information from this dynamic website with Selenium and Jsoup. To get to that information I have to click the button "Details öffnen". The first picture shows the website before clicking the button, and the second shows it after clicking. The information marked in red is what I want to extract.

I first tried to extract the information with Jsoup alone, but I was told that Jsoup cannot handle dynamic content, so I am now trying Selenium together with Jsoup, as you can see in the source code. However, I am not sure Selenium is the right tool for this, so there may be simpler ways to get the information I need. It is important that the solution works in Java.

The next two pictures show the HTML code before and after clicking the button.

public static void main(String[] args) {
    
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("http://www.seminarbewertung.de/seminar-bewertungen?id=3448");
    //driver.findElement(By.cssSelector("input[type='button'][value='Details öffnen']")).click();
    WebElement webElement = driver.findElement(By.cssSelector("input[type='submit'][value='Details öffnen'][rating_id='2318']"));
    JavascriptExecutor executor = (JavascriptExecutor)driver;
    executor.executeScript("arguments[0].click();", webElement);
    String html_content = driver.getPageSource();
    //driver.close();
    
    
    Document doc1 = Jsoup.parse(html_content);
    System.out.println("Hallo");
    
    Elements elements = doc1.getAllElements();
    for (Element element : elements) {
        System.out.println(element);
    }

}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

With this source code I cannot find the div element that contains the information I want to extract.

It would be really great if somebody could help me with this.

Recommended answer

It is true that Jsoup can't handle dynamic content that is generated by JavaScript, but in your case the button just makes an Ajax request, and Jsoup can reproduce that request quite well.
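To make the phrase "the button is making an Ajax request" more concrete: such a request is an ordinary HTTP POST with a form-encoded body. The sketch below builds (but does not send) a request of that shape with the JDK's own `java.net.http` API (Java 11+). The endpoint, the `rating_id` parameter, and the headers are taken from the Jsoup code further down; the id `2318` comes from the question's selector; the class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class AjaxRequestSketch {

    /** Builds the POST that the "Details öffnen" button is assumed to trigger. */
    static HttpRequest detailsRequest(String ratingId) {
        // Form-encoded body, the same encoding Jsoup uses for .data(...).post()
        String form = "rating_id=" + URLEncoder.encode(ratingId, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder()
                .uri(URI.create("http://www.seminarbewertung.de/bewertungs-details-abfragen"))
                .header("Accept", "*/*")
                // Header that many servers use to recognise Ajax calls
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    public static void main(String[] args) {
        // Only inspects the request; nothing is sent over the network here
        HttpRequest request = detailsRequest("2318");
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Since no JavaScript has to run to produce this request, any HTTP client can issue it, and that is why a browser driver is not strictly needed here.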

I'd suggest making one call to retrieve the buttons and their ids, and then making successive calls (Ajax POSTs) to retrieve the details (comments or whatever else you need).

The code could be:

    Document document = Jsoup.connect("http://www.seminarbewertung.de/seminar-bewertungen?id=3448").get();
    //we retrieve the buttons
    Elements select = document.select("input.rating_expand");
    //we go for the first
    Element element = select.get(0);
    //we pick the id
    String ratingId = element.attr("rating_id");

    //the Ajax call
    Document document2 = Jsoup.connect("http://www.seminarbewertung.de/bewertungs-details-abfragen")
            .header("Accept", "*/*")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("rating_id", ratingId)
            .post();

    //we find the comment, and we are done
    //note that this selector is only as a demo, feel free to adjust to your needs
    Elements select2 = document2.select("div.ratingbox div.panel-body.text-center");
    //We are done!
    System.out.println(select2.text());

This code will print the desired text:

Das Eingehen auf individuelle Bedürfnisse eines jeden einzelnen Teilnehmer scheint mir ein Markenzeichen von Fromm zu sein. Bei einem früheren Seminar habe ich dies auch schon so erlebt!
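
The code above fetches the details for the first button only. To follow the suggestion of making successive calls, the same two steps can be wrapped in a loop over every button. This is a minimal sketch, assuming every button carries the `rating_expand` class and a `rating_id` attribute as in the snippet above; the class name `AllRatingDetails` and its helper methods are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AllRatingDetails {

    static final String PAGE_URL =
            "http://www.seminarbewertung.de/seminar-bewertungen?id=3448";
    static final String DETAILS_URL =
            "http://www.seminarbewertung.de/bewertungs-details-abfragen";

    /** Collects the rating_id attribute of every "Details öffnen" button. */
    static List<String> ratingIds(Document page) {
        List<String> ids = new ArrayList<>();
        for (Element button : page.select("input.rating_expand[rating_id]")) {
            ids.add(button.attr("rating_id"));
        }
        return ids;
    }

    /** Replays the Ajax POST for one rating and returns the detail text. */
    static String fetchDetails(String ratingId) throws IOException {
        Document details = Jsoup.connect(DETAILS_URL)
                .header("Accept", "*/*")
                .header("X-Requested-With", "XMLHttpRequest")
                .data("rating_id", ratingId)
                .post();
        // Same demo selector as above; adjust it to the part you need
        return details.select("div.ratingbox div.panel-body.text-center").text();
    }

    public static void main(String[] args) throws IOException {
        Document page = Jsoup.connect(PAGE_URL).get();
        for (String id : ratingIds(page)) {
            System.out.println(id + ": " + fetchDetails(id));
        }
    }
}
```

Keeping the id extraction separate from the network call means the parsing step can be checked against saved HTML without touching the site. When looping over many ratings, a short pause between requests is kinder to the server.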

I hope it helps.

Happy New Year!
