Web scraping with jsoup and Selenium


Problem description


I want to extract some information from this dynamic website with Selenium and Jsoup. To get to the information I have to click the button "Details öffnen". The first picture shows the website before clicking the button and the second shows it after clicking the button. The information marked in red is what I want to extract.

I first tried to extract the information with Jsoup alone, but I was told that Jsoup cannot handle dynamic content, so I am now trying to extract it with Selenium and Jsoup, as you can see in the source code. However, I am not sure whether Selenium is the right tool for this, so there may be a simpler way to get the information I need. The only requirement is that it can be done in Java.

The next two pictures show the HTML code before clicking the button and after clicking the button.

public static void main(String[] args) {

    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("http://www.seminarbewertung.de/seminar-bewertungen?id=3448");
    //driver.findElement(By.cssSelector("input[type='button'][value='Details öffnen']")).click();
    WebElement webElement = driver.findElement(By.cssSelector("input[type='submit'][value='Details öffnen'][rating_id='2318']"));
    JavascriptExecutor executor = (JavascriptExecutor)driver;
    executor.executeScript("arguments[0].click();", webElement);
    String html_content = driver.getPageSource();
    //driver.close();


    Document doc1 = Jsoup.parse(html_content);
    System.out.println("Hallo");

    Elements elements = doc1.getAllElements();
    for (Element element : elements) {
        System.out.println(element);
    }

}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

With this source code I cannot find the div element that contains the information I want to extract.

It would be really great if somebody could help me with this.

Solution

It is true that Jsoup can't handle content that is generated by JavaScript, but in your case the button simply makes an Ajax request, and that is something Jsoup can do quite well.

I'd suggest making one call to retrieve the buttons and their ids, and then making successive calls (Ajax POSTs) to retrieve the details (comments or whatever).

The code could be:

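    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class RatingDetails {
        public static void main(String[] args) throws IOException {
            Document document = Jsoup.connect("http://www.seminarbewertung.de/seminar-bewertungen?id=3448").get();
            // we retrieve the "Details öffnen" buttons
            Elements select = document.select("input.rating_expand");
            if (select.isEmpty()) {
                // guard added so get(0) below cannot throw if the page layout changes
                System.out.println("No rating buttons found");
                return;
            }
            // we go for the first one
            Element element = select.get(0);
            // we pick its id
            String ratingId = element.attr("rating_id");

            // the Ajax call: the same POST the button triggers in the browser
            Document document2 = Jsoup.connect("http://www.seminarbewertung.de/bewertungs-details-abfragen")
                    .header("Accept", "*/*")
                    .header("X-Requested-With", "XMLHttpRequest")
                    .data("rating_id", ratingId)
                    .post();

            // we find the comment
            // note that this selector is only a demo, feel free to adjust it to your needs
            Elements select2 = document2.select("div.ratingbox div.panel-body.text-center");
            // we are done!
            System.out.println(select2.text());
        }
    }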

This code will print the desired comment:

Das Eingehen auf individuelle Bedürfnisse eines jeden einzelnen Teilnehmer scheint mir ein Markenzeichen von Fromm zu sein. Bei einem früheren Seminar habe ich dies auch schon so erlebt!
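In case it helps to see why no browser is needed: the Ajax call is just a plain HTTP POST with one form field. Jsoup's `.data(...).post()` sends that field as a form-encoded body, so the same request can be described with nothing but the JDK. Below is a minimal sketch, assuming Java 11+ for `java.net.http`; the class and method names (`DetailsRequestSketch`, `formBody`, `buildDetailsRequest`) are made up for illustration, and the code only builds the request without sending it.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Jsoup or of the original answer.
class DetailsRequestSketch {

    // Endpoint taken from the Jsoup code above.
    static final String DETAILS_URL =
            "http://www.seminarbewertung.de/bewertungs-details-abfragen";

    // Encodes the single form field as application/x-www-form-urlencoded.
    static String formBody(String ratingId) {
        return "rating_id=" + URLEncoder.encode(ratingId, StandardCharsets.UTF_8);
    }

    // Builds (but does not send) the POST request with the same headers
    // the Jsoup example sets.
    static HttpRequest buildDetailsRequest(String ratingId) {
        return HttpRequest.newBuilder(URI.create(DETAILS_URL))
                .header("Accept", "*/*")
                .header("X-Requested-With", "XMLHttpRequest")
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(ratingId)))
                .build();
    }

    public static void main(String[] args) {
        // "2318" is the rating_id used in the question's CSS selector.
        HttpRequest request = buildDetailsRequest("2318");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody("2318"));
    }
}
```

Sending it would be a matter of passing the request to an `HttpClient`, but for the actual scraping the Jsoup version above is more convenient because it parses the response for you.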

I hope it will help.

Have a happy new year!
