How to do web scraping using HtmlUnitDriver?

Problem description

I am getting something like this.

Hi, I am scraping a web page using Selenium WebDriver, and I am able to get my data. The problem is that this interacts directly with a browser. I don't want to open a web browser; I just want to scrape all the data as it is.

How can I achieve this?

Here is my code:

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.support.ui.Select;

    public class GetData {

        public static void main(String args[]) throws InterruptedException {
            String sDate = "27/03/2014";
            WebDriver driver = new FirefoxDriver();
            String url="http://www.upmandiparishad.in/commodityWiseAll.aspx";
            driver.get(url);
            Thread.sleep(5000);
            // select the commodity
            new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");
            // enter the date
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);
            Thread.sleep(3000);
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
            Thread.sleep(5000);

            // get only the table text
            WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
            String htmlTableText = findElement.getText();
            // do whatever you want now; these are the raw table values
            System.out.println(htmlTableText);


            driver.close();
            driver.quit();

        }
    }


My updated code:



    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;
    import org.openqa.selenium.support.ui.Select;

    public class Getdata1 {

        public static void main(String args[]) throws InterruptedException {
            WebDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_3_6);
            driver.get("http://www.upmandiparishad.in/commodityWiseAll.aspx");
            System.out.println(driver.getPageSource());
            Thread.sleep(5000);
            // select the commodity
            new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");

            String sDate = "12/04/2014"; // the date you want
            driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);

            driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
            Thread.sleep(3000);

            // get only the table text
            WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
            String htmlTableText = findElement.getText();
            // do whatever you want now; these are the raw table values
            System.out.println(htmlTableText);

            driver.close();
            driver.quit();
        }
    }

Thanks in advance.

Recommended answer

Use HtmlUnit directly, or the HtmlUnitDriver provided by Selenium. Both run without opening a browser window:

    WebDriver driver = new HtmlUnitDriver(BrowserVersion.FIREFOX_17);
    driver.get("http://www.upmandiparishad.in/commodityWiseAll.aspx");
    System.out.println(driver.getPageSource());
    Thread.sleep(5000);
    // select the commodity
    new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");

    String sDate = "12/04/2014"; // the date you want
    driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);

    driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
    Thread.sleep(3000);

    // get only the table text
    WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
    String htmlTableText = findElement.getText();
    // do whatever you want now; these are the raw table values
    System.out.println(htmlTableText);

    driver.close();
    driver.quit();

To get tabular output, you can try something like this:

    // Split the raw table text on spaces. A token that parses as an
    // integer is treated as the start of a new row.
    String[] arrCells = htmlTableText.split(" ");
    for (int i = 0; i < arrCells.length; i++) {
        boolean bIsANumber;
        try {
            Integer.parseInt(arrCells[i]);
            bIsANumber = true;
        } catch (NumberFormatException ex) {
            bIsANumber = false;
        }

        if (bIsANumber) {
            System.out.print("\n" + arrCells[i] + "\t");
        } else {
            System.out.print(arrCells[i] + "\t");
        }
    }
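
The same idea can be packaged as a small helper that returns the rows instead of printing them, which makes it easier to reuse and to check. This is only a sketch of the heuristic above, not part of Selenium or HtmlUnit: the class name `TableRowSplitter`, the method `toRows`, and the sample text are all hypothetical. It splits on any whitespace (so line breaks in the table text are handled too) and starts a new row whenever a token parses as an integer.

```java
import java.util.ArrayList;
import java.util.List;

public class TableRowSplitter {

    // Returns true when the token is a plain integer, e.g. a serial number.
    static boolean isInteger(String token) {
        try {
            Integer.parseInt(token);
            return true;
        } catch (NumberFormatException ex) {
            return false;
        }
    }

    // Groups whitespace-separated tokens into tab-separated rows.
    // An integer token closes the current row and opens a new one.
    static List<String> toRows(String tableText) {
        List<String> rows = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tableText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (isInteger(token) && current.length() > 0) {
                rows.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\t');
            }
            current.append(token);
        }
        if (current.length() > 0) {
            rows.add(current.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not real data from the site.
        String sample = "1 Agra 1250.50\n2 Kanpur 1190.00";
        for (String row : toRows(sample)) {
            System.out.println(row);
        }
    }
}
```

Note the limitation this shares with the loop above: any cell whose value is a whole number, not just the serial number, will also start a new row, so the heuristic only fits tables where the first column is the only integer column.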
