Extending a basic web crawler to filter status codes and HTML


Question



I followed a tutorial on writing a basic web crawler in Java and have got something with basic functionality.

At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so it can filter out specifics like the HTML page title and the HTTP status code?

I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me but could I do it without using an external library?

Here's what I have so far:

public static void main(String[] args) {

    // String representing the URL
    String input = "";

    // Check if argument added at command line
    if (args.length >= 1) {
        input = args[0];
    }

    // If no argument at command line use default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // input test URL and read from file input stream
    try {

        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));

        // String variable to hold the returned content
        String line = "";

        // print content to console until no new lines of content
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (Exception e) {

        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

Solution

There are definitely tools out there for HTTP communication. However, if you prefer to implement one yourself - look into java.net.HttpURLConnection. It will give you more fine grained control over HTTP communications. Here's a little sample for you:

public static void main(String[] args) throws IOException
{
  URL url = new URL("http://www.google.com");
  HttpURLConnection connection = (HttpURLConnection) url.openConnection();

  connection.setRequestMethod("GET");

  String resp = getResponseBody(connection);

  System.out.println("RESPONSE CODE: " + connection.getResponseCode());
  System.out.println(resp);
}

private static String getResponseBody(HttpURLConnection connection)
    throws IOException
{
  try
  {
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        connection.getInputStream()));

    StringBuilder responseBody = new StringBuilder();
    String line = "";

    while ((line = reader.readLine()) != null)
    {
      responseBody.append(line + "\n");
    }

    reader.close();
    return responseBody.toString();
  }
  catch (IOException e)
  {
    e.printStackTrace();
    return "";
  }
}
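The sample above covers the HTTP status code. For the other part of the question, the page title, a minimal sketch without any external library could pull it out of the response body with a regular expression. (This is fragile against unusual markup, which is why a real parser library is usually recommended, but it works for well-formed pages. The class and method names here are illustrative, not from the original post.)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Case-insensitive match for <title>...</title>; DOTALL lets the
    // title content span multiple lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or "" if no <title> tag is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "My Page"
    }
}
```

You would call `extractTitle(resp)` on the body returned by `getResponseBody(connection)` in the earlier example.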
