Extending a basic web crawler to filter status codes and HTML


Question



I followed a tutorial on writing a basic web crawler in Java and have got something with basic functionality.

At the moment it just retrieves the HTML from the site and prints it to the console. I was hoping to extend it so it can filter out specifics like the HTML page title and the HTTP status code?

I found this library: http://htmlparser.sourceforge.net/ ... which I think might be able to do the job for me but could I do it without using an external library?

Here's what I have so far:

public static void main(String[] args) {

    // String representing the URL
    String input = "";

    // Check if argument added at command line
    if (args.length >= 1) {
        input = args[0];
    }

    // If no argument at command line use default
    else {
        input = "http://www.my_site.com/";
        System.out.println("\nNo argument entered so default of " + input
                + " used: \n");
    }
    // input test URL and read from file input stream
    try {

        URL testURL = new URL(input);
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                testURL.openStream()));

        // String variable to hold the returned content
        String line = "";

        // print content to console until no new lines of content
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (Exception e) {

        e.printStackTrace();
        System.out.println("Exception thrown");
    }
}

Solution

There are definitely tools out there for HTTP communication. However, if you prefer to implement one yourself - look into java.net.HttpURLConnection. It will give you more fine grained control over HTTP communications. Here's a little sample for you:

public static void main(String[] args) throws IOException
{
  URL url = new URL("http://www.google.com");
  HttpURLConnection connection = (HttpURLConnection) url.openConnection();

  connection.setRequestMethod("GET");

  String resp = getResponseBody(connection);

  System.out.println("RESPONSE CODE: " + connection.getResponseCode());
  System.out.println(resp);
}

private static String getResponseBody(HttpURLConnection connection)
    throws IOException
{
  try
  {
    BufferedReader reader = new BufferedReader(new InputStreamReader(
        connection.getInputStream()));

    StringBuilder responseBody = new StringBuilder();
    String line = "";

    while ((line = reader.readLine()) != null)
    {
      responseBody.append(line + "\n");
    }

    reader.close();
    return responseBody.toString();
  }
  catch (IOException e)
  {
    e.printStackTrace();
    return "";
  }
}
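The sample above covers the HTTP status code. For the other part of the question, the page title, a minimal sketch without any external library could pull it out of the response body with a regular expression. (This is fragile against unusual markup, which is why a real parser library is usually recommended, but it works for well-formed pages. The class and method names here are illustrative, not from the original post.)

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Case-insensitive match for <title>...</title>; DOTALL lets the
    // title content span multiple lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or "" if no <title> tag is found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "My Page"
    }
}
```

You would call `extractTitle(resp)` on the body returned by `getResponseBody(connection)` in the earlier example.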
