How to read HTML content from this page with Java?

Question

My Java app is trying to read content from the following url : https://www.iplocation.net/?query=62.92.63.48

I used the following method:

  StringBuffer readFromUrl(String Url)
  {
    StringBuffer sb=new StringBuffer();
    BufferedReader in=null;
    
    try
    {
      in=new BufferedReader(new InputStreamReader(new URL(Url).openStream()));
      String inputLine;
    
      while ((inputLine=in.readLine()) != null) sb.append(inputLine+"\n");
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally 
    {
      try 
      {
        if (in!=null)
        {
          in.close();
          in=null;
        }
      }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return sb;
  }

Usually it works fine for other urls, but for this one, the result is different from what's showing in a browser, it looks like this :

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function(){function getSessionCookies(){var cookieArray=new Array();var cName=/^\s?incap_ses_/;var c=document.cookie.split(";");for(var i=0;i<c.length;i++){var key=c[i].substr(0,c[i].indexOf("="));var value=c[i].substr(c[i].indexOf("=")+1,c[i].length);if(cName.test(key)){cookieArray[cookieArray.length]=value}}return cookieArray}function setIncapCookie(vArray){var res;try{var cookies=getSessionCookies();var digests=new Array(cookies.length);for(var i=0;i<cookies.length;i++){digests[i]=simpleDigest((vArray)+cookies[i])}res=vArray+",digest="+(digests.join())}catch(e){res=vArray+",digest="+(encodeURIComponent(e.toString()))}createCookie("___utmvc",res,20)}function simpleDigest(mystr){var res=0;for(var i=0;i<mystr.length;i++){res+=mystr.charCodeAt(i)}return res}function createCookie(name,value,seconds){var expires="";if(seconds){var date=new Date();date.setTime(date.getTime()+(seconds*1000));var expires="; expires="+date.toGMTString()}document.cookie=name+"="+value+expires+"; path=/"}function test(o){var res="";var vArray=new Array();for(var j=0;j<o.length;j++){var test=o[j][0];switch(o[j][1]){case"exists":try{if(typeof(eval(test))!="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=true")}else{vArray[vArray.length]=encodeURIComponent(test+"=false")}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=false")}break;case"value":try{try{res=eval(test);if(typeof(res)==="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=undefined")}else if(res===null){vArray[vArray.length]=encodeURIComponent(test+"=null")}else{vArray[vArray.length]=encodeURIComponent(test+"="+res.toString())}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=cannot evaluate");break}break}catch(e){vArray[vArray.length]=encodeURIComponent(test+"="+e)}case"plugin_extentions":try{var extentions=[];try{i=extentions.indexOf("i")}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=indexOf is not a function");break}try{var num=navigator.plugins.length 
if(num==0||num==null){vArray[vArray.length]=encodeURIComponent("plugin_ext=no plugins");break}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=cannot evaluate");break}for(var i=0;i<navigator.plugins.length;i++){if(typeof(navigator.plugins[i])=="undefined"){vArray[vArray.length]=encodeURIComponent("plugin_ext=plugins[i] is undefined");break}var filename=navigator.plugins[i].filename var ext="no extention";if(typeof(filename)=="undefined"){ext="filename is undefined"}else if(filename.split(".").length>1){ext=filename.split('.').pop()}if(extentions.indexOf(ext)<0){extentions.push(ext)}}for(i=0;i<extentions.length;i++){vArray[vArray.length]=encodeURIComponent("plugin_ext="+extentions[i])}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext="+e)}break}}vArray=vArray.join();return vArray}var o=[["navigator","exists"],["navigator.vendor","value"],["navigator.appName","value"],["navigator.plugins.length==0","value"],["navigator.platform","value"],["navigator.webdriver","value"],["platform","plugin_extentions"],["ActiveXObject","exists"],["webkitURL","exists"],["_phantom","exists"],["callPhantom","exists"],["chrome","exists"],["yandex","exists"],["opera","exists"],["opr","exists"],["safari","exists"],["awesomium","exists"],["puffinDevice","exists"],["navigator.cpuClass","exists"],["navigator.oscpu","exists"],["navigator.connection","exists"],["window.outerWidth==0","value"],["window.outerHeight==0","value"],["window.WebGLRenderingContext","exists"],["document.documentMode","value"],["eval.toString().length","value"]];try{setIncapCookie(test(o));document.createElement("img").src="/_Incapsula_Resource?SWKMTFSR=1&e="+Math.random()}catch(e){img=document.createElement("img");img.src="/_Incapsula_Resource?SWKMTFSR=1&e="+e}})();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D2273746128......6F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>

So what's the proper way to read the html content that shows up in the browser, in this case ?

Edit : After reading the suggestions, I've updated my program to look like the following :

StringBuilder response=new StringBuilder();
String USER_AGENT="Mozilla/5.0",inputLine;
BufferedReader in=null;    

try
{
  HttpURLConnection con=(HttpURLConnection)new URL(Url).openConnection();
  con.setRequestMethod("GET");
  con.setRequestProperty("Accept-Charset","UTF-8");
  con.setRequestProperty("User-Agent",USER_AGENT);                         // Add request header

  int responseCode=con.getResponseCode();
  in=new BufferedReader(new InputStreamReader(con.getInputStream()));
  while ((inputLine=in.readLine())!=null) { response.append(inputLine); }
  in.close();
}
catch (Exception e) { e.printStackTrace(); }
finally 
{
  try { if (in!=null) in.close(); }
  catch (Exception ex) { ex.printStackTrace(); }
}
return response.toString();

Yet it still didn't work; the response I got looks like this:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&xinfo=8-75933493-0 0NNN RT(1479758027223 127) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U10000&incident_id=516000100118713619-514529209419563176&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 516000100118713619-514529209419563176</iframe></body></html>

Could someone show some sample code that works ?

Thanks to @thatguy I've modified my program to look like the following :

import java.util.*;
import java.util.concurrent.*;
import java.io.*;
import java.net.*;
import java.util.Map.Entry;

public class Read_From_Url_Runner implements Callable<String[]>
{
  int Id;
  String Read_From_Url_Result[]=null,IP_Location_Url="https://www.iplocation.net/?query=[IP]",IP="62.92.63.48",Cookie,Result[],A_Url;
  
  public Read_From_Url_Runner(int Id)
  {
    this.Id=Id;
    
    A_Url=IP_Location_Url.replace("[IP]",IP);
    Cookie=getIncapsulaCookie(A_Url);
    Out("Cookie = [ "+Cookie+" ]");
    
    try
    {
      Result=call();
//      for (int i=0;i<Result.length;i++) Out(Result[i]);
    }
    catch (Exception e) { e.printStackTrace(); }
  }
  
  public String[] call() throws InterruptedException
  {
    String Text;
    
    try
    {
      Text=readUrl(A_Url,Cookie);
      Out(Text);
    }
    catch (Exception e)
    {
      Out(" --> Error in data : IP = "+IP);
//    e.printStackTrace();
    }
    return Read_From_Url_Result;
  }
  
  public static String readUrl(String url,String incapsulaCookie)
  {
    StringBuilder response=new StringBuilder();
    String USER_AGENT="Mozilla/5.0",inputLine;
    BufferedReader in=null;

    try
    {
      HttpURLConnection connection=(HttpURLConnection)new URL(url).openConnection();
      connection.setRequestMethod("GET");
      connection.setRequestProperty("Accept","text/html; charset=UTF-8");
      connection.setRequestProperty("User-Agent",USER_AGENT);
      connection.setDoInput(true);
      connection.setDoOutput(true);
      connection.setRequestProperty("Cookie",incapsulaCookie);                           // Set the Incapsula cookie
      Out(connection.getRequestProperty("Cookie"));

      in=new BufferedReader(new InputStreamReader(connection.getInputStream()));
      while ((inputLine=in.readLine())!=null) { response.append(inputLine+"\n"); }
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return response.toString();
  }
  
  public static String getIncapsulaCookie(String url)
  {
    String USER_AGENT="Mozilla/5.0",incapsulaCookie=null,visid=null,incap=null;          // Cookies for Incapsula, preserve order
    BufferedReader in=null;

    try
    {
      HttpURLConnection cookieConnection=(HttpURLConnection)new URL(url).openConnection();
      cookieConnection.setRequestMethod("GET");
      cookieConnection.setRequestProperty("Accept","text/html; charset=UTF-8");
      cookieConnection.setRequestProperty("User-Agent",USER_AGENT);
      cookieConnection.connect();
      
      for (Entry<String,List<String>> header : cookieConnection.getHeaderFields().entrySet())
      {
        if (header.getKey()!=null && header.getKey().equals("Set-Cookie"))               // Incapsula gives you the required cookies
        {
          for (String cookieValue : header.getValue())                                   // Search for the desired cookies
          {
            if (cookieValue.contains("visid")) visid=cookieValue.substring(0,cookieValue.indexOf(";")+1);
            if (cookieValue.contains("incap_ses")) incap=cookieValue.substring(0,cookieValue.indexOf(";"));
          }
        }
      }
      incapsulaCookie=visid+" "+incap;
      cookieConnection.disconnect();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return incapsulaCookie;
  }
  
  private static void out(String message) { System.out.print(message); }
  private static void Out(String message) { System.out.println(message); }
  
  public static void main(String[] args)
  {
    final Read_From_Url_Runner demo=new Read_From_Url_Runner(0);
  }
}

But this only got the first portion of the response as shown below :

What I really wanted to get is something like the following :

This result was obtained by running my program from: How to shut down Javafx?

Recommended answer

The problem you are facing is probably the HTTP request headers, which you do not set explicitly. Websites are often delivered in different representations depending on attributes in the HTTP headers (and payload), for example to serve desktop or mobile clients appropriately. Your code sets nothing, so it sends whatever default headers the library chooses. If you inspect the concrete HTTP headers your browser sends, there will most likely be differences (such as the user agent or encoding). If you rebuild those headers in your code, the result should be the same.

Additionally, you could use an HttpURLConnection, so you can easily set or read the corresponding HTTP headers, like in this SO post. Otherwise, for URLConnection, look here.
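
As a rough illustration of setting headers explicitly, the sketch below prepares a request with HttpURLConnection and prints the headers that would be sent, so they can be compared with what the browser sends. The class name BrowserLikeRequest, the prepare helper, and the Accept and Accept-Language values are assumptions made for this example; copy the real values from your browser's developer tools.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class BrowserLikeRequest {

    // Same short value the question uses; a real browser sends a much longer string.
    static final String USER_AGENT = "Mozilla/5.0";

    /** Prepares (but does not send) a GET request carrying explicit headers. */
    static HttpURLConnection prepare(String url) {
        try {
            // openConnection() only creates the object; nothing is sent yet.
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("GET");
            con.setRequestProperty("User-Agent", USER_AGENT);
            // Assumed example values, not taken from the original post.
            con.setRequestProperty("Accept", "text/html,application/xhtml+xml");
            con.setRequestProperty("Accept-Language", "en-US,en;q=0.8");
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection con = prepare("https://www.iplocation.net/?query=62.92.63.48");
        // List the request headers set so far, for comparison with the browser's request.
        con.getRequestProperties()
           .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Because the request is only prepared, this can be run without touching the network; calling getInputStream() on the returned connection would actually send it.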

Further investigation

Your method retrieves a special error page, which indicates that the website uses additional security features from Incapsula. The site you get looks like this:

As I investigated the headers, I noticed two cookie strings that need to be present, so you get directly to the website, instead of the security check:

visid_incap_...=...
incap_ses_..._...=...
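
To make the two names concrete, here is a small sketch that picks those two cookies out of a list of Set-Cookie header values and joins them into a single Cookie header. The class IncapsulaCookies, its helper names, and the sample header values are made up for illustration. Unlike plain string concatenation, it returns null when either cookie is missing instead of producing a header that contains the text "null".

```java
import java.util.Arrays;
import java.util.List;

class IncapsulaCookies {

    /** Returns the "name=value" part of the first value starting with prefix, or null. */
    static String pick(List<String> setCookieValues, String prefix) {
        for (String value : setCookieValues) {
            if (value.startsWith(prefix)) {
                int end = value.indexOf(';');
                // Drop attributes such as "expires" and "path" after the first ';'.
                return end < 0 ? value : value.substring(0, end);
            }
        }
        return null;
    }

    /** Builds the Cookie request header, or returns null if either cookie is missing. */
    static String cookieHeader(List<String> setCookieValues) {
        String visid = pick(setCookieValues, "visid_incap_");
        String incap = pick(setCookieValues, "incap_ses_");
        if (visid == null || incap == null) {
            return null;
        }
        return visid + "; " + incap;
    }

    public static void main(String[] args) {
        // Made-up header values, shaped like the two cookies named above.
        List<String> headers = Arrays.asList(
                "visid_incap_123=abc; expires=Wed, 01 Jan 2020 00:00:00 GMT; path=/",
                "incap_ses_45_123=xyz; path=/");
        System.out.println(cookieHeader(headers)); // visid_incap_123=abc; incap_ses_45_123=xyz
    }
}
```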

What you can do is land on the error page with a single request, which gives you both cookie strings in the Set-Cookie headers. Then you can directly request the website with the cookie strings set as visid_incap_...=...; incap_ses_..._...=.... You can execute requests multiple times, until the cookie expires. Just check for the error page to detect that. Here is working code, which obviously lacks style and additional checks, but solves your problem. The rest is up to you.

public static String getIncapsulaCookie(String url) {

    String USER_AGENT = "Mozilla/5.0";
    BufferedReader in = null;

    String incapsulaCookie = null;

    try {

        HttpURLConnection cookieConnection =
                (HttpURLConnection) new URL(url).openConnection();
        cookieConnection.setRequestMethod("GET");
        cookieConnection.setRequestProperty("Accept",
                "text/html; charset=UTF-8");
        cookieConnection.setRequestProperty("User-Agent", USER_AGENT);

        // Disable 'keep-alive'
        cookieConnection.setRequestProperty("Connection", "close");

        // Cookies for Incapsula, preserve order
        String visid = null;
        String incap = null;

        cookieConnection.connect();

        for (Entry<String, List<String>> header : cookieConnection
                .getHeaderFields().entrySet()) {

            // Incapsula gives you the required cookies
            if (header.getKey() != null
                    && header.getKey().equals("Set-Cookie")) {

                // Search for the desired cookies
                for (String cookieValue : header.getValue()) {
                    if (cookieValue.contains("visid")) {
                        visid = cookieValue.substring(0,
                                cookieValue.indexOf(";") + 1);
                    }
                    if (cookieValue.contains("incap_ses")) {
                        incap = cookieValue.substring(0,
                                cookieValue.indexOf(";"));
                    }
                }
            }
        }

        incapsulaCookie = visid + " " + incap;

        // Explicitly disconnect, also essential in this method!
        cookieConnection.disconnect();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null)
                in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    return incapsulaCookie;

}

This method extracts the Incapsula cookie for you. Here is a modified version of your method, which uses the cookie:

public static String readUrl(String url, String incapsulaCookie) {

    StringBuilder response = new StringBuilder();
    String USER_AGENT = "Mozilla/5.0", inputLine;
    BufferedReader in = null;

    try {

        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "text/html; charset=UTF-8");
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // Set the Incapsula cookie
        connection.setRequestProperty("Cookie", incapsulaCookie);

        in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()));

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }

        in.close();

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (in != null)
                in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    return response.toString();

}

As I have observed, the user agent and other attributes do not seem to matter. You can now call getIncapsulaCookie(String url) once or whenever you want a new cookie, to get the cookie and readUrl(String url, String incapsulaCookie) multiple times to request the page, until the cookie expires. The result is the complete HTML page, as seen in this partial image:

Important details: There are two essential commands in the getIncapsulaCookie(...) method, namely cookieConnection.setRequestProperty("Connection", "close"); and cookieConnection.disconnect();. Both are required if you want to call readUrl(...) immediately afterwards. If you omit these commands, the HTTP connection will be kept alive on the server side after you have received the cookie, and the next call to readUrl(...) will return the wrong page. You can try this by leaving out these commands and instead calling getIncapsulaCookie(...), waiting 5 to 65 seconds, and then calling readUrl(...). You will see that this also works, because the connection times out automatically. See also here.
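
The "reuse the cookie until it expires, then detect the error page" idea described above could be wired up roughly as follows. This is only a sketch: the class CookieRefreshingReader is hypothetical, and the two marker strings are simply taken from the two block pages shown in the question, so they are a heuristic that may not match what the service returns today. The cookie source and page reader are passed in as functions, so in real use they would wrap getIncapsulaCookie(url) and readUrl(url, cookie).

```java
import java.util.function.Function;
import java.util.function.Supplier;

class CookieRefreshingReader {

    private final Supplier<String> cookieSource;        // e.g. () -> getIncapsulaCookie(url)
    private final Function<String, String> pageReader;  // e.g. cookie -> readUrl(url, cookie)
    private String cookie;                               // cached until a block page shows up

    CookieRefreshingReader(Supplier<String> cookieSource,
                           Function<String, String> pageReader) {
        this.cookieSource = cookieSource;
        this.pageReader = pageReader;
    }

    /** Heuristic: markers copied from the two block pages quoted in the question. */
    static boolean looksBlocked(String html) {
        return html == null
                || html.contains("Incapsula incident ID")
                || html.contains("content.incapsula.com/jsTest.html");
    }

    /** Reads the page, fetching a fresh cookie once if the cached one is rejected. */
    String read() {
        if (cookie == null) {
            cookie = cookieSource.get();
        }
        String html = pageReader.apply(cookie);
        if (looksBlocked(html)) {
            // The cached cookie has probably expired: get a new one and retry once.
            cookie = cookieSource.get();
            html = pageReader.apply(cookie);
        }
        return looksBlocked(html) ? null : html;
    }

    public static void main(String[] args) {
        int[] issued = {0};
        // Stand-ins for the real network calls: the first cookie is treated as expired.
        CookieRefreshingReader reader = new CookieRefreshingReader(
                () -> "cookie-" + (++issued[0]),
                c -> c.equals("cookie-1")
                        ? "Request unsuccessful. Incapsula incident ID: 1"
                        : "<html>ok</html>");
        System.out.println(reader.read());
        System.out.println("cookies fetched: " + issued[0]);
    }
}
```

A caller would create one reader per URL and call read() as often as needed; a null result means the page was still blocked after one refresh.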
