Why am I getting a 403 status code in Java after a while?

Views: 170 | Posted: 2020/11/25 0:16:22 | Tags: java, http, web-scraping, http-headers, http-status-code-403


Question

When I check the status codes of several sites, I start getting a 403 response code after a while. The first time I run the code, every site sends back data, but after the code repeats itself via Timer, one web page returns a 403 response code. Here is my code.

import java.io.*;
import java.net.*;
import java.util.*;

public class Main {

    public static void checkSites() {
        Timer ifSee403 = new Timer();

        try {
            File links = new File("./linkler.txt");
            Scanner scan = new Scanner(links);
            ArrayList<String> list = new ArrayList<>();
            while(scan.hasNext()) {
                list.add(scan.nextLine());
            }
            File linkStatus = new File("LinkStatus.txt");
            if(!linkStatus.exists()){
                linkStatus.createNewFile();
            }else{
                System.out.println("File already exists");
            }
            BufferedWriter writer = new BufferedWriter(new FileWriter(linkStatus));
            for(String link : list) {
                try {
                    if(!link.startsWith("http")) {
                        link = "http://"+link;
                    }
                    URL url = new URL(link);
                    HttpURLConnection.setFollowRedirects(true);
                    HttpURLConnection http = (HttpURLConnection)url.openConnection();
                    http.setRequestMethod("HEAD");
                    http.setConnectTimeout(5000);
                    http.setReadTimeout(8000);

                    int statusCode = http.getResponseCode();
                    if (statusCode == 200) {
                        ifSee403.wait(5000);
                        System.out.println("Hello, here we go again");
                    }
                    http.disconnect();
                    System.out.println(link + " " + statusCode);
                    writer.write(link + " " + statusCode);
                    writer.newLine();
                } catch (Exception e) {
                    writer.write(link + " " + e.getMessage());
                    writer.newLine();

                    System.out.println(link + " " +e.getMessage());
                }
            }
            try {
                writer.close();

            } catch (Exception e) {
                System.out.println(e.getMessage());
            }

            System.out.println("Finished.");

        } catch (Exception e) {
            System.out.println(e.getMessage());
        }



    }

    public static void main(String[] args) throws Exception {


        Timer myTimer = new Timer();

        TimerTask sendingRequest = new TimerTask() {
            public void run() {
                checkSites();
            }
        };
        myTimer.schedule(sendingRequest,0,150000);

    }
}

How can I solve this? Thanks.

Edit:

I've added http.disconnect(); to close the connection after checking the status code.

I also added:

if (statusCode == 200) {
    ifSee403.wait(5000);
    System.out.println("Test message");
}

But it didn't work. The program reported a "current thread is not owner" error. I need to fix this, change the 200 to 403, pause with ifSee403.wait(5000), and then check the status code again.
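
The "current thread is not owner" message is what Object.wait() reports (as an IllegalMonitorStateException) when it is called outside a synchronized block on that object. If the goal is only to pause before retrying, Thread.sleep is usually enough. Below is a minimal sketch of the "wait, then check again on 403" idea. The names RetryOn403, checkWithRetry, probe, maxAttempts and delayMillis are hypothetical and not part of the original code; the probe stands in for whatever call returns the HTTP status.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: retries a status check while the server answers 403.
final class RetryOn403 {

    /**
     * Calls probe until it returns something other than 403 or the
     * attempts run out, sleeping delayMillis between attempts.
     * Returns the last status code seen.
     */
    static int checkWithRetry(IntSupplier probe, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int status = probe.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 403; attempt++) {
            try {
                // Thread.sleep needs no monitor, unlike Object.wait().
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = probe.getAsInt();
        }
        return status;
    }
}
```

In the question's loop, the probe could be a lambda that opens the connection and returns http.getResponseCode(), with IOException handled inside the lambda. Retrying quickly may not help if the server is rate-limiting the client, so a longer delay may be needed.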

Recommended answer

One alternative to IP spoofing or anonymizing, by the way, would be to instead try obeying what the site's security code expects you to do. If you are going to write a scraper, and you know there is bot detection that doesn't like you hitting the site over and over while you debug your code, you should try the HTML download code that I posted as an answer to the last question you asked.

If you download the HTML and save it to a file (once an hour), and then write your HTML parsing / monitoring code against the contents of that saved file, you will likely be abiding by the security requirements of the web site and still be able to check availability.
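
One way to enforce the "once an hour" rule is to look at the saved file's age before contacting the server at all. The following is only a sketch under that assumption; the class SavedPageCache and its isStale methods are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: decides whether the saved copy is old enough to refresh.
final class SavedPageCache {

    /** True when more than maxAge has passed between lastSaved and now. */
    static boolean isStale(Instant lastSaved, Duration maxAge, Instant now) {
        return Duration.between(lastSaved, now).compareTo(maxAge) > 0;
    }

    /** Same check based on the file's last-modified time; a missing file counts as stale. */
    static boolean isStale(Path file, Duration maxAge, Instant now) {
        try {
            if (!Files.exists(file)) {
                return true;
            }
            return isStale(Files.getLastModifiedTime(file).toInstant(), maxAge, now);
        } catch (IOException e) {
            // Metadata could not be read: treat the copy as stale.
            return true;
        }
    }
}
```

A caller would download the page only when isStale(Paths.get("SavedSite.html"), Duration.ofHours(1), Instant.now()) returns true, and otherwise parse the file that is already on disk.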

If you wish to continue using JSoup, that API has an option for receiving HTML as a String. So if you use the HTML scrape code I posted and then write that HTML String to disk, you can feed it to JSoup as often as you like without setting off the bot detection security checks.

If you play by their rules once in a while, you can write your tester without much hassle.

import java.io.*;
import java.net.*;

...

// This line asks the "url" that you are trying to connect with for
// an instance of HttpURLConnection.  These two classes (URL and HttpURLConnection)
// are in the standard JDK Package java.net.*

HttpURLConnection con = (HttpURLConnection) url.openConnection();

// Tells the connection to use "GET" ... and to "pretend" that you are
// using a "Chrome" web-browser.  Note, the User-Agent sometimes means 
// something to the web-server, and sometimes is fully ignored.

con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", "Chrome/61.0.3163.100");

// The classes InputStream, InputStreamReader, and BufferedReader
// are all in the standard JDK package java.io.*

InputStream      is = con.getInputStream();
BufferedReader   br = new BufferedReader(new InputStreamReader(is));
StringBuilder    sb = new StringBuilder();
String           s;

// This reads each line from the web-server.
while ((s = br.readLine()) != null) sb.append(s).append("\n");
br.close();

// This writes the results from the web-server to a file
// It is using classes java.io.File and java.io.FileWriter

File outF = new File("SavedSite.html");
outF.createNewFile();
FileWriter fw = new FileWriter(outF);
fw.write(sb.toString());
fw.close();
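
As a more compact variant of the save step above, the response stream can be copied straight to disk with java.nio.file.Files.copy, which avoids decoding the bytes into a String first. This is a sketch rather than the answer's own method; PageSaver and save are hypothetical names, and the stream passed in is assumed to be the one returned by con.getInputStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: streams a response body to a file.
final class PageSaver {

    /**
     * Copies everything from in to target, replacing any existing file,
     * and closes the stream. Returns the number of bytes written.
     */
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, PageSaver.save(con.getInputStream(), Paths.get("SavedSite.html")) would write the same file as the code above, byte for byte as the server sent it.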

Again, this code is very basic and doesn't use any special JAR library code at all. The next method uses the JSoup library (which you explicitly asked for; I don't use it myself, but it is just fine). This is the method "parse", which will parse the HTML you have just saved. You may load this HTML from disk and send it to JSoup using:

Method documentation: org.jsoup.Jsoup.parse(File in, String charsetName, String baseUri)

If you wish to invoke JSoup, just pass it a java.io.File instance as follows:

File f = new File("SavedSite.html");
Document d = Jsoup.parse(f, "UTF-8", url.toString());

I do not think you need timers at all...

Again: this matters if you are making lots of calls to the server. The purpose of this answer is to show you how to save the server's response to a file on disk, so you don't have to make lots of calls, just one. If you restrict your calls to the server to once per hour, you will likely (though it is not guaranteed) avoid the 403 Forbidden bot-detection problem.
