Is there a way to scrape a website using cheerio if the image I want to scrape is protected by Cloudflare and returns a 1020 error?

Question

I am trying to create a manga scraping website as a personal project. Just as I finished the whole site, I found out that the images can't be scraped or displayed by my website, and when I go directly to an image's link I get a 1020 "access denied" error. Is there any way to get past that error without obtaining an authorization token from the website owner?

If the answer is no, can anyone explain how Cloudflare protects the image from scraping? As far as I know, everything that is served to the front end can be scraped.

Edit: Here is one of the images that I want to scrape, but when I open it in a browser it gives a 1020 access denied error.

Recommended answer

With that website, in order to download an image like this one, you need this header on the HTTP request:

Referer: "https://mangakakalot.com/"

Add that header and the request successfully returns the desired image. Remove that header and you get an error (403 in this case).

Here's a simple test app:

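// Works with got v11 or earlier; got v12+ is ESM-only and must be loaded with import.
const got = require('got');

// Direct URL of the image to download.
const url = "https://s61.mkklcdnv61.com/mangakakalot/u1/uh918990/chapter_0_prologue/1.jpg";

// The image host rejects requests that lack a Referer pointing at the site.
const options = {
    headers: {
        Referer: "https://mangakakalot.com/",
    },
};

got(url, options).then(result => {
    // Expect 200 when the Referer header is accepted.
    console.log(result.statusCode);
}).catch(err => {
    // Without the header, got rejects with an HTTPError (403 in this case).
    console.log(err);
});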

FYI, if you're wondering how I figured this out: I went to the web page that contains this image, opened the Network tab of the Chrome debugger, and found the request where the browser downloaded this particular image. I then looked at exactly which headers were on that request. I added two easy ones (Referer and User-Agent) to my own request to mimic the browser more accurately, which changed the response from a 403 to a 200. Then I experimented to see whether I could remove either of these headers, and it worked with only the Referer header. (Note that the HTTP header name is spelled Referer, with a single "r" in the middle.)
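The trial-and-error described above can be automated. The following is only a rough sketch, not the method used in the answer: it tries every subset of a set of candidate headers, smallest first, and reports the first subset that yields a 200. It uses Node's built-in https module instead of got so that it has no dependencies. The names headerSubsets, fetchStatus, and findMinimalHeaders are hypothetical helpers introduced for illustration, and the User-Agent value in the usage note is a placeholder.

```javascript
const https = require('https');

// Return every subset of the given headers object, smallest subsets first.
// (Hypothetical helper, not part of any library.)
function headerSubsets(headers) {
    const names = Object.keys(headers);
    const subsets = [];
    for (let mask = 0; mask < (1 << names.length); mask++) {
        const subset = {};
        names.forEach((name, i) => {
            if (mask & (1 << i)) subset[name] = headers[name];
        });
        subsets.push(subset);
    }
    return subsets.sort((a, b) => Object.keys(a).length - Object.keys(b).length);
}

// Resolve with the HTTP status code of a GET request sent with the given headers.
function fetchStatus(url, headers) {
    return new Promise((resolve, reject) => {
        const req = https.get(url, { headers }, res => {
            res.resume(); // discard the body; only the status matters here
            resolve(res.statusCode);
        });
        req.on('error', reject);
    });
}

// Try each subset in order and return the first one that gets a 200, or null.
// The probe parameter is an assumption added so the search can be tested offline.
async function findMinimalHeaders(url, candidates, probe = fetchStatus) {
    for (const subset of headerSubsets(candidates)) {
        const status = await probe(url, subset);
        if (status === 200) return subset;
    }
    return null;
}
```

Calling findMinimalHeaders(url, { Referer: "https://mangakakalot.com/", "User-Agent": "Mozilla/5.0" }) against the image URL from the test app should, if the server still behaves as described in the answer, resolve to an object containing only the Referer header. The number of requests grows exponentially with the number of candidate headers, so this is only practical for a handful of them.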

I'm guessing that the difference between the 403 error here and the 1020 error you saw when you went directly to that link in the browser probably has to do with the version of HTTP being used (the browser being more advanced than my Node.js script). But the point is that you can now download the image with the above script.
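To actually keep the image, the response body has to be written to disk. Below is a minimal sketch under a few assumptions: it uses Node's built-in https and fs modules rather than got, it assumes the required Referer is simply the origin of the page that embeds the image (true for the example above, not guaranteed for other sites), and refererFor, fileNameFor, and downloadImage are hypothetical names introduced for illustration.

```javascript
const fs = require('fs');
const https = require('https');
const path = require('path');

// Build a Referer value from the page that embeds the image.
// Assumption: the site only checks that the Referer is its own origin.
function refererFor(pageUrl) {
    return new URL(pageUrl).origin + '/';
}

// Use the last path segment of the image URL as the local file name.
function fileNameFor(imageUrl) {
    return path.posix.basename(new URL(imageUrl).pathname);
}

// Download imageUrl into destDir, sending the Referer derived from pageUrl.
// Resolves with the path of the written file.
function downloadImage(imageUrl, pageUrl, destDir = '.') {
    return new Promise((resolve, reject) => {
        const headers = { Referer: refererFor(pageUrl) };
        https.get(imageUrl, { headers }, res => {
            if (res.statusCode !== 200) {
                res.resume(); // drain the error body so the socket is released
                reject(new Error(`Unexpected status ${res.statusCode}`));
                return;
            }
            const dest = path.join(destDir, fileNameFor(imageUrl));
            const file = fs.createWriteStream(dest);
            res.pipe(file);
            file.on('finish', () => resolve(dest));
            file.on('error', reject);
        }).on('error', reject);
    });
}
```

For the image in the test app, downloadImage(url, "https://mangakakalot.com/") would write 1.jpg to the current directory if the server accepts the request. In a cheerio-based scraper, the image URLs would come from the src attributes of the parsed page and the page's own URL would be passed as pageUrl.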
