Python web scraping: 503 response with a specific site (how come?)


Problem description

I am learning Python by scraping a few web sites to see what I can pick up. I noticed that Amazon.com returns a 503 response unless I pass a headers argument (with a User-Agent) to my session's get() call.

That does not work for readcomiconline.to, where I get a 503 response no matter what I try. I assume this has to do with its JavaScript preloader.

Is there any workaround?

import requests 
urlAmazon = 'http://amazon.com'
urlComics = 'http://readcomiconline.to'
headerAgent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
client = requests.session()

resultOne = client.get(urlAmazon)
print(resultOne) #<Response [503]>
resultOne = client.get(urlAmazon, headers=headerAgent)
print(resultOne) #<Response [200]>

resultTwo = client.get(urlComics)
print(resultTwo) #<Response [503]>
resultTwo = client.get(urlComics, headers=headerAgent)
print(resultTwo) #<Response [503]>
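
One way to test the JavaScript-preloader assumption is to look at what the 503 reply actually contains rather than only its status code. The sketch below is a rough diagnostic, not a fix: `describe_response`, `diagnose`, and the strings in `CHALLENGE_MARKERS` are hypothetical names and guesses introduced here, and the real page may use different wording.

```python
# Hypothetical marker strings (assumptions); adjust them to what the site returns.
CHALLENGE_MARKERS = ("challenge", "javascript", "checking your browser")


def describe_response(status_code, headers, body):
    """Summarize a response so a 503 can be inspected instead of guessed at."""
    lowered = body.lower()
    return {
        "status": status_code,
        # Which server software answered, if it says so.
        "server": headers.get("Server", ""),
        # A 503 may carry Retry-After when the server expects a later retry.
        "retry_after": headers.get("Retry-After"),
        # A body that ships script tags hints that a browser is expected.
        "has_script": "<script" in lowered,
        "markers": [m for m in CHALLENGE_MARKERS if m in lowered],
    }


def diagnose(url, user_agent):
    """Fetch url with requests and summarize the reply (needs network access)."""
    import requests  # imported lazily so describe_response works without it

    res = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return describe_response(res.status_code, res.headers, res.text)
```

Calling `diagnose(urlComics, headerAgent['User-Agent'])` with the names from the snippet above would print-friendly summarize the reply. If `has_script` is true and the body is only a short interstitial page, a plain `requests` call is unlikely to get past it, because `requests` does not execute JavaScript.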

I tried using Selenium and still get the 503 error. Is there any way around the JavaScript so I can do a proper web scrape?

import bs4, requests
from selenium import webdriver
from lxml import html

headerAgent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

res = requests.get('http://readcomiconline.to/Comic/Saga/Issue-1 &readType=1',headers=headerAgent)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "lxml")
comicElement = soup.find('table', {'class':'listing'})


Recommended answer

The best thing about Selenium is that it drives a real browser, so the page's own scripts run, and you can run JavaScript of your own in the page with execute_script('script'). For sites whose content is rendered by JavaScript, the best approach is to understand how the JavaScript renders that content. Trace the XHR requests (for example in the browser's network panel) and check the responses to see whether one of them returns the content you need.
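
As a minimal sketch of the Selenium route, the helper below loads a page and uses `execute_script` to poll until an element matching a CSS selector exists, then returns the rendered HTML. The names `fetch_rendered_html` and `scrape_with_chrome` are hypothetical, the `table.listing` selector is borrowed from the question's BeautifulSoup call, and whether the site lets an automated browser through at all is an assumption, not something this code guarantees.

```python
import time


def fetch_rendered_html(driver, url, css_selector, timeout=30.0, poll=0.5):
    """Load url and return page_source once css_selector matches an element.

    driver is any object with Selenium's get(), execute_script(), and
    page_source, so a real WebDriver works here.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # arguments[0] is the first extra value passed to execute_script.
        found = driver.execute_script(
            "return document.querySelector(arguments[0]) !== null;",
            css_selector,
        )
        if found:
            return driver.page_source
        time.sleep(poll)
    raise TimeoutError(f"{css_selector!r} did not appear within {timeout} s")


def scrape_with_chrome(url, css_selector):
    """Run the helper in Chrome (assumes Selenium and Chrome are installed)."""
    from selenium import webdriver  # imported lazily; only needed here

    driver = webdriver.Chrome()
    try:
        return fetch_rendered_html(driver, url, css_selector)
    finally:
        driver.quit()
```

The returned HTML can be handed to `bs4.BeautifulSoup(html, "lxml")` as in the question. Polling for the target element, rather than reading the page right after `get()`, matters here because a preloader page finishes loading before the real content arrives.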
