Web scraping results in 403 Forbidden Error

Question

I'm trying to scrape the earnings for each company from SeekingAlpha using BeautifulSoup. However, the site seems to detect that a web scraper is being used: I get "HTTP Error 403: Forbidden".

The page I'm trying to scrape is: https://seekingalpha.com/symbol/AMAT/earnings

Does anyone know what can be done to bypass this?

Recommended answer

I was able to access the site contents by using a proxy taken from this list:

https://free-proxy-list.net/
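One detail worth keeping in mind when wiring such a proxy into requests: the library picks the proxy entry by the scheme of the target URL, so a mapping with only an 'http' key is not consulted for an https:// page. The sketch below is an assumption-laden illustration, not part of the original answer: build_proxies and fetch are hypothetical helper names, the address is simply the one quoted in this answer, and whether a given free proxy will tunnel HTTPS traffic is not guaranteed.

```python
def build_proxies(address):
    """Return a requests-style proxies mapping that covers both URL schemes."""
    # Add a scheme if the address is a bare host:port pair.
    proxy_url = address if "://" in address else "http://" + address
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, address, timeout=10):
    """Fetch url through the proxy; raises requests.HTTPError on a 403 or other error status."""
    import requests  # third-party; imported here so build_proxies works without it

    response = requests.get(url, proxies=build_proxies(address), timeout=timeout)
    response.raise_for_status()
    return response.text


print(build_proxies("50.207.31.221:80"))
# {'http': 'http://50.207.31.221:80', 'https': 'http://50.207.31.221:80'}
```

Calling raise_for_status() makes a blocked request fail loudly instead of handing an error page to the parser.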

Then pass the proxy to the requests module, and you can scrape the site:

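import re

import requests
from bs4 import BeautifulSoup as soup

# Free proxies are short-lived; this is the address used in the original answer.
# requests selects the proxy by URL scheme, so an 'https' entry is needed for this URL.
proxies = {'http': 'http://50.207.31.221:80', 'https': 'http://50.207.31.221:80'}
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', proxies=proxies).text
# Escape the dollar sign: an unescaped $ is an end-of-string anchor, so the pattern would never match.
revenues = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', r)
s = soup(r, 'lxml')
titles = [x.text for x in s.find_all('span', {'class': 'title-period'})]
epas = [x.text for x in s.find_all('span', {'class': 'eps'})]
# Collected in the original answer but not used in the final result.
deciding = [x.text for x in s.find_all('span', {'class': re.compile('green|red')})]
# epas is zipped twice, which is why the EPS text appears twice per row in the output.
results = list(map(list, zip(titles, epas, revenues, epas)))
print(results)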

Output:

[[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'], [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'], [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'], [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'], [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'], [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'], [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'], [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'], [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '], [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '], [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'], [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '], [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '], [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'], [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '], [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
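The revenue strings in this output can only be matched if the dollar sign in the regular expression is escaped, because a bare `$` is an end-of-string anchor in Python's re module. A minimal offline check, using a made-up sample string shaped like one row of the output (the sample text is an assumption, not data fetched from the site):

```python
import re

# Hypothetical sample text, shaped like the EPS and revenue strings in the output.
sample = "EPS of $0.93 beat by $0.02 | Revenue of $3.97B"

# Unescaped: $ anchors at the end of the string, so nothing can follow it.
unescaped = re.findall('Revenue of $[a-zA-Z0-9.]+', sample)
# Escaped: \$ matches a literal dollar sign.
escaped = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', sample)

print(unescaped)  # []
print(escaped)    # ['Revenue of $3.97B']
```

Using a raw string (the r prefix) keeps the backslash intact without needing to double it.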