Web Scraping with Python: problem with BeautifulSoup


Question

Please help me use BeautifulSoup with Python 3 to scrape financial values from investing.com. Whatever I do, I never get any value, and the class I filter on keeps changing on the page because it is a live value.

import requests
from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"
precio_objetivo = input("Introduce el PRECIO del disparador:")
precio_objetivo = float(precio_objetivo)
print(precio_objetivo)

while True:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    precio_actual = soup.find('span', attrs={'class': 'arial_26 inlineblock pid-8828-last', 'id': 'last_last', 'dir': 'ltr'})
    print(precio_actual)
    break

When I don't apply any filter in soup.find (trying at least to get the whole web page), I get this result:

<bound method Tag.find_all of 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html>
<head>
<title>403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.                                </title>
</head>
<body>
<h1>Error 403 You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</h1>
<p>You are banned from this site.  Please contact via a different client configuration if you believe that this is a mistake.</p>
<h3>Guru Meditation:</h3>
<p>XID: 850285196</p>
<hr/>
<p>Varnish cache server</p>
</body>
</html>

Recommended answer

The web server detects the Python script as a bot and therefore blocks it (hence the 403 response). Sending a browser-like User-Agent header with the request can prevent this, and the following code does it:

import requests
from bs4 import BeautifulSoup

url = "https://es.investing.com/indices/spain-35-futures"

# Send a browser-like User-Agent so the server does not reject the request as a bot
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page = requests.get(url, headers=header)

soup = BeautifulSoup(page.content, 'html.parser')

# The price sits in an element such as:
# <span class="arial_26 inlineblock pid-8828-last" dir="ltr" id="last_last">9.182,5</span>
# Filtering on the id alone is enough; get_text() extracts the element's text
result = soup.find('span', attrs={'id': 'last_last'}).get_text()

print(result)
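
The text returned by get_text() uses Spanish number formatting (for example 9.182,5), so passing it straight to float() raises a ValueError. To compare it with the trigger price from the question, it first has to be converted. The following is a minimal sketch of one way to do that. The helper names parse_price and trigger_reached are hypothetical and not part of the original answer, and the "greater than or equal" trigger rule is an assumption, since the question does not say in which direction the trigger should fire.

```python
def parse_price(text: str) -> float:
    """Convert a Spanish-formatted number such as '9.182,5' to a float.

    Assumes '.' is the thousands separator and ',' the decimal separator,
    as in the example value shown in the answer above.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


def trigger_reached(current: float, target: float) -> bool:
    """Hypothetical trigger rule: fire once the price is at or above the target."""
    return current >= target


# Example with the value shown in the answer's comment
print(parse_price("9.182,5"))  # 9182.5
```

In the answer's script this could be used as precio_actual = parse_price(result), followed by trigger_reached(precio_actual, precio_objetivo).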
