HTML in browser doesn't correspond to scraped data in Python

Problem description

For a project I have to scrape data from several websites, and I'm having a problem with one of them.

When I look at the page source in the browser, the data I want is in a table, so it seems easy to scrape. But when I run my script, that part of the source doesn't appear in the output.

Here is my code. I tried different things: at first I sent no headers, then I added some, but it made no difference.

# import libraries
import requests
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'

# query the website, sending a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(quote_page, headers=headers)
print(response.text)

# parse the html using Beautiful Soup and store it in the variable `soup`
soup = BeautifulSoup(response.text, 'html.parser')

# save the parsed html to a text file
with open('allergene.txt', 'w', encoding='utf-8') as f:
    f.write(str(soup))

What I'm looking for on the website is the content that comes after "Herbacée", whose HTML looks like this:

<p class="level1">

      <img src="/static/img/state-0.png" alt="pas d'émission" class="state">

    Herbacee
  </p>

Do you have any idea what's wrong?

Thanks for your help, and happy new year :)

Recommended answer

This page uses JavaScript to render the table, so the table is not in the HTML that requests receives. The page that actually contains the table is:

http://www.alertepollens.org/gardens/garden/1/state/

You can find this URL in Chrome DevTools, under the Network tab.
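
Once the URL that serves the table is known, it can be requested directly instead of the outer page. The following is a minimal sketch, not a tested scraper for this site: it assumes the response from that URL contains `<p class="level1">` elements shaped like the snippet in the question, and the helper names `extract_states` and `fetch_states` are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup

# URL taken from the answer above; it may change or disappear over time.
TABLE_URL = 'http://www.alertepollens.org/gardens/garden/1/state/'


def extract_states(html):
    """Return (name, state) pairs for every <p class="level1"> element.

    Assumes each paragraph holds an <img class="state"> whose alt text
    describes the emission state, followed by the plant category name.
    """
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for p in soup.find_all('p', class_='level1'):
        img = p.find('img', class_='state')
        state = img.get('alt', '') if img else ''
        name = p.get_text(strip=True)
        rows.append((name, state))
    return rows


def fetch_states(url=TABLE_URL):
    """Download the page that really contains the table and parse it."""
    response = requests.get(
        url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10
    )
    response.raise_for_status()
    return extract_states(response.text)
```

Calling `fetch_states()` and looping over the result would give one pair per category. Keeping the parsing in `extract_states` separate from the download makes it easy to check the selector against a saved copy of the HTML before hitting the site.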
