BeautifulSoup Scraping: loading div instead of the content


Question

Noob here.
I'm trying to scrape the search results from this website: http://www.mastersportal.eu/search/?q=di-4|lv-master&order=relevance

I'm using Python's BeautifulSoup:

import csv
import requests
# BeautifulSoup 3 import; with the bs4 package this would be: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

for numb in ('0', '69'):
    # Build the URL of one page of the filtered search
    url = ('http://www.mastersportal.eu/search/?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1&start=' + numb + '0&order=tuition_eea&direction=asc')
    response = requests.get(url)
    html = response.content

    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'id': 'StudySearchResults'})

    # Collect the text of every <h3> inside the results container
    lista = []
    for i in table.findAll('h3'):
        lista.append(i.string)
print(table.prettify())

I want to get clean data with the basic information about each Master's programme (for now just the name). The URL I'm using here is for a filtered search on the website, and the loop that steps through the result pages should be fine.

However, the result is:

<div id="StudySearchResults">
  <div style="display:none" id="TrackingSearchValue" class="TrackingSearchValue" data-search=""></div>
  <div style="display:none" id="SearchViewEvent" class="TrackingEvent TrackingNoLocation" data-type="srch" data-action="view" data-id=""></div>
  <div id="StudySearchResultsStudies" class="TrackingLinkedList" data-start="" data-list-type="study" data-type="rslts">
    <!-- Wait pane, just here to make sure there is no white page -->
    <div id="WaitPane" class="WaitPane">
      <img src="http://www.mastersportal.eu/Modules/Results/Resources/Throbber.gif" />
      <span>Loading search results...</span>
    </div>
  </div>
</div>

Why is only the loading div displayed and not the content? From reading around, I suspect it has something to do with the way the website handles data with JavaScript. Does something like an AJAX request exist for Python? (Or is there any other way to tell the scraper to wait for the page to load?)

Recommended answer

You have basically answered your own question. Beautiful Soup only parses whatever HTML the server returns for a specific URL; it does not run the page's JavaScript, so content that the page loads afterwards never appears in the soup.
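To check whether the data is in the raw response at all, it can help to look for the expected elements in the HTML before any JavaScript has run. The sketch below is only an illustration: it uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra packages, `h3_titles` and `H3Collector` are hypothetical helper names, `static_html` is a trimmed copy of the output from the question, and `rendered_html` is invented markup standing in for what a browser might show after loading.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collects the text content of every <h3> element that is fed in."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.titles[-1] += data


def h3_titles(html):
    """Return the stripped text of all <h3> elements found in html."""
    parser = H3Collector()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


# Trimmed copy of the markup from the question: only the placeholder is present.
static_html = """
<div id="StudySearchResults">
  <div id="WaitPane" class="WaitPane">
    <span>Loading search results...</span>
  </div>
</div>
"""

# Invented markup, standing in for the page after its JavaScript has run.
rendered_html = """
<div id="StudySearchResults">
  <h3>Example Master A</h3>
  <h3>Example Master B</h3>
</div>
"""

print(h3_titles(static_html))    # []
print(h3_titles(rendered_html))  # ['Example Master A', 'Example Master B']
```

An empty list for the raw response means the headings are simply not in what the server sent, so no amount of parsing will find them.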

If you want to render the page as it is shown in a browser, you will need to use something like Selenium Webdriver, which starts up an actual browser and controls it remotely.

Webdriver is very powerful, but it also has a much steeper learning curve than pure web scraping.

If you want to get into using Webdriver with Python, the official documentation (http://selenium-python.readthedocs.org/getting-started.html) is a good place to start.
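As a rough idea of what this could look like for the search page in the question, here is a minimal sketch rather than a tested solution. It assumes Selenium 4 with Firefox available, and it assumes the results show up as `h3` elements inside `#StudySearchResults`, which is the asker's guess and not something confirmed by the site. `build_search_url` and `fetch_result_titles` are hypothetical helper names.

```python
# Assumption taken from the question: result names are <h3> elements
# inside the div with id "StudySearchResults".
RESULT_SELECTOR = "#StudySearchResults h3"


def build_search_url(start):
    """Return the filtered search URL from the question for a result offset."""
    return (
        "http://www.mastersportal.eu/search/"
        "?q=ci-30,11,10,3,4,8,9,14,15,16,17,34,1,19|di-4|lv-master|rv-1"
        f"&start={start}&order=tuition_eea&direction=asc"
    )


def fetch_result_titles(url, timeout=15):
    """Open url in a real browser, wait for results, and return their text."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumes Firefox is available locally
    try:
        driver.get(url)
        # Block until at least one result heading exists, or raise
        # TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, RESULT_SELECTOR))
        )
        elements = driver.find_elements(By.CSS_SELECTOR, RESULT_SELECTOR)
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_result_titles(build_search_url(0))` would open the first results page and return the heading texts once they have loaded. The explicit wait is what answers the "wait for the page to load" part of the question: the call returns as soon as the selector matches instead of sleeping for a fixed time.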
