Dynamic Data Web Scraping with Python, BeautifulSoup

Question

I am trying to extract a number from the HTML of many pages. The number is different on each page. When I use soup.select('span[class="pull-right"]'), I expect to get the number, but only unrelated tags come back. I believe this is because the page uses JavaScript. In the HTML below, 180,476 is the value I want, and I need the equivalent value from many pages:

<div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">
            180,476
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            8,594
          </span>
        </div></div>

My code (this runs inside a loop over many pages):

import bs4
import requests

# wiki_page is the URL of the current page in the loop.
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)

Output:

[<span class="pull-right">\n<label class="logarithmic-scale">\n<input class="logarithmic-scale-option" type="checkbox"/>\n        Logarithmic scale      </label>\n</span>,
 <span class="pull-right">\n<label class="begin-at-zero">\n<input class="begin-at-zero-option" type="checkbox"/>\n        Begin at zero      </label>\n</span>,
 <span class="pull-right">\n<label class="show-labels">\n<input class="show-labels-option" type="checkbox"/>\n        Show values      </label>\n</span>]

Example URL: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi

I want the Pageviews number.

Recommended answer

The JavaScript on the page is not executed when you retrieve it with requests.get, so the response contains only the static HTML and the counts are never filled in. Use Selenium instead. It opens the page in a real browser, as a user would, so the JavaScript runs.

To get started with Selenium, install it with pip install selenium. Then retrieve your item with the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
               ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
for url, selector in wiki_pages:
    # Pass the URL string to get(), not the whole (url, selector) tuple.
    browser.get(url)
    # find_element_by_css_selector was removed in Selenium 4.3;
    # find_element(By.CSS_SELECTOR, ...) is the equivalent call.
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()
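
The loop above leaves two details open: the count is filled in by JavaScript some time after the page loads, and .text returns a display string such as "180,476" rather than a number. The following is a minimal sketch of one way to handle both. It is not part of the original answer: parse_count and fetch_count are hypothetical helper names, and the 10-second timeout is an arbitrary assumption. It relies on Selenium's WebDriverWait, which keeps polling until the element's text is non-empty or the timeout expires.

```python
def parse_count(text):
    """Turn a display string such as ' 180,476 ' into the integer 180476."""
    return int(text.strip().replace(",", ""))


def fetch_count(browser, url, selector, timeout=10):
    """Load url, wait until the element has text, and return it as an int.

    Hypothetical helper; timeout (in seconds) is an assumed default.
    """
    # Imported here so parse_count can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(url)
    # until() polls the callable until it returns a truthy value. A missing
    # element (NoSuchElementException) is ignored by default and retried,
    # and an empty string is falsy, so polling continues until text appears.
    text = WebDriverWait(browser, timeout).until(
        lambda drv: drv.find_element(By.CSS_SELECTOR, selector).text.strip()
    )
    return parse_count(text)


print(parse_count("\n            180,476\n          "))
```

Inside the loop, fetch_count(browser, url, selector) would replace the get and find_element calls. If the text never appears, WebDriverWait raises a TimeoutException instead of returning an empty value.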

NOTE: If you need to run a headless browser, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests. See 'How do I run Selenium in Xvfb?' for more information.
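
As an alternative to a virtual display, Firefox can be started without a window by passing it its own headless flag. The sketch below is not from the original answer. It assumes Selenium 4 with Firefox and a matching driver available on the machine, and make_headless_firefox is a hypothetical helper name.

```python
def make_headless_firefox():
    """Return a Firefox WebDriver that runs without opening a window."""
    # Imported inside the function so that defining it does not require
    # Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox's own headless switch
    return webdriver.Firefox(options=options)
```

Calling browser = make_headless_firefox() would then stand in for browser = webdriver.Firefox() in the answer's code. The rest of the loop stays the same.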
