Dynamic Data Web Scraping with Python, BeautifulSoup

Problem description

I am trying to extract a piece of data (a number) from the HTML of many pages. The number is different on each page. When I use soup.select('span[class="pull-right"]') I expect to get the number, but I only get tags back. I believe this is because the page uses JavaScript. In the HTML below, 180,476 is the value I want, and I need the same field from many pages:

<div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">
            180,476
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            8,594
          </span>
        </div></div>

My code (this runs in a loop over many pages):

import bs4
import requests

# wiki_page is the URL for the current iteration of the loop.
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)

Output:

[<span class="pull-right">
<label class="logarithmic-scale">
<input class="logarithmic-scale-option" type="checkbox"/>
        Logarithmic scale
</label>
</span>, <span class="pull-right">
<label class="begin-at-zero">
<input class="begin-at-zero-option" type="checkbox"/>
        Begin at zero
</label>
</span>, <span class="pull-right">
<label class="show-labels">
<input class="show-labels-option" type="checkbox"/>
        Show values
</label>
</span>]

Example URL: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi

I want the Pageviews value.

Solution

The JavaScript code is not executed when you retrieve the page with requests.get, so the number is never written into the HTML you receive. Use Selenium instead. It opens the page in a real browser, the way a user would, so the JS code runs.
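To see why a plain HTTP fetch comes back without the number, here is a minimal sketch using only the standard library's html.parser, so it runs without extra packages. The markup in STATIC_HTML is made up for illustration (it is not the real page source), and PullRightText and span_texts are hypothetical names. It only shows the general point: a parser reads the markup as delivered and never runs the script that would fill in the value.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for what requests.get would return:
# the span is empty, and a script is supposed to fill it in later.
STATIC_HTML = """
<div class="linear-legend--counts">
  Pageviews:
  <span class="pull-right" id="count"></span>
</div>
<script>
  document.getElementById("count").textContent = "180,476";
</script>
"""


class PullRightText(HTMLParser):
    """Collect the text found inside <span class="pull-right"> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "pull-right":
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        # Script content also arrives here, but only as inert text
        # outside the span, so it is ignored.
        if self.inside:
            self.texts[-1] += data.strip()


def span_texts(html):
    """Return the text of every pull-right span in the given markup."""
    parser = PullRightText()
    parser.feed(html)
    parser.close()
    return parser.texts


print(span_texts(STATIC_HTML))  # prints ['']
```

The span is found, but its text is empty because nothing executed the script. BeautifulSoup is in the same position: it parses whatever text it is handed.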

To get started with Selenium, install it with pip install selenium. Then retrieve your item with the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
               ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
for url, selector in wiki_pages:
    # Load the page in the browser so its JavaScript runs.
    browser.get(url)
    # Locate the rendered count (Selenium 4 locator API).
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)
browser.quit()
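The text read from the element is a string such as 180,476, so it needs converting before you can total or compare counts across pages. A minimal sketch, assuming the counts always use commas as thousands separators as in the example above; parse_count is a hypothetical helper name, not part of Selenium or BeautifulSoup.

```python
def parse_count(text):
    """Convert a count such as '180,476' into an int."""
    # Drop surrounding whitespace and the thousands separators.
    cleaned = text.strip().replace(",", "")
    if not cleaned.isdecimal():
        raise ValueError(f"not a plain count: {text!r}")
    return int(cleaned)


print(parse_count("180,476"))  # prints 180476
print(parse_count("\n            8,594\n          "))  # prints 8594
```

Raising on unexpected text, rather than returning 0, makes it obvious when a page did not render the value you expected.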

NOTE: If you need to run the browser headless, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests; see 'How do I run Selenium in Xvfb?' for more information.
