Dynamic Data Web Scraping with Python, BeautifulSoup
Problem description
I am trying to extract a number from the HTML of many pages. The number is different on each page. When I use soup.select('span[class="pull-right"]') I expect to get the number, but only unrelated tags come back. I believe this is because the page uses JavaScript. In the HTML below, 180,476 is the value I want, and I need the value at the same position on many pages:
<div class="legend-block--body">
<div class="linear-legend--counts">
Pageviews:
<span class="pull-right">
180,476
</span>
</div>
<div class="linear-legend--counts">
Daily average:
<span class="pull-right">
8,594
</span>
</div></div>
My code (this runs in a loop over many pages):
import bs4
import requests

# wiki_page is the URL of the current page in the loop
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)
Output:
[<span class="pull-right">
<label class="logarithmic-scale">
<input class="logarithmic-scale-option" type="checkbox"/>
Logarithmic scale
</label>
</span>, <span class="pull-right">
<label class="begin-at-zero">
<input class="begin-at-zero-option" type="checkbox"/>
Begin at zero
</label>
</span>, <span class="pull-right">
<label class="show-labels">
<input class="show-labels-option" type="checkbox"/>
Show values
</label>
</span>]
I want the Pageviews value (180,476 in the HTML above).
The page's JavaScript is not executed when you retrieve the page with requests.get, so the counts are never filled into the HTML that BeautifulSoup parses. Use Selenium instead: it opens the page in a real browser, the way a user would, so the JavaScript runs.
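As a rough illustration of that difference, the sketch below runs the same "text of every span.pull-right" lookup over two hand-written HTML fragments: one abbreviated from the output in the question (what requests.get sees) and one taken from the rendered markup. It uses only the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The names PullRightCollector and pull_right_texts and both fragments are assumptions made for this example, not part of the original code.

```python
from html.parser import HTMLParser


class PullRightCollector(HTMLParser):
    """Collect the text of every <span class="pull-right"> element."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0  # <span> nesting depth inside a match; 0 = outside
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.span_depth:
            self.span_depth += 1  # a span nested inside a matching span
        elif "pull-right" in (dict(attrs).get("class") or "").split():
            self.span_depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth:
            self.span_depth -= 1

    def handle_data(self, data):
        if self.span_depth:
            self.texts[-1] += data


def pull_right_texts(html):
    """Return the whitespace-normalised text of each span.pull-right."""
    parser = PullRightCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.texts]


# Abbreviated from the question's output: what the raw response contains.
STATIC_HTML = (
    '<span class="pull-right"><label class="logarithmic-scale">'
    '<input class="logarithmic-scale-option" type="checkbox"/>'
    " Logarithmic scale</label></span>"
)

# Abbreviated from the question's target markup: present only after the
# browser has run the page's JavaScript.
RENDERED_HTML = (
    '<div class="linear-legend--counts">Pageviews:'
    '<span class="pull-right">180,476</span></div>'
)

print(pull_right_texts(STATIC_HTML))    # ['Logarithmic scale']
print(pull_right_texts(RENDERED_HTML))  # ['180,476']
```

The selector is the same in both calls; only the HTML it is applied to differs, which is why changing the selector alone does not help here.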
To get started with Selenium, install it with pip install selenium. Then use the code below to retrieve the element:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()

# List of (page URL, CSS selector of the element to retrieve) pairs.
wiki_pages = [
    ("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
     ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),
]

for wiki_page in wiki_pages:
    url = wiki_page[0]
    selector = wiki_page[1]
    browser.get(url)  # pass the URL, not the whole (url, selector) tuple
    # Selenium 4 spelling; older releases used find_element_by_css_selector(selector)
    page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
    print(page_views_count.text)

browser.quit()
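One caveat with JavaScript-rendered pages is that the element may not be filled in the instant browser.get returns. Selenium provides WebDriverWait for this; the sketch below hand-rolls the same polling idea so that it can be read and tested without a browser, and it also converts the rendered text to an integer. The names parse_count and wait_for_count, the timeout defaults, and the assumption that the driver exposes the Selenium 4 find_elements(by, value) method are all illustrative.

```python
import time

# String value of selenium.webdriver.common.by.By.CSS_SELECTOR, spelled out
# here so the sketch has no third-party imports.
CSS_SELECTOR = "css selector"


def parse_count(text):
    """Turn a rendered count such as '180,476' into the integer 180476."""
    return int(text.strip().replace(",", ""))


def wait_for_count(browser, selector, timeout=10.0, poll=0.5):
    """Poll until the first element matching selector holds a number.

    browser is assumed to be a Selenium 4 style driver, i.e. any object with
    a find_elements(by, value) method whose results expose a .text attribute.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = browser.find_elements(CSS_SELECTOR, selector)
        text = elements[0].text.strip() if elements else ""
        if text:
            try:
                return parse_count(text)
            except ValueError:
                pass  # placeholder text that is not a number yet; keep polling
        if time.monotonic() >= deadline:
            raise TimeoutError("no numeric text found for %r" % selector)
        time.sleep(poll)
```

In the loop above, a call such as wait_for_count(browser, selector) could stand in for the direct element lookup, returning an integer rather than a string such as "180,476".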
NOTE: If you need to run the browser headless, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests; see 'How do I run Selenium in Xvfb?' for more information.