Getting data from hidden HTML (popup) using BS4

Question

I am trying to scrape the name of a link in a popup on Wikipedia. When you hover over a link on Wikipedia, a popup shows a short snippet from the intro of the linked article. I need to scrape that information, but I am unsure where it is in the source. When I inspect the element (while the popup is showing), this is the HTML (for this example I am hovering over the link "Greek"):

<a dir="ltr" lang="en" class="mwe-popups-extract" href="/wiki/Ancient_Greek"> 
<p>The <b>Ancient Greek</b> language includes the forms of Greek...(a bunch more text)...</p></a> 

What I need to extract is the href, which is "/wiki/Ancient_Greek", but this piece of HTML disappears when I am not hovering over the link. Is there a way (with BS4 and Python) to extract this information from the source HTML I am scraping?

I can't afford to make additional calls to webpages because the project already takes a long time to run. If there is any way to change how I retrieve the source so that I can get the popup information, that would be helpful. This project is huge, and getting this popup information is crucial.

Any suggestions that don't require a complete rebuild of the project are greatly appreciated. I am using urllib (with requests) to pull the source and bs4 to scrape it.

Recommended answer

In your question you say that you "...can't afford to make additional calls to webpages...", but that is exactly what your browser is doing behind the scenes. The HTML of the page you are looking at doesn't contain the content you require.

To demonstrate this:

  1. In your browser, open a Wikipedia page, such as Greece.
  2. Bring up the Developer Tools window (Ctrl+Shift+I in Chrome).
  3. Click on the Network tab and make sure that the red record button is lit so that all web requests are logged.
  4. Hover over a link in the page, such as Ancient Greek.
  5. You will see that hovering over the link triggers a GET request to the Ancient_Greek summary page.
  6. Click on "Ancient_Greek" in the Network tab log to show the details of the request.
  7. Click on the Response tab on the right.
  8. You should see the JSON response, which contains a field called "extract_html" holding the content you require: "<p>The <b>Ancient Greek</b> language includes the forms...
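The JSON response seen in the Response tab can be unpacked roughly as follows. This is only a sketch: `sample_response` is a hand-written, abbreviated stand-in for the real payload (which carries more fields), and `popup_text` is a hypothetical helper name. Only the "extract_html" field comes from the steps above.

```python
import json

from bs4 import BeautifulSoup

# Hand-written, abbreviated sample shaped like the response described above.
# Only the "extract_html" field is taken from the answer; the text is left
# truncated ("...") exactly as it was quoted there.
sample_response = (
    '{"extract_html": '
    '"<p>The <b>Ancient Greek</b> language includes the forms...</p>"}'
)


def popup_text(response_body):
    """Return the popup snippet as plain text, or "" if the field is missing."""
    data = json.loads(response_body)
    fragment = data.get("extract_html", "")
    # The field is itself an HTML fragment, so BS4 can strip the tags.
    return BeautifulSoup(fragment, "html.parser").get_text()


print(popup_text(sample_response))
```

Because "extract_html" is an HTML fragment, the same BS4 calls already used for the main page apply to it unchanged.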

Therefore, to get the information you need, every time you encounter a link of the form <a href="/wiki/something"> you will have to make a GET request to https://en.wikipedia.org/api/rest_v1/page/summary/something
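A minimal sketch of that mapping, assuming the page HTML has already been downloaded. The names `summary_url`, `article_hrefs` and `fetch_extract_html` are hypothetical helpers, not part of any library. The percent-encoding of the title (so that a "/" inside a title does not split the path) and the skipping of non-200 responses are assumptions on my part rather than something stated in the answer.

```python
from urllib.parse import quote, unquote

import requests
from bs4 import BeautifulSoup

SUMMARY_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(href):
    """Map an href such as "/wiki/Ancient_Greek" to its summary URL.

    Returns None for anything that is not a /wiki/ link.
    """
    if not href.startswith("/wiki/"):
        return None
    # Drop a trailing "#section" anchor, if any.
    title = href[len("/wiki/"):].split("#", 1)[0]
    if not title:
        return None
    # Decode first so already-encoded titles are not encoded twice.
    return SUMMARY_ENDPOINT + quote(unquote(title), safe="")


def article_hrefs(html):
    """Collect distinct /wiki/ hrefs from a page, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if summary_url(href) and href not in found:
            found.append(href)
    return found


def fetch_extract_html(href, session):
    """Fetch the popup HTML for one link; None if unavailable."""
    url = summary_url(href)
    if url is None:
        return None
    response = session.get(url, timeout=10)
    if response.status_code != 200:
        return None
    return response.json().get("extract_html")


# Offline demonstration on a tiny hand-written page fragment.
page = '<p><a href="/wiki/Ancient_Greek">Greek</a> <a href="#top">top</a></p>'
for href in article_hrefs(page):
    print(href, "->", summary_url(href))
```

To actually retrieve a snippet, call `fetch_extract_html(href, session)` with a single `requests.Session()` shared across all links. Reusing one session keeps the connection open between requests, which should soften the cost of the extra calls, and collecting distinct hrefs first avoids requesting the same summary twice. Wikimedia asks API clients to send a descriptive User-Agent header, so it is worth setting one on the session before the first request.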
