BeautifulSoup: scrape HTML only accessible after clicking "Accept" in the same URL

Question

I'm trying to scrape some information from a certain URL. Let's call it: www.foo.bar/baz

When you access that URL with a web browser, the usual "I'm older than 18" button appears. The URL doesn't change and the real content is only loaded when you manually click said button.

I would like to simulate a click on the "I'm older than 18" button so that I can access the information I actually want to scrape.

This is the HTML code of the button that should be clicked:

<div align=center>
    <a href="javascript:showContent()"><span>ENTRAR</span></a>
</div>

And this is the JavaScript function that gets called by the href attribute:

<script type="text/javascript"><!--
function showContent() {
    document.getElementById('all-content').style.display = '';
    document.getElementById('adultmessage').style.display = 'none';
    document.cookie = 'adult=yes; path=/';
}
function hideAdultContent(){
    document.getElementById('all-content').style.display = 'none';
}
// --></script>

I would appreciate any tips on what to research in order to do this.

Recommended answer

You cannot interact with JavaScript using BeautifulSoup. You can instead use selenium to click the element, pairing it with PhantomJS for headless browsing:

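from selenium import webdriver

# PhantomJS provides the headless browser
dr = webdriver.PhantomJS()
# Selenium expects a full URL, including the scheme
dr.get("http://www.foo.bar/baz")
# Locate the link by its href attribute and click it
dr.find_element_by_xpath("//a[@href='javascript:showContent()']").click()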

The XPath expression finds the element, and the simulated click should then give you the content you need.
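Recent Selenium 4 releases no longer include the PhantomJS driver or the find_element_by_* helpers, so the snippet above may not run on a current install. A rough equivalent is sketched here, assuming Selenium 4 and a local Chrome installation; link_xpath and page_source_after_click are hypothetical helper names, not part of any library:

```python
def link_xpath(js_call: str) -> str:
    """Build an XPath matching an <a> whose href invokes the given JavaScript call."""
    return f"//a[@href='javascript:{js_call}']"


def page_source_after_click(url: str, js_call: str = "showContent()") -> str:
    """Open the URL in headless Chrome, click the link, and return the HTML."""
    # Imported here so link_xpath() can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the URL must include its scheme, e.g. http://
        driver.find_element(By.XPATH, link_xpath(js_call)).click()
        return driver.page_source
    finally:
        # Always close the browser, even if the element is not found
        driver.quit()
```

Calling page_source_after_click("http://www.foo.bar/baz") would return the HTML as it stands after the click; the try/finally makes sure the browser process does not linger if the lookup fails.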

I presume the real site is in Spanish, so the href is actually javascript:muestradulto():

dr.find_element_by_xpath("//a[@href='javascript:muestradulto()']").click()

Once the link has been clicked, print(dr.page_source) shows that you have reached the next page, which has EL BUSCANUNCIOS near the top:

In [1]: url = "http://www.pasion.com/amistad/"

In [2]: from selenium import webdriver

In [3]: dr = webdriver.PhantomJS()

In [4]: dr.get(url)

In [5]: dr.find_element_by_xpath("//a[@href='javascript:muestradulto()']").click()

In [6]: print("EL BUSCANUNCIOS" in dr.page_source)
True

If you prefer to use bs4, you can pass the source to BeautifulSoup and work on that, but selenium also lets you use XPath alongside CSS selectors, which you might find more useful.
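As a minimal sketch of handing the source to bs4: the string below is a made-up stand-in for dr.page_source, modelled loosely on the ids from the question (all-content, adultmessage) and a class name from the site's markup; it is not the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for dr.page_source; this markup is invented for illustration
page_source = """
<div id="adultmessage"><a href="javascript:showContent()"><span>ENTRAR</span></a></div>
<div id="all-content" style="display: none">
  <div class="x4">Amistad en Barcelona</div>
  <div class="x4">Amistad en Madrid</div>
</div>
"""

soup = BeautifulSoup(page_source, "html.parser")

# CSS selectors work on the parsed tree whether or not a browser would display the element
titles = [div.get_text(strip=True) for div in soup.select("#all-content div.x4")]
print(titles)
```

Because BeautifulSoup only parses markup and never applies styles, the display: none on the container has no effect on what select() returns.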

Actually, if you look at what the page returns, you already get the full source using requests alone; the link only needs to be clicked when viewing the page in a browser:

In [13]: from requests import get

In [14]: from bs4 import BeautifulSoup

In [15]: soup = BeautifulSoup(get(url).content, "html.parser")

In [16]: print(soup.select("#cuerpo div[class^=x]")[:2])
[<div class="x1"><div class="x2">\n<div class="x3"></div>\n<div class="x4">Amistad en Barcelona  i  rodalies  (BARCELONA)</div>\n<div class="x5">r508491244 </div>\n<div class="x6" style="font-size:8px"><a href="/creditos/auto-renueva.php" style="color:#ee0000">AUTO\xb7RENUEVA</a></div>\n</div>\n<div class="x9"><a class="cti" href="para-mujer-busque-amistad-508491244.htm" target="_blank">PARA MUJER BUSQUE AMISTAD</a><br/><div class="tx"> Deseo coincidir con una mujer que busque una relaci\xf3n de amistad continuada con un hombre maduro,  tranquilo,  educado,  cari\xf1oso y de trato f\xe1cil.  No tengo pareja y ahora no la busco.  Busco una amiga para pasear,  hablar,  echar unas risas,  caf\xe9s,  cines,  conciertos,  etc.  No me importa para nada la talla de suje ni de pantal\xf3n que usas,  ni tu edad,  ni tampoco si tienes eso que ahora se llaman cargas.  Soy un tipo normal y busco lo mismo.  Si necesitas algo m\xe1s,  tambi\xe9n lo podemos hablar.  Con afecto.  Dani. Edad 54 a\xf1os</div><br/> <div class="x11">\n</div>\n</div>\n<div class="x10" id="ph508491244" style="width: auto">\n</div></div>, <div class="x2">\n<div class="x3"></div>\n<div class="x4">Amistad en Barcelona  i  rodalies  (BARCELONA)</div>\n<div class="x5">r508491244 </div>\n<div class="x6" style="font-size:8px"><a href="/creditos/auto-renueva.php" style="color:#ee0000">AUTO\xb7RENUEVA</a></div>\n</div>]

So you don't actually need to worry about clicking anything.
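Clicking is unnecessary here because the click only toggles CSS visibility and sets a cookie in the browser. If a similar site did check that cookie on the server (an assumption; nothing above says this one does), the cookie from the question's showContent() function could be sent directly with the request. A minimal sketch using only the standard library, where build_request and fetch_html are hypothetical helper names:

```python
from urllib.request import Request, urlopen


def build_request(url: str) -> Request:
    """Build a GET request carrying the cookie that showContent() sets."""
    # 'adult=yes' is the name/value pair from the question's JavaScript
    return Request(url, headers={"Cookie": "adult=yes"})


def fetch_html(url: str) -> str:
    """Fetch the page with the cookie attached (performs a network call)."""
    with urlopen(build_request(url)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The string returned by fetch_html can be passed to BeautifulSoup in the same way as the requests response above.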
