Extracting HTML content from a search page using Beautiful Soup with Python


Question



I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info from all the accommodations in Spain. This is the search URL:

https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity

When I inspect an accommodation in the results page using the developer tools, this is the tag to search for:

<a class="hotel_name_link url" href="&#10;/hotel/es/aran-la-abuela.html?label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM;sid=1677838e3fc7c26577ea908d40ad5faf;ucfs=1;srpvid=b4980e34f6e50017;srepoch=1514167274;room1=A%2CA;hpos=1;hapos=1;dest_type=country;dest_id=197;srfid=198499756e07f93263596e1640823813c2ee4fe1X1;from=searchresults&#10;;highlight_room=#hotelTmpl" target="_blank" rel="noopener">
<span class="sr-hotel__name
" data-et-click="
customGoal:YPNdKNKNKZJUESUPTOdJDUFYQC:1
">
Hotel Spa Aran La Abuela
</span>
<span class="invisible_spoken">Opens in new window</span>
</a>

This is my Python code:

def init_BeautifulSoup():
    global page, soup
    page = requests.get("https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity")
    soup = BeautifulSoup(page.content, 'html.parser')


def get_spain_accomodations():
    global accomodations
    accomodations = soup.find_all(class_="hotel_name_link.url")

But when I run the code and print the accommodations variable, it outputs a pair of brackets ([]). Then I printed the soup object and realized that the parsed HTML is very different from the one I see in the developer tools in Chrome; that's why the soup object can't find the class "hotel_name_link url".

What's going on?

Solution

JavaScript modifies the page after it loads, so page.content gives you the HTML content of the page before JavaScript has run. I don't think you can use the requests module directly to scrape pages that rely on JavaScript. Have a look at this question.

If you decide to handle this using Selenium, check this question. After the page loads, you can use driver.page_source to get the page source after JavaScript has modified it, and pass that to BeautifulSoup.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    try:
        # Wait until JavaScript has rendered the element we're after
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
    except TimeoutException:
        print('Page timed out.')
        return None
    page = driver.page_source
    return page

def init_BeautifulSoup():
    global page, soup
    page = get_page('your-url')
    if page is None:
        return  # the page timed out; nothing to parse
    soup = BeautifulSoup(page, 'html.parser')

Not a fast or easy solution, but gets the job done.
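One more thing worth checking once the rendered HTML reaches BeautifulSoup: the class_ keyword matches a single class token, so the dotted string "hotel_name_link.url" never matches class="hotel_name_link url". A dotted multi-class selector belongs in select(). A minimal sketch on a trimmed stand-in for the anchor markup from the question:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one search-result anchor from the question's markup
html = ('<a class="hotel_name_link url" href="/hotel/es/aran-la-abuela.html">'
        '<span class="sr-hotel__name">Hotel Spa Aran La Abuela</span></a>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches one class token at a time, so a dotted string finds nothing
print(soup.find_all(class_="hotel_name_link.url"))  # []

# A CSS selector handles multiple classes on one element
links = soup.select("a.hotel_name_link.url")
print([a.get_text(strip=True) for a in links])  # ['Hotel Spa Aran La Abuela']
```

So even after the JavaScript problem is solved, find_all(class_="hotel_name_link.url") would still return [] on the rendered page.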

EDIT:

You'll need to change one thing here.

What WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1'))) does is make the driver wait explicitly until the element we specify is located on the webpage, or throw a TimeoutException after the delay you specify (I've used 10 seconds).

I've just provided you with an example. You need to find an element on the loaded page that is not present before the JavaScript executes, and replace it here: (By.TAG_NAME, 'h1')

You can do this by inspecting elements after the page is loaded and checking whether the element exists or not in the HTML code of the page source.
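As a rough sketch of that check, compare the raw HTML that requests returns with the rendered source from driver.page_source and look for a marker of the element; the two strings below are hypothetical stand-ins for those sources:

```python
# Hypothetical stand-ins: raw_html plays the role of requests' page.content,
# rendered_html the role of driver.page_source after JavaScript has run
raw_html = '<html><body><div id="search_results"></div></body></html>'
rendered_html = ('<html><body><div id="search_results">'
                 '<a class="hotel_name_link url">Hotel Spa Aran La Abuela</a>'
                 '</div></body></html>')

marker = 'hotel_name_link'
# If the marker exists only after rendering, it's a good element to wait for
print(marker in raw_html)       # False
print(marker in rendered_html)  # True
```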

Instead of By.TAG_NAME, you can use any of the following according to your requirement:

  • ID
  • NAME
  • CLASS_NAME
  • CSS_SELECTOR
  • XPATH

I'd recommend using ID or any of the name selectors instead of CSS or XPath, since the latter are a bit slower. ID works the fastest, while XPATH works the slowest.
