Extracting HTML content from a search page using Beautiful Soup with Python
Problem description
I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain information for all the accommodations in Spain. This is the search URL:

https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity

When I inspect an accommodation in the results page using the developer tools, it says that this is the tag to search for:

<a class="hotel_name_link url" href="/hotel/es/aran-la-abuela.html?label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM;sid=1677838e3fc7c26577ea908d40ad5faf;ucfs=1;srpvid=b4980e34f6e50017;srepoch=1514167274;room1=A%2CA;hpos=1;hapos=1;dest_type=country;dest_id=197;srfid=198499756e07f93263596e1640823813c2ee4fe1X1;from=searchresults;highlight_room=#hotelTmpl" target="_blank" rel="noopener">
  <span class="sr-hotel__name" data-et-click="customGoal:YPNdKNKNKZJUESUPTOdJDUFYQC:1">Hotel Spa Aran La Abuela</span>
  <span class="invisible_spoken">Opens in new window</span>
</a>
This is my Python code:
import requests
from bs4 import BeautifulSoup

def init_BeautifulSoup():
    global page, soup
    page = requests.get("https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity")
    soup = BeautifulSoup(page.content, 'html.parser')

def get_spain_accomodations():
    global accomodations
    accomodations = soup.find_all(class_="hotel_name_link.url")
But when I run the code and print the accomodations variable, it outputs a pair of brackets ([]). Then I printed the soup object and realized that the parsed HTML is very different from the one I see in the developer tools in Chrome; that's why the soup object can't find the class "hotel_name_link.url".

What's going on?
JavaScript is modifying the page after it loads. So, when you use page.content, it gives you the HTML content of the page before JavaScript modifies it. I don't think you can use the requests module directly for scraping pages that rely on JavaScript. Have a look at this question.

If you decide to handle this using Selenium, check this question. After the page loads, you can use driver.page_source to get the page source after JavaScript modifies it and pass it to BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
    except TimeoutException:
        print('Page timed out.')
        return None
    page = driver.page_source
    return page

def init_BeautifulSoup():
    global page, soup
    page = get_page('your-url')
    # handle the case where page may be None
    soup = BeautifulSoup(page, 'html.parser')
Not a fast or easy solution, but it gets the job done.
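Separately from the JavaScript issue, the class filter from the question can still return nothing even on the rendered HTML. Here is a minimal, self-contained sketch of why, run against a hand-written copy of the anchor markup from the question rather than the live page (it only assumes bs4 is installed):

```python
from bs4 import BeautifulSoup

# A small snippet mimicking the anchor tag from the rendered search page.
html = '''
<a class="hotel_name_link url" href="/hotel/es/aran-la-abuela.html">
  <span class="sr-hotel__name">Hotel Spa Aran La Abuela</span>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_="hotel_name_link.url" looks for the literal class string
# "hotel_name_link.url"; the tag's class attribute is "hotel_name_link url"
# (two classes separated by a space), so nothing matches.
print(soup.find_all(class_="hotel_name_link.url"))  # []

# A CSS selector treats the dot as "has both classes" and does match:
links = soup.select("a.hotel_name_link.url")
print(links[0].select_one(".sr-hotel__name").get_text(strip=True))
```

With find_all, a string passed to class_ matches either a single class name (class_="hotel_name_link") or the exact class attribute string; "hotel_name_link.url" is neither, so soup.select("a.hotel_name_link.url") is the usual way to require both classes.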
EDIT: You'll need to change one thing here.
What WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1'))) does is make the driver wait explicitly until the element we specify is located on the webpage, or throw a TimeoutException after the delay you specify (I've used 10 seconds).

(By.TAG_NAME, 'h1') is just the example I've provided. You need to find an element on the loaded page that is not present before the JavaScript executes and substitute it here. You can do this by inspecting elements after the page is loaded and checking whether the element exists in the HTML of the page source.

Instead of By.TAG_NAME, you can use any of the other locator strategies according to your requirement: ID works the fastest, while XPATH works the slowest. I'd recommend using the id or name selectors instead of css or xpath, since the latter are a bit slower.