Unable to get actual Markup from a page with BeautifulSoup


Problem description

I am trying to scrape this URL with a combination of BeautifulSoup and Selenium:

http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs?format=embeddedhtml&page=2&scrollToTop=true

I have tried this code:

active_review_page_html = browser.page_source
# strip the backslashes that escape the embedded markup
active_review_page_html = active_review_page_html.replace('\\', "")
hotel_page_soup = BeautifulSoup(active_review_page_html, "html.parser")
print(hotel_page_soup)

But it returns data like this:

;&lt;span class="BVRRReviewText"&gt;Hotel accommodations and staff were fine ....

But I need to scrape that span from the page with:

for review_div in hotel_page_soup.select("span .BVRRReviewText"):

How can I get the real markup from that URL?

Solution

First of all, you gave us the wrong link. Instead of the actual page you are trying to scrape, you linked to a JS file that takes part in loading the page, which would be an unnecessary challenge to parse.

Secondly, you don't need BeautifulSoup in this case. Selenium itself is good at locating elements and extracting their text or attributes, so there is no need for an extra step here.

Here's a working example using the actual page with reviews you want to get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=115&language=en_US')

# wait for the reviews to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.BVRRReviewText")))

# get reviews
for review_div in driver.find_elements(By.CSS_SELECTOR, "span.BVRRReviewText"):
    print(review_div.text)
    print("---")

driver.close()

Prints:

This is not a low budget hotel . Yet the hotel offers no amenities. Nothing and no WiFi. In fact, you block the wifi that comes with my celluar plan. I am a part of 2 groups that are loyal to the Sheraton, Alabama A&M and the 9th Episcopal District AMEChurch but the Sheraton is not loyal to us.
---
We are a company that had (5) guest rooms at the hotel. Despite having a credit card on file for room and tax charges, my guest was charged the entire amount to her personal credit card. It has taken me (5) PHONE CALLS and my own time and energy to get this bill reversed. I guess leaving a message with information and a phone number numerous times is IGNORED at this hotel. You can guarantee that we will not return with our business. YOu may thank Kimerlin or Kimberly in your accounting office for her lack of personal service and follow through for the lost business in the future.
---
...

I've intentionally left you to handle pagination - let me know if you have difficulties.
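
On the first point above: the output shown in the question looks like HTML that has been entity-escaped (`&lt;span ...&gt;`), which is probably why the selector finds no real `span` elements in it. The following is only a minimal sketch of what parsing it would involve, not a recommended approach. The sample string is modeled on the question's output, and the names `ReviewTextParser` and `extract_reviews` are hypothetical. It uses only the standard library (`html.unescape` and `html.parser.HTMLParser`); the unescaped string could equally be handed to BeautifulSoup.

```python
import html
from html.parser import HTMLParser

# Assumed sample: a fragment shaped like the escaped output in the question.
escaped = (
    '&lt;span class="BVRRReviewText"&gt;'
    "Hotel accommodations and staff were fine"
    "&lt;/span&gt;"
)


class ReviewTextParser(HTMLParser):
    """Collects the text of every <span class="BVRRReviewText"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a matching span
        self.reviews = []   # one string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "BVRRReviewText" in classes:
            if self.depth == 0:
                self.reviews.append("")  # start a new review
            self.depth += 1              # track nested spans too

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data


def extract_reviews(escaped_markup):
    """Unescape the entities first, then parse the result as real markup."""
    parser = ReviewTextParser()
    parser.feed(html.unescape(escaped_markup))
    parser.close()
    return parser.reviews


reviews = extract_reviews(escaped)
print(reviews)
```

The real file also wraps this markup in JavaScript, so more cleanup would be needed before this step. That is why loading the actual reviews page with Selenium, as in the example above, is the simpler route.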
