Get text inside a span class of a particular div


Question

I am scraping the T-Mobile website for reviews of the Samsung Galaxy S9. I can create a Beautiful Soup object from the HTML, but I cannot fetch the review text, which sits inside a span with a particular class. I also need to iterate through the review pages to collect all the reviews.

I have tried two approaches, but one raises an error and the other returns an empty list. I also cannot find the span class I need anywhere in the soup object.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

tmo_ratings_s9 = []

req = Request('https://www.t-mobile.com/cell-phone/samsung-galaxy-s9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tmo_soup_s9 = BeautifulSoup(webpage, 'html.parser')
tmo_soup_s9.prettify()
for review in tmo_soup_s9.find_all(class_="BVRRReviewText"):
    text = review.span.get_text(strip=True)
    tmo_soup_s9.append(text)

print(tmo_ratings_s9)


############################################################################

from urllib.request import urlopen
html = urlopen("https://www.t-mobile.com/cell-phone/samsung-galaxy-s9")

soup=BeautifulSoup(html)

ratings = soup.find_all('div', class_='BVRRReviewTextParagraph BVRRReviewTextFirstParagraph BVRRReviewTextLastParagraph')     
textofrep = ratings.get_text().strip()
tmo_ratings_s9.append(textofrep)

I expect to get the review text from all 8 pages of reviews and store it in an HTML file.

Recommended answer

You are not getting the data because the reviews are loaded dynamically by a script, so they are not in the HTML that urlopen downloads. You can try Selenium together with Scrapy.
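
The effect of dynamic loading can be illustrated on two hand-written HTML strings. This is only a sketch: the markup below is made up for illustration (only the class name BVRRReviewText comes from the question), and the real T-Mobile page may be structured differently. The helper name extract_reviews is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: roughly what urlopen() sees before any script has run.
STATIC_HTML = '<div id="reviews"><script src="reviews.js"></script></div>'

# Hypothetical markup: the same container after a browser has executed the script.
RENDERED_HTML = '''
<div id="reviews">
  <div class="BVRRReviewText"><span> Great phone </span></div>
  <div class="BVRRReviewText"><span>Battery could be better</span></div>
</div>
'''


def extract_reviews(html):
    """Return the stripped text of the first span inside each BVRRReviewText element."""
    soup = BeautifulSoup(html, 'html.parser')
    reviews = []
    # find_all() returns a list-like ResultSet, so get_text() has to be
    # called on each element, not on the result as a whole.
    for block in soup.find_all(class_='BVRRReviewText'):
        span = block.find('span')
        if span is not None:
            reviews.append(span.get_text(strip=True))
    return reviews


static_reviews = extract_reviews(STATIC_HTML)
rendered_reviews = extract_reviews(RENDERED_HTML)
print(static_reviews)    # nothing to find: the review markup is not there yet
print(rendered_reviews)  # the spans exist once the page has been rendered
```

The same function finds nothing in the unrendered markup and both reviews in the rendered one, which is consistent with the empty list in the question. The error in the second attempt most likely comes from calling get_text() directly on the result of find_all(). The spider below renders the page in a browser first, which is what makes the review markup available.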

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['t-mobile.com']
    start_urls = ['https://www.t-mobile.com/cell-phone/samsung-galaxy-s9']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Selenium drives a real browser so the review scripts get executed.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in the browser and wrap the rendered HTML in a Scrapy response.
        self.driver.get(response.url)
        body = self.driver.page_source.encode('utf-8')
        rendered = HtmlResponse(self.driver.current_url, body=body, encoding='utf-8')
        self.parse_response(rendered)

    def parse_response(self, response):
        tmo_ratings_s9 = []
        for review in response.css('#reviews div.BVRRContentReview'):
            # get() returns None when nothing matches, so fall back to an empty string.
            text = review.css('.BVRRReviewText::text').get(default='').strip()
            tmo_ratings_s9.append(text)

        print(tmo_ratings_s9)

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes.
        self.driver.quit()
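
The question also asks for the reviews to be stored in an HTML file. A minimal sketch of that last step, using only the standard library and assuming the review texts have already been collected into a list of strings. The function name reviews_to_html, the sample reviews, and the output file name reviews.html are all hypothetical.

```python
from html import escape
from pathlib import Path


def reviews_to_html(reviews, title='Reviews'):
    """Build a small HTML document with one list item per review."""
    # escape() keeps characters such as < and & in a review from breaking the markup.
    items = '\n'.join(f'    <li>{escape(text)}</li>' for text in reviews)
    return (
        '<!DOCTYPE html>\n'
        '<html>\n'
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        '<body>\n'
        '  <ul>\n'
        f'{items}\n'
        '  </ul>\n'
        '</body>\n'
        '</html>\n'
    )


# Made-up sample data standing in for the scraped review texts.
sample_reviews = ['Great phone', 'Camera <3 & battery']
page = reviews_to_html(sample_reviews)
Path('reviews.html').write_text(page, encoding='utf-8')
```

In the spider, the list built in parse_response could be passed to a function like this instead of being printed.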
