如何抓取 Flipkart 评论中的“阅读更多"中的评论数据 [英] How to scrape review data present in Read more in Flipkart reviews

查看:32
本文介绍了如何抓取 Flipkart 评论中的“阅读更多"中的评论数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试抓取 Flipkart 以使用请求和beautifulsoup 包提取产品评论.如何取出这些评论中存在的阅读更多点击事件中的数据.

I am trying to scrape Flipkart to extract reviews for a product using request and beautifulsoup package.how can take out data present in Read more click event present in those review.

推荐答案

from selenium import webdriver
from selenium.webdriver.common.by import By
from contextlib import closing
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time
import urllib2
import re
from bs4 import BeautifulSoup
import unicodedata

def remove_non_ascii_1(text):

    return ''.join([i if ord(i) < 128 else ' ' for i in text])

with closing(Firefox()) as browser:
    site = "https://www.flipkart.com/asus-zenfone-2-laser-ze550kl-black-16-gb/product-reviews/itme9j58yzyzqzgc?pid=MOBE9J587QGMXBB7"
    browser.get(site)

    file = open("review.txt", "w")

    for count in range(1, 10):
        nav_btns = browser.find_elements_by_class_name('_33m_Yg')

        button = ""

        for btn in nav_btns:
            number = int(btn.text)
            if(number==count):
                button = btn
                break

        button.send_keys(Keys.RETURN)
        WebDriverWait(browser, timeout=10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "_2xg6Ul")))

        read_more_btns = browser.find_elements_by_class_name('_1EPkIx')


        for rm in read_more_btns:
            browser.execute_script("return arguments[0].scrollIntoView();", rm)
            browser.execute_script("window.scrollBy(0, -150);")
            rm.click()

        page_source = browser.page_source

        soup = BeautifulSoup(page_source, "lxml")
        ans = soup.find_all("div", class_="_3DCdKt")


        for tag in ans:
            title = unicode(tag.find("p", class_="_2xg6Ul").string).replace(u"\u2018", "'").replace(u"\u2019", "'")
            title = remove_non_ascii_1(title)
            title.encode('ascii','ignore')
            content = tag.find("div", class_="qwjRop").div.prettify().replace(u"\u2018", "'").replace(u"\u2019", "'")
            content = remove_non_ascii_1(content)
            content.encode('ascii','ignore')
            content = content[15:-7]

            votes = tag.find_all("span", class_="_1_BQL8")
            upvotes = int(votes[0].string)
            downvotes = int(votes[1].string)

            file.write("Review Title : %s\n\n" % title )
            file.write("Upvotes : " + str(upvotes) + "\n\nDownvotes : " + str(downvotes) + "\n\n")
            file.write("Review Content :\n%s\n\n\n\n" % content )

    file.close()

用法:

  1. 通过运行 pip install bs4 selenium 来安装需求.
  2. 将 geckodriver 添加到 PATH.遵循这些说明.
  3. 将产品的链接放在脚本内的站点变量中.
  4. 通过运行 python scrape.py 来运行脚本.
  5. 评论将保存在文件 review.txt 中.

这篇关于如何抓取 Flipkart 评论中的“阅读更多"中的评论数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆