Data Mining IMDB Reviews - Only extracting the first 25 reviews


Question


I am currently trying to extract all the reviews of the Spiderman Homecoming movie, but I am only able to get the first 25. I was able to click Load More on IMDB to reveal all the reviews, since the page initially shows only the first 25, but for some reason I cannot mine the remaining reviews even after every review has been loaded. Does anyone know what I am doing wrong?

Below is the code I am running:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")

#Go to Google
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")

#Loop load more button
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break


#Scrape IMDB reviews
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")

#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')

#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')

#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')

#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,
    "Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')

print(title)
print(review)
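For what it's worth, the immediate problem in the snippet above is that requests.get(driver.current_url) re-downloads the page from scratch, so the server again returns only the first 25 reviews; everything the Load More clicks revealed exists only in the browser's DOM. Parsing driver.page_source instead keeps the loaded reviews. A minimal sketch of the idea, using a hard-coded stand-in string in place of driver.page_source:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source AFTER the load-more loop has run:
# the live DOM contains every loaded review, not just the first page.
page_source = """
<div id="main">
  <a class="title">Review one</a>
  <a class="title">Review two</a>
  <a class="title">Review three</a>
</div>
"""

# Parse the already-loaded DOM instead of re-fetching the URL with requests
soup = BeautifulSoup(page_source, "html.parser")
titles = [t.get_text(strip=True) for t in soup.find(id="main").select(".title")]
print(titles)  # -> ['Review one', 'Review two', 'Review three']
```

In the real script, replacing the requests.get call with soup = BeautifulSoup(driver.page_source, "html.parser") would let the rest of the scraping code see all loaded reviews.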

Solution

Well, actually, there's no need to use Selenium. The data is available by sending a GET request to the website's API in the following format:

https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY

where you have to provide a key for the paginationKey parameter in the URL (...&paginationKey=MY-KEY).

The key is found in the class load-more-data:

<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
            </div>
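Extracting that key can be sketched against a static snippet (the data-key value below is illustrative, not a real key):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for one _ajax response page; the data-key value is made up
html = '<div class="load-more-data" data-key="EXAMPLE-KEY-123"></div>'

soup = BeautifulSoup(html, "html.parser")
node = soup.find("div", class_="load-more-data")
# The attribute value becomes the paginationKey for the next request;
# when the div is absent there are no more pages to fetch
key = node["data-key"] if node else None
print(key)  # -> EXAMPLE-KEY-123
```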


So, to scrape all the reviews into a DataFrame, try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")

    # Collect this page's reviews before looking for the next pagination
    # key, so the final page (which carries no key) is not dropped
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

    # Find the pagination key for the next batch of reviews
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    # Update the `key` variable in order to scrape more reviews
    key = pagination_key["data-key"]

df = pd.DataFrame(data)
print(df)

Output (truncated):

                                                title                                             review
0                              Terrific entertainment  Spiderman: Far from Home is not intended to be...
1         THe illusion of the identity of Spider man.  Great story in continuation of spider man home...
2                       What Happened to the Bad Guys  I believe that Quinten Beck/Mysterio got what ...
3                                         Spectacular  One of the best if not the best Spider-Man mov...

...
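If, as in the original question, the goal is a CSV file rather than a printed DataFrame, the result can be written out with pandas (the two-row sample data here is hypothetical, standing in for the scraped dictionary):

```python
import pandas as pd

# Hypothetical sample standing in for the scraped {"title": [...], "review": [...]} data
data = {
    "title": ["Terrific entertainment", "Spectacular"],
    "review": ["Great fun from start to finish.", "One of the best Spider-Man movies."],
}
df = pd.DataFrame(data)

# index=False keeps pandas' row numbers out of the file
df.to_csv("Spiderman_Reviews.csv", index=False)
```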
