使用 BeautifulSoup 抓取 IMDb 页面 [英] Webscraping an IMDb page using BeautifulSoup

查看:31
本文介绍了使用 BeautifulSoup 抓取 IMDb 页面的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是 WebScraping/Python 和 BeautifulSoup 的新手,很难让我的代码正常工作.

I am new to WebScraping/Python and BeautifulSoup and am having difficulty getting my code to work.

我想抓取网址:http://m.imdb.com/feature/bornondate" 得到:

I would like to scrape the url: http://m.imdb.com/feature/bornondate" to get the:

  • 名人姓名
  • 名人形象
  • 职业
  • 最佳作品

该页面上的十位名人.我不确定我做错了什么.

for the ten celebrities on that page. I am not sure what I am doing wrong.

这是我的代码:

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

目前没有打印出来.另外,如果这很模糊,您能否解释一下如何仅使用名人姓名?

Currently nothing is getting printed out. Also if this to vague could you explain how to do just Name of Celebrity for instance?

如果有帮助,这里是第一个名人的 html 代码:

Here is the first celebrity's html code if that helps:

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

推荐答案

首先,IMDb 明确禁止屏幕抓取 使用条件":

First of all, screen scraping is explicitly forbidden by the IMDb "Conditions of Use":

机器人和屏幕抓取:您不得使用数据挖掘、机器人、屏幕抓取,或类似的数据收集和提取工具本网站,除非我们明确书面同意如下所述.

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

尝试探索 IMDb JSON API 而不是网页抓取方法.

Try exploring the IMDb JSON API instead of a web-scraping approach.

您当前的问题是 - 通过单独调用IMDb APIjavascript逻辑加载特定日期出生的人的列表 涉及.

Your current problem is - the list of people born on the specific date is loaded via a separate call to the IMDb API and with a javascript logic involved.

现在最简单的选择是切换到 selenium 浏览器自动化工具.使用 headless PhantomJS 浏览器的工作示例:

The easiest option right now would be to switch to selenium browser automation tool. Working example using a headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

打印:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn

这篇关于使用 BeautifulSoup 抓取 IMDb 页面的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆