Webscraping an IMDb page using BeautifulSoup

Question

I am new to WebScraping/Python and BeautifulSoup and am having difficulty getting my code to work.

I would like to scrape the URL http://m.imdb.com/feature/bornondate to get the:

  • Name of the celebrity
  • Celebrity Image
  • Profession
  • Best Work

for the ten celebrities on that page. I am not sure what I am doing wrong.

Here is my code:

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Counter used to track the number of actors
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

Currently nothing is printed. Also, if this is too vague, could you explain how to get just the name of each celebrity, for instance?

Here is the first celebrity's HTML, in case that helps:

<section class="posters list">
<h1>March 7</h1>

    <a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>

Solution

First of all, screen scraping is explicitly forbidden by the IMDb "Conditions of Use":

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

Try exploring the IMDb JSON API instead of a web-scraping approach.


Your current problem is that the list of people born on the specific date is loaded via a separate call to the IMDb API, with JavaScript logic involved, so the poster links are not in the HTML that urllib2 downloads.
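To show that the selectors themselves are not the culprit, here is a minimal sketch, assuming the `bs4` package is installed and using an abridged copy of the HTML fragment from the question as a stand-in for the rendered page. The helper name `extract_people` and the bare-section input are made up for illustration; the exact HTML that urllib2 receives is an assumption.

```python
from bs4 import BeautifulSoup

# Abridged markup copied from the question. In practice this would be the
# page source after the JavaScript has run, not the raw urllib2 download.
rendered_html = '''
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
</section>
'''


def extract_people(markup):
    """Return a list of (name, detail) tuples, one per poster link (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    people = []
    for a in soup.select("section.posters a.poster"):
        name = a.select_one("span.title").get_text(strip=True)
        detail = a.select_one("div.detail").get_text(strip=True)
        people.append((name, detail))
    return people


# With the poster links present, the name is easy to pull out.
people = extract_people(rendered_html)

# Assumed shape of the page before JavaScript fills it in: no poster links.
empty = extract_people('<section class="posters list"></section>')
```

With the fragment from the question, `people` holds one tuple for Bryan Cranston, while the bare section yields an empty list. If the downloaded page looks like that bare section, the loop in the question never runs, which would match the "nothing is printed" symptom.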

The easiest option right now would be to switch to the selenium browser automation tool. Here is a working example using a headless PhantomJS browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
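The detail text in the output above combines the profession and the best work in one string, while the question asks for them separately. A minimal sketch of one way to split them, assuming every detail string has the form `Profession, "Best Work"` as in the lines above. The function names `split_detail` and `full_size_image` are made up for illustration.

```python
def split_detail(detail):
    """Split 'Profession, "Best Work"' into (profession, best_work).

    Only the first comma is used as the separator, so a comma inside
    the title would stay part of the best-work string.
    """
    profession, _, best_work = detail.partition(",")
    return profession.strip(), best_work.strip().strip('"')


def full_size_image(src):
    """Apply the same URL rewrite as the loop above.

    Keeps everything before '._V1.' and appends a fixed size suffix.
    """
    return src.split("._V1.")[0] + "._V1_SX214_AL_.jpg"


profession, best_work = split_detail('Actor, "Ozymandias"')
```

Inside the loop, `split_detail(person)` could replace the single `person` value, since that variable holds the text of `div.detail`.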
