Webscraping an IMDb page using BeautifulSoup
Problem description
I am new to WebScraping/Python and BeautifulSoup and am having difficulty getting my code to work.
I would like to scrape the URL http://m.imdb.com/feature/bornondate to get the:
- Name of the celebrity
- Celebrity Image
- Profession
- Best Work
for the ten celebrities on that page. I am not sure what I am doing wrong.
Here is my code:
    import urllib2
    from bs4 import BeautifulSoup

    url = 'http://m.imdb.com/feature/bornondate'
    test_url = urllib2.urlopen(url)
    readHtml = test_url.read()
    test_url.close()

    soup = BeautifulSoup(readHtml)

    # Using it track the number of Actor
    count = 0

    # Fetching the value present within tag results
    person = soup.findChildren('section', 'posters list')

    # Changing the person into an iterator
    iterperson = iter(person[0].findChildren('a'))

    # Finding 'a' in iterperson. Every 'a' tag contains information of a person
    for a in iterperson:
        imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
        person = a.findChildren('div', 'label')
        title = person[0].find('span', 'title').contents[0]
        ##profession = person[0].find('div', 'detail').contents[0].split(,)
        ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

        print '*******************************IMDB People Born Today***********************************'

        # Printing the S.No of the person
        print 'S.No. --> ',
        count += 1
        print count

        # Printing the title/name of the person
        print 'Title --> ' + title

        # Printing the Image Source of the person
        print 'Image Source --> ', imgSource

        # Printing the Profession of the person
        ##print 'Profession --> ', profession

        # Printing the Best work of the person
        ##print 'Best Work --> ', bestWork
Currently nothing is getting printed out. Also, if this is too vague, could you explain how to get just the name of the celebrity, for instance?
Here is the first celebrity's html code if that helps:
    <section class="posters list">
      <h1>March 7</h1>
      <a href="/name/nm0186505/" class="poster ">
        <img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59">
        <div class="label">
          <span class="title">Bryan Cranston</span>
          <div class="detail">Actor, "Ozymandias"</div>
        </div>
      </a>
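For the "just the name" part, here is a minimal sketch of pulling the text out of every `<span class="title">` in markup shaped like the sample above. It is written for Python 3 and uses the standard-library `html.parser` in place of BeautifulSoup so that it runs without third-party packages; the class name `TitleSpanParser` is made up for illustration. It assumes the poster markup is already present in the HTML string being parsed, and says nothing about whether the live page delivers it that way.

```python
from html.parser import HTMLParser


class TitleSpanParser(HTMLParser):
    """Collects the text of every <span class="title"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self.names.append("")

    def handle_data(self, data):
        if self._in_title:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False


# The first celebrity's markup, copied from the sample in the question.
sample_html = (
    '<section class="posters list"><h1>March 7</h1>'
    '<a href="/name/nm0186505/" class="poster ">'
    '<img src="http://ia.media-imdb.com/images/M/'
    'MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" '
    'width="40" height="59">'
    '<div class="label"><span class="title">Bryan Cranston</span>'
    '<div class="detail">Actor, "Ozymandias"</div></div></a></section>'
)

parser = TitleSpanParser()
parser.feed(sample_html)
parser.close()

names = [name.strip() for name in parser.names]
print(names)  # ['Bryan Cranston']
```

With BeautifulSoup the same lookup is the `find('span', 'title')` call already used in the question; the sketch only shows that the selector logic itself is sound when the markup is there.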
Solution
First of all, screen scraping is explicitly forbidden by the IMDb "Conditions of Use":
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
Try exploring the IMDb JSON API instead of a web-scraping approach.
Your current problem is that the list of people born on the specific date is loaded via a separate call to the IMDb API, with JavaScript logic involved, so it is not in the HTML that urllib2 downloads.
The easiest option right now would be to switch to the selenium browser automation tool. Working example using a headless PhantomJS browser:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS()
    driver.get("http://m.imdb.com/feature/bornondate")

    # waiting for posters to load
    wait = WebDriverWait(driver, 10)
    posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

    # extracting the data poster by poster
    for a in posters.find_elements_by_css_selector('a.poster'):
        img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
        person = a.find_element_by_css_selector('div.detail').text
        title = a.find_element_by_css_selector('span.title').text

        print img, person, title
Prints:
    http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
    http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
    http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
    http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
    http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
    http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
    http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
    http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
    http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
    http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
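The question also asked for the profession and the best work as separate values, while the `div.detail` text holds both in one string such as `Actor, "Ozymandias"`. Below is a rough sketch (Python 3, standard library only) of one way to split that string and to rewrite the thumbnail URL the same way the `split('._V1.')` expression does. The function names `split_detail` and `resize_image_url` are made up for illustration, and the sketch assumes the detail text always follows the `Profession, "Title"` pattern seen in the output above.

```python
def split_detail(detail):
    """Split text like 'Actor, "Ozymandias"' into (profession, best_work)."""
    # partition() splits at the first comma only, so commas inside the
    # title stay with the title.
    profession, _, best_work = detail.partition(",")
    return profession.strip(), best_work.strip().strip('"')


def resize_image_url(src, suffix="._V1_SX214_AL_.jpg"):
    """Replace everything from '._V1.' onward with the larger-image suffix.

    If the marker is missing, the URL is returned unchanged instead of
    having the suffix appended to it.
    """
    base, marker, _ = src.partition("._V1.")
    return base + suffix if marker else src


detail_text = 'Actor, "Ozymandias"'
thumb = (
    "http://ia.media-imdb.com/images/M/"
    "MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg"
)

profession, best_work = split_detail(detail_text)
print(profession)               # Actor
print(best_work)                # Ozymandias
print(resize_image_url(thumb))
```

Using `partition` rather than `split(',')` is a deliberate choice here: a title containing a comma would otherwise be cut into several pieces.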