Python + web scraping + scrapy: How to get the links to all movies from an IMDb page?


Problem description

I have to scrape all movies from this IMDb page: https://www.imdb.com/list/ls055386972/.

My approach is first to scrape all the values of <a href="/title/tt0068646/?ref_=ttls_li_tt", i.e., to extract the /title/tt0068646/?ref_=ttls_li_tt portion, and then prepend 'https://www.imdb.com' to build the complete URL of the movie, i.e., https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt. But whenever I run response.xpath('//h3[@class]/a[@href]').extract(), it extracts the desired portion along with the movie title: [u'<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>', u'<a href="/title/tt0108052/?ref_=ttls_li_tt">Schindler\'s List</a>', ...]. I want only the "/title/tt0068646/?ref_=ttls_li_tt" portion.

How should I proceed?

Solution

import requests
from bs4 import BeautifulSoup

# Download the list page and parse it.
page = requests.get("https://www.imdb.com/list/ls055386972/")
soup = BeautifulSoup(page.content, 'html.parser')

# Each movie title sits inside an <h3 class="lister-item-header"> element;
# its first <a> child carries the relative link we want.
movies = soup.find_all('h3', attrs={'class': 'lister-item-header'})
for movie in movies:
    print(movie.a['href'])

Output:

/title/tt0068646/?ref_=ttls_li_tt
/title/tt0108052/?ref_=ttls_li_tt
/title/tt0050083/?ref_=ttls_li_tt
/title/tt0118799/?ref_=ttls_li_tt
.
.
.
.
/title/tt0088763/?ref_=ttls_li_tt
/title/tt0266543/?ref_=ttls_li_tt
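
Since the question itself uses Scrapy, here is a minimal sketch of the equivalent extraction inside a spider callback, assuming the list items still carry the lister-item-header class used in the answer above; the spider class and name are purely illustrative. The key change is selecting the @href attribute itself rather than the whole <a> element, and letting response.urljoin() prepend https://www.imdb.com.

import scrapy


class ImdbListSpider(scrapy.Spider):
    # Illustrative spider; only the XPath and urljoin calls matter here.
    name = "imdb_list"
    start_urls = ["https://www.imdb.com/list/ls055386972/"]

    def parse(self, response):
        # '@href' selects the attribute value itself, so each item is just
        # '/title/tt0068646/?ref_=ttls_li_tt' rather than the full <a> tag.
        for href in response.xpath('//h3[@class="lister-item-header"]/a/@href').extract():
            # urljoin() resolves the relative path against the page URL,
            # yielding https://www.imdb.com/title/... directly.
            yield {"url": response.urljoin(href)}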
