CSS selectors to be used for scraping specific links

Problem description

I am new to Python and working on a scraping project. I am using Firebug to copy the CSS path of the links I need. I am trying to collect the links under the "UPCOMING EVENTS" tab on http://kiascenehai.pk/, purely as an exercise in learning how to get specific links.

I am looking for a fix for this problem, and also for suggestions on how to retrieve specific links using CSS selectors.

from bs4 import BeautifulSoup
import requests

url = "http://kiascenehai.pk/"

r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "html.parser")

for link in soup.select("html body div.body-outer-wrapper div.body-wrapper.boxed-mode div.main-outer-wrapper.mt30 div.main-wrapper.container div.row.row-wrapper div.page-wrapper.twelve.columns.b0 div.row div.page-wrapper.twelve.columns div.row div.eight.columns.b0 div.content.clearfix section#main-content div.row div.six.columns div.small-post-wrapper div.small-post-content h2.small-post-title a"):
    print(link.get('href'))

Solution

First of all, that page requires you to select a city, and it records the choice in a cookie. Use a Session object so that the cookie persists across requests:

s = requests.Session()
s.post('http://kiascenehai.pk/select_city/submit_city', data={'city': 'Lahore'})
response = s.get('http://kiascenehai.pk/')

Now the response contains the actual page content rather than a redirect to the city selection page.
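The two session calls above could be wrapped roughly as follows. This is only a sketch: the names `fetch_home` and `landed_on_city_page` are hypothetical, and the check assumes the city selection page lives under a `select_city` path (inferred from the submit URL, not confirmed). The site itself may no longer respond as described.

```python
import requests  # only needed for the usage shown in the comment below

HOME_URL = "http://kiascenehai.pk/"
CITY_URL = "http://kiascenehai.pk/select_city/submit_city"


def fetch_home(session, city="Lahore"):
    """Submit the city choice, then fetch the home page with the same session."""
    # The POST lets the server set the city cookie in the session's cookie jar.
    session.post(CITY_URL, data={"city": city}).raise_for_status()
    # The same session sends that cookie back with the GET.
    response = session.get(HOME_URL)
    response.raise_for_status()
    return response


def landed_on_city_page(response):
    """Guess whether we ended up on the city selection page anyway."""
    # Assumption: the selection page URL contains "select_city".
    return "select_city" in response.url


# Usage (performs real network requests):
#     response = fetch_home(requests.Session())
#     if landed_on_city_page(response):
#         print("city cookie was not accepted")
```

Passing the session in as an argument keeps the cookie handling in one place and lets the same session be reused for further pages.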

Next, keep your CSS selector no larger than needed. This page gives you little to go on because it uses a grid layout, so we first need to zoom in on the right row:

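# Parse the page that was fetched through the session
soup = BeautifulSoup(response.content, 'html.parser')

upcoming_events_header = soup.find('div', class_='featured-event')
upcoming_events_row = upcoming_events_header.find_next(class_='row')

for link in upcoming_events_row.select('h2 a[href]'):
    print(link['href'])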
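To try the approach without touching the network, the same lookup can be run against a small inline document. The markup below is hypothetical and only borrows the class names used above; the real page may be structured differently. The helper name `upcoming_event_links` is likewise made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a "featured-event" header followed by a grid row of posts.
SAMPLE_HTML = """
<div class="featured-event"><h1>Upcoming Events</h1></div>
<div class="row">
  <div class="six columns"><h2 class="small-post-title"><a href="/event/1">One</a></h2></div>
  <div class="six columns"><h2 class="small-post-title"><a href="/event/2">Two</a></h2></div>
  <div class="six columns"><h2 class="small-post-title"><a>No link</a></h2></div>
</div>
<div class="row"><h2><a href="/other">Other section</a></h2></div>
"""


def upcoming_event_links(html):
    """Return the hrefs in the first 'row' element after the featured-event header."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("div", class_="featured-event")
    if header is None:
        return []  # header missing, e.g. we were served a different page
    row = header.find_next(class_="row")
    if row is None:
        return []
    # A short selector is enough once the search is scoped to the right row;
    # [href] skips anchors that carry no link.
    return [a["href"] for a in row.select("h2 a[href]")]


print(upcoming_event_links(SAMPLE_HTML))
```

Scoping the search to one row first means the selector does not have to spell out the whole grid hierarchy, so it is less likely to break when the layout classes change.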
