How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?


Problem description

I am new to Python and am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and get its CSS path, but this code returns nothing. I am looking for a fix, and also for some suggestions on how to choose proper CSS selectors to retrieve the desired links from any site. I wrote this piece of code:

from bs4 import BeautifulSoup
import requests

url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text

# Name the parser explicitly so results do not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
# Full CSS path copied from Firebug (this is the selector that matches nothing)
for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print(link.get('href'))

Solution

The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.

If you want the Upcoming Events links, select just the first <div class="events-horizontal">, then grab the <a href="..."> tags inside each <div class="title">, which gives you the links on the event titles:

# events-horizontal is a class on the div, so select it with '.', not '#'
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
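
As a rough, self-contained illustration of the approach above, the sketch below runs the short selectors against a small hand-written HTML snippet. The markup, the ids "eh-1" and "eh-2", and the href values are made up for the example and only mimic the structure described here; the real page may differ. It also shows that a selector hard-coding an id that is not in the markup matches nothing, which is one way an over-specific path can fail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above;
# the real page's HTML may differ.
SAMPLE_HTML = """
<html><body>
  <div id="eh-1" class="events-horizontal">
    <ul>
      <li><div class="title"><a href="/e/concert">Concert</a></div></li>
      <li><div class="title"><a href="/e/expo">Expo</a></div></li>
      <li><div class="title"><a>No link here</a></div></li>
    </ul>
  </div>
  <div id="eh-2" class="events-horizontal">
    <ul>
      <li><div class="title"><a href="/e/other">Other</a></div></li>
    </ul>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one() returns the first match in document order, or None if
# nothing matches, so guard against None before using the result.
upcoming_events_div = soup.select_one("div.events-horizontal")

links = []
if upcoming_events_div is not None:
    # a[href] skips anchors that have no href attribute
    links = [a["href"] for a in upcoming_events_div.select("div.title a[href]")]

# A selector that hard-codes an id absent from this markup matches nothing.
brittle = soup.select("div#eh-1748056798.events-horizontal div.title a[href]")

print(links)    # ['/e/concert', '/e/expo']
print(brittle)  # []
```

Anchoring on one stable class and then searching inside that element keeps the selector short, so unrelated changes higher up in the page layout are less likely to break it.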

Note that you should not use r.text; pass r.content instead and let BeautifulSoup decode the bytes to Unicode. See "Encoding issue of a character in utf-8".
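
To make the encoding point concrete, here is a small offline sketch. The bytes below stand in for r.content and are invented for the example. The wrong decode simulates what r.text can produce when the codec guessed from the HTTP headers does not match the page (requests may fall back to ISO-8859-1 for text responses whose Content-Type header has no charset). No network access is involved.

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: UTF-8 bytes with a <meta charset> declaration.
raw_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>Caf\u00e9</p></body></html>"
).encode("utf-8")

# Simulates r.text when the bytes were decoded with the wrong codec.
wrongly_decoded = raw_bytes.decode("iso-8859-1")

# Given a str, BeautifulSoup trusts it as-is; the damage is already done.
soup_from_text = BeautifulSoup(wrongly_decoded, "html.parser")

# Given bytes, BeautifulSoup detects the encoding itself (it can use the
# <meta charset> declaration in the document).
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")

text_result = soup_from_text.p.get_text()
bytes_result = soup_from_bytes.p.get_text()

print(repr(text_result))   # mojibake: 'CafÃ©'
print(repr(bytes_result))  # 'Café'
```

In the scraping code this amounts to passing r.content rather than r.text as the first argument to BeautifulSoup.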
