Get contents of <a> tags using Python

Question

Assuming I have HTML read into my program like this:

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>

How do I grab the contents of the text node inside each <a> tag? I would like to end up printing something similar to this line in the terminal:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT

So far I have the following code, which extracts the href links fine, but I'm not sure how to extract the link text itself. I'm thinking of overriding handle_data(self, data) from the sgmllib module, but so far I can't think of a way to do it.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

Thanks!

Recommended answer

The simplest option is probably BeautifulSoup (be sure to use a 3.0.* release, 3.0.8 or higher, and not 3.1.*, unless you're on Python 3; see here).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string
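If a third-party library is not an option, the handle_data idea from the question can also be made to work with the standard library alone. The following is a minimal sketch, assuming Python 3, where sgmllib no longer exists and html.parser.HTMLParser plays a similar role. The class name AnchorLister and the attributes links, _href, and _chunks are made up for this illustration; the sample string is one line from the question.

```python
from html.parser import HTMLParser


class AnchorLister(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently being read, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside an <a> that carried an href
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


sample = (
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">'
    'TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>'
)
parser = AnchorLister()
parser.feed(sample)
parser.close()
for href, text in parser.links:
    print(href, "-", text)
```

The parser remembers the href when an <a> opens, accumulates text while it is open, and records the pair when it closes, so text outside the link (such as the location in the <font> tag) is ignored.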

BeautifulSoup produces unicode strings. If that's a problem, encode them with whatever encoding you need to get byte strings.
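As a small illustration of that encoding step, here is a sketch assuming Python 3, where every str is already unicode and bytes is the byte-string type. The link text below is a made-up value, not taken from the page in the question, and UTF-8 is only one possible choice of encoding.

```python
# Hypothetical link text containing a non-ASCII character
title = "Café Manager"

# Encode to bytes, e.g. before writing to a binary file or socket
encoded = title.encode("utf-8")

# Decoding with the same encoding restores the original text
decoded = encoded.decode("utf-8")

print(type(encoded).__name__, len(title), len(encoded))
```

The encoded form is one byte longer than the text has characters, because "é" takes two bytes in UTF-8.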
