Decomposing HTML to link text and target


Question

Given an HTML link like

<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the URL and the text?

Update

I'm using Beautiful Soup, and am unable to figure out how to do that.

I did

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

and I get

*link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]*  ...
...

Why am I missing the content?

Edit: elaborated on 'stuck' as advised :)

Recommended answer

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.

Edit:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inline like that: if the request fails, the error surfaces in the middle of the parsing call and is awkward to handle.
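One way to act on that advice is to fetch the page in its own step and handle failures before any parsing happens. The following is a minimal sketch in Python 3 using only the standard library; `fetch_html` and its `timeout` default are hypothetical names and values chosen for illustration, not part of Beautiful Soup or of the original answer.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper, not a library function.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # OSError covers urllib.error.URLError, HTTP errors and timeouts;
        # ValueError covers malformed URLs such as a missing scheme.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

The caller can then check for `None` and only hand real content to the parser, so network problems and parsing problems stay separate.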

Edit 2:

This should show you all the links in a page:

import urllib
import urlparse

from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

# findAll('a') returns every anchor tag in the document
for item in soup.findAll('a'):
    try:
        # item['href'] raises KeyError when the anchor has no href
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # Not a valid link
        pass
    else:
        print link
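The loop above prints only the link targets. As for the missing content in the question: in Beautiful Soup, `link.content` is treated as a search for a child tag named `content`, which is why it comes back as `None`; the text of a tag is available through `link.string` or `link.contents`. To illustrate isolating both the target and the text, here is a minimal sketch that uses the standard library's `html.parser` instead of Beautiful Soup, so that it runs in Python 3 without extra installs. `LinkCollector` is a hypothetical class name used only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element.

    LinkCollector is a hypothetical name used for illustration.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open
        self._text = []        # text fragments seen inside it
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")  # None if no href
            self._text = []

    def handle_data(self, data):
        # Keep text only while inside an <a> element, including
        # text nested in child tags such as <b>.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


parser = LinkCollector()
parser.feed('<a href="urltxt" class="someclass" close="true">texttxt</a>')
parser.close()
print(parser.links)  # [('urltxt', 'texttxt')]
```

This sketch does not handle nested anchors or badly broken markup, which is where a dedicated library such as Beautiful Soup is the safer choice.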
