Decomposing HTML to link text and target

Views: 156 Published: 2016/8/5 19:04:33 Tags: python html regex beautifulsoup


Question

Given a string such as

<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the URL and the text?

Update

I'm using Beautiful Soup, and I am unable to figure out how to do that.

I did:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

I get:

*link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]*  ...
...

Why am I missing the content?

Edit: elaborated on 'stuck' as advised :)

Recommended answer

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.

Edit:

I think you want:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL right there inside the constructor call: if the request goes wrong, it could get ugly.
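One way to read that warning is to keep the download in its own step and handle failures before any parsing happens. The following is a minimal sketch under that reading, written for Python 3, where `urllib.request.urlopen` takes the place of the Python 2 `urllib.urlopen` used in this thread. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original answer.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw bytes at `url`, or None if the download fails.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError covers unreachable hosts and HTTP errors (HTTPError is a
        # subclass); OSError covers socket timeouts; ValueError is raised for
        # malformed URLs such as one with no scheme.
        print("could not fetch %s: %s" % (url, exc))
        return None


source = fetch_html("not-a-url")
if source is None:
    print("nothing to parse")
```

With the download separated like this, the parser is only ever handed bytes that were actually retrieved, and a failed request can be reported or retried without touching the parsing code.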

Edit 2:

This should show you all the links in a page:

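import urllib
import urlparse

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

# findAll('a') returns every <a> tag in the document
for item in soup.findAll('a'):
    try:
        # item['href'] raises KeyError when the tag has no href attribute
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # Not a valid link (e.g. a named anchor)
        pass
    else:
        print link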