Parsing an HTML a tag


Question

How can I parse an HTML file and collect only the A tags? I have a
start for the code but am unable to figure out how to finish it.
HTML_parse gets the data from the URL document. Thanks for the help.

from HTMLParser import HTMLParser  # Python 2 module name
import urllib

def HTML_parse(data):
    parser = MyHTMLParser()
    parser.feed(data)

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        pass  # unfinished: this is where an "a" tag would be recorded

    def handle_endtag(self, tag):
        pass  # unfinished

def read_page(url):
    """Return the entire content of the specified URL document."""
    connect = urllib.urlopen(url)  # Python 2 API
    data = connect.read()
    connect.close()
    return data

Recommended answer

I do not really know what you want to do. Getting the URLs from the a
tags of an HTML file? I think the easiest method would be a regular
expression.

import urllib, sre
html = urllib.urlopen("http://www.google.com").read()
sre.findall('href="([^>]+)"', html)

['/imghp?hl=de&tab=wi&ie=UTF-8',
'http://groups.google.de/grphp?hl=de&tab=wg&ie=UTF-8',
'/dirhp?hl=de&tab=wd&ie=UTF-8',
'http://news.google.de/nwshp?hl=de&tab=wn&ie=UTF-8',
'http://froogle.google.de/frghp?hl=de&tab=wf&ie=UTF-8',
'/intl/de/options/']

sre.findall('href=[^>]+>([^<]+)</a>', html)

['Bilder', 'Groups', 'Verzeichnis', 'News', 'Froogle',
'Mehr&nbsp;&raquo;', 'Erweiterte Suche', 'Einstellungen',
'Sprachtools', 'Werbung', 'Unternehmensangebote',
'Alles \xfcber Google', 'Google.com in English']

Google has some strange html, href without quotation marks: <a
href=http://www.google.com/ncr>Google.com in English</a>
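
For comparison with the two regular expressions above, here is a minimal sketch of the parser route the question started down. It is not code from the thread. It assumes Python 3, where the module is html.parser rather than HTMLParser, and the names AnchorCollector and collect_anchors are made up for illustration. It collects the href and the link text of each a tag, including the unquoted href mentioned above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while an <a> element is open
        self._href = None        # href of the open <a>, or None if absent
        self._text = []          # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def collect_anchors(html):
    """Hypothetical helper: return the (href, text) pairs found in html."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<p><a href=http://www.google.com/ncr>Google.com in English</a> '
          '<a href="/imghp?hl=de">Bilder</a></p>')
print(collect_anchors(sample))
```

The sketch ignores nested a elements and anchors that are never closed; whether that matters depends on the pages being read.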


George wrote:
> How can I parse an HTML file and collect only the A tags? I have a
> start for the code but am unable to figure out how to finish it.
> HTML_parse gets the data from the URL document. Thanks for the help.




Have you tried using Beautiful Soup?

http://www.crummy.com/software/BeautifulSoup/

"beza1e1" <an*************@googlemail.com> writes:
> I do not really know what you want to do. Getting the URLs from the a
> tags of an HTML file? I think the easiest method would be a regular
> expression.

I think this ranks as #2 on the list of "difficult one-day
hacks". Yeah, it's simple to write an RE that works most of the
time. It's a major PITA to write one that works in all the legal
cases. Getting one that also handles all the cases seen in the wild is
damn near impossible.

> import urllib, sre
> html = urllib.urlopen("http://www.google.com").read()
> sre.findall('href="([^>]+)"', html)

This fails in a number of cases. Whitespace around the "=" sign for
attributes. Quotes around other attributes in the tag (required by
XHTML). '>' in the URL (legal, but discouraged). Attributes quoted
with single quotes instead of double quotes, or just unquoted. It
misses IMG SRC attributes. It hands back relative URLs as such,
instead of resolving them to the absolute URL (which requires checking
for the base URL in the HEAD), which may or may not be acceptable.
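
Several of the failure cases listed above are easy to reproduce. The sketch below runs the quoted pattern, unchanged, against a few one-line documents. The sample documents are invented for illustration, and Python 3's re module is used in place of sre as an assumption about the reader's setup.

```python
import re

# The pattern from the quoted post, unchanged.
HREF_RE = re.compile(r'href="([^>]+)"')

# Hypothetical one-line documents, one per case.
cases = {
    "plain double quotes": '<a href="/a">x</a>',
    "whitespace around =": '<a href = "/a">x</a>',
    "single quotes": "<a href='/a'>x</a>",
    "unquoted": '<a href=/a>x</a>',
    "second quoted attribute": '<a href="/a" class="nav">x</a>',
}

for label, html in cases.items():
    # Print what the pattern captures for each document.
    print(f"{label:26} {HREF_RE.findall(html)}")
```

Only the first case yields the bare URL. The next three match nothing, and the last one captures everything up to the final quote in the tag, so the class attribute ends up inside the "URL".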

> Google has some strange html, href without quotation marks: <a
> href=http://www.google.com/ncr>Google.com in English</a>


That's not strange. That's just a bit unusual. It is perfectly legal,
though: any browser (or other HTML processor) that fails to handle it
is broken.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
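
On the point about relative URLs and the base URL in the HEAD, a rough sketch of one way to resolve them follows. It assumes Python 3 (html.parser and urllib.parse.urljoin), and the names LinkResolver and resolve_links are hypothetical. It uses the first base tag that carries an href, if there is one, and otherwise falls back to the address the page was fetched from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkResolver(HTMLParser):
    """Collect href values from <a> tags and resolve them to absolute URLs."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url     # replaced by the first <base href=...>, if any
        self._base_seen = False
        self.hrefs = []          # raw href values, in document order

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base" and not self._base_seen:
            # A relative base is itself resolved against the page URL.
            self.base = urljoin(self.base, href)
            self._base_seen = True
        elif tag == "a":
            self.hrefs.append(href)

    def absolute_urls(self):
        # Resolve after parsing so the base applies to every link.
        return [urljoin(self.base, href) for href in self.hrefs]


def resolve_links(html, page_url):
    """Hypothetical helper: absolute URLs of all <a href> values in html."""
    parser = LinkResolver(page_url)
    parser.feed(html)
    parser.close()
    return parser.absolute_urls()


page = ('<head><base href="http://example.org/docs/"></head>'
        '<a href="page.html">p</a>'
        '<a href=http://www.google.com/ncr>Google.com in English</a>')
print(resolve_links(page, "http://example.org/index.html"))
```

IMG SRC attributes are not collected here; picking them up would mean checking for the img tag and its src attribute in the same handler.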

