Using Beautiful Soup to entangle bookmarks.html


Problem Description

Hi,

I'm trying to use the Beautiful Soup package to parse through the
"bookmarks.html" file which Firefox exports all your bookmarks into.
I've been struggling with the documentation trying to figure out how to
extract all the urls. Has anybody got a couple of longer examples using
Beautiful Soup I could play around with?

Thanks,
Martin.
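For reference, a minimal Beautiful Soup sketch that pulls every bookmark URL out of such an export (this assumes the modern `bs4` package; the sample markup below is an invented fragment in the Netscape-bookmark shape Firefox writes, not a real export):

```python
from bs4 import BeautifulSoup

# A small invented sample in the NETSCAPE-Bookmark-file-1 shape Firefox
# exports; real files are longer and the tag soup is just as sloppy.
sample = """
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
    <DT><A HREF="http://example.com/" ADD_DATE="0">Example</A>
    <DT><A HREF="http://python.org/">Python</A>
</DL><p>
"""

soup = BeautifulSoup(sample, "html.parser")
# Every bookmark is an <A> tag carrying an HREF attribute; bs4
# normalizes tag and attribute names to lowercase.
links = soup.find_all("a", href=True)
urls = [a["href"] for a in links]
titles = [a.get_text() for a in links]
print(urls)    # ['http://example.com/', 'http://python.org/']
```

Because Beautiful Soup tolerates the unclosed `<DT>` and `<p>` tags, no cleanup of the export is needed first.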

Recommended Answer

Francach schrieb:

> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?

Why do you use BeautifulSoup on that? It's generated content, and I
suppose it is well-formed, most probably even xml. So use a standard
parser here, better yet something like lxml/elementtree.

Diez
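If the export really were well-formed markup, as Diez suggests, the standard-library elementtree parser would be enough. A minimal sketch of that approach (note the sample here is a hand-tidied fragment; real Firefox exports leave `<DT>` and `<p>` unclosed, so they would need cleanup or an HTML-tolerant parser first):

```python
import xml.etree.ElementTree as ET

# A tidied, well-formed version of a bookmarks fragment. A raw Firefox
# export would make ET.fromstring() raise a ParseError.
xhtml = """<dl>
  <dt><a href="http://example.com/">Example</a></dt>
  <dt><a href="http://python.org/">Python</a></dt>
</dl>"""

root = ET.fromstring(xhtml)
# iter('a') walks every <a> element in document order.
urls = [a.get("href") for a in root.iter("a")]
print(urls)    # ['http://example.com/', 'http://python.org/']
```

The strictness is the trade-off: elementtree is fast and standard, but it rejects exactly the kind of sloppy markup the bookmarks file contains.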

Diez B. Roggisch wrote:

> suppose it is well-formed, most probably even xml.

Maybe not. Otherwise, why would there be a script like this one[1]?
Anyway, I found that and other scripts that work with firefox
bookmarks.html files with a quick search [2]. Perhaps you will find
something there that is helpful.

[1]: http://www.physic.ut.ee/~kkannike/en...e/bookmarks.py
[2]: http://www.google.com/search?q=firef...ks.html+python

Waylan


Diez B. Roggisch写道:
Diez B. Roggisch wrote:

Francach schrieb:
Francach schrieb:

>

我正在尝试使用Beautiful Soup包来解析
bookmarks.html。 Firefox导出所有书签的文件。
我一直在努力弄清楚如何提取所有网址的文档。有没有人使用
美丽的汤我可以玩的几个更长的例子?
>Hi,

I''m trying to use the Beautiful Soup package to parse through the
"bookmarks.html" file which Firefox exports all your bookmarks into.
I''ve been struggling with the documentation trying to figure out how to
extract all the urls. Has anybody got a couple of longer examples using
Beautiful Soup I could play around with?




为什么要使用BeautifulSoup?它是生成的内容,而且我认为它是格式良好的,甚至可能是xml。所以在这里使用一个标准的

解析器,更好的是像lxml / elementtree这样的东西


Diez



Why do you use BeautifulSoup on that? It''s generated content, and I
suppose it is well-formed, most probably even xml. So use a standard
parser here, better yet somthing like lxml/elementtree

Diez




Once upon a time I have written for my own purposes some code on this
subject, so maybe it can be used as a starter (tested a bit, but
consider its status as a kind of alpha release):

<code>
from urllib import urlopen
from sgmllib import SGMLParser

class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
    # provides only HREFs <a href="someURL"> for links to other pages, skipping
    # references to:
    #   - internal links on the same page : "#..."
    #   - email addresses : "mailto:..."
    # and stripping any appended internal link info, so that e.g.:
    #   - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only
    # ---
    # reset() overwrites an empty function available in the SGMLParser class
    def reset(self):
        SGMLParser.reset(self)
        self.A_HREFs = []
    #: def reset(self)

    # start_a() overwrites an empty function available in the SGMLParser class
    # from which this class is derived. start_a() will be called each time
    # SGMLParser detects an <a ...> tag within the feed(ed) HTML document:
    def start_a(self, tagAttributes_asListOfNameValuePairs):
        for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
            if attrName == 'href':
                if attrValue[0] != '#' and attrValue[:7] != 'mailto:':
                    if attrValue.find('#') >= 0:
                        attrValue = attrValue[:attrValue.find('#')]
                    #: if
                    self.A_HREFs.append(attrValue)
                #: if
            #: if
        #: for
    #: def start_a(self, tagAttributes_asListOfNameValuePairs)
#: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)

# ------------------------------------------------------------------------------
# ---
# Execution block:
fileLikeObjFrom_urlopen = urlopen('http://www.google.com')  # set URL
mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()

for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
    print href
#: for
</code>

Claudio Grondi
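The snippet above is Python 2 and relies on `sgmllib`, which was removed in Python 3. The same idea ports to the standard-library `html.parser` module; a hedged sketch (the class and attribute names here are my own, not Claudio's):

```python
from html.parser import HTMLParser

class HrefLister(HTMLParser):
    """Collects the href targets of <a> tags, skipping "#..." fragments
    and mailto: addresses, and trimming any "#fragment" suffix."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so a_hrefs is always set.
        super().reset()
        self.a_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                if value.startswith("#") or value.startswith("mailto:"):
                    continue
                # keep only the part before any "#fragment"
                self.a_hrefs.append(value.split("#", 1)[0])

parser = HrefLister()
parser.feed('<a href="http://example.com/page#top">x</a> '
            '<a href="#local">y</a> <a href="mailto:a@b.c">z</a>')
parser.close()
print(parser.a_hrefs)   # ['http://example.com/page']
```

`handle_starttag` plays the role that `start_a` played in `SGMLParser`, receiving the lowercased tag name and its attributes as (name, value) pairs.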

