HTMLParser problems.


Problem description


I'm trying to take a webpage that has an nxn table of entries (bus times) and
convert it to a 2D array (a list of lists). Initially this was simple, but I
need to be able to access whole 'columns' of data, so the 2D array cannot be
sparse. In the HTML file I'm parsing, however, there can be empty cells, which
are represented in the table as &nbsp; entities. The sparse output breaks my
ability to use entire columns and have entries correspond properly.

Is there a simple way to tell the parser to return, say, "-1" or "NaN"
whenever it sees an &nbsp; in table data?
The HTMLParser documentation is a bit... terse. I was considering using
the handle_entityref() method, but I would assume the data has already been
parsed at that point.

I could try:
def handle_entityref(self,entity):
    if self.in_td == 1:
        if entity == "nbsp":
            self.row.append(-1)

But that seems ugly... (comments?).
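
For readers on Python 3, here is a minimal sketch of the same idea. It assumes the `html.parser` module, where `handle_entityref()` is only called if the parser is created with `convert_charrefs=False`. The class name `CellParser`, the helper `parse_rows()`, and the sample markup are made up for illustration and are not from the original post.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect table cells row by row, recording -1 for &nbsp; cells."""

    def __init__(self):
        # convert_charrefs=False keeps entities out of handle_data(),
        # so handle_entityref() is called for them instead.
        super().__init__(convert_charrefs=False)
        self.in_td = False
        self.matrix = []
        self.row = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_data(self, data):
        data = data.strip()
        if self.in_td and data:
            self.row.append(data)

    def handle_entityref(self, name):
        # Called with the entity name without "&" and ";".
        if self.in_td and name == "nbsp":
            self.row.append(-1)

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and self.row:
            self.matrix.append(self.row)
            self.row = []


# Hypothetical sample markup, not taken from the real timetable page.
SAMPLE = "<table><tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr></table>"


def parse_rows(html):
    parser = CellParser()
    parser.feed(html)
    parser.close()
    return parser.matrix


print(parse_rows(SAMPLE))
```

One caveat with this approach: an `&nbsp;` that appears inside a cell that also holds text would add a stray -1 to the row, so it only fits tables where the entity is the sole content of an empty cell.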

As an example, here is some of the code I'm using, along with partial output:

#!/usr/local/bin/python
# Python 2 script: HTMLParser, string.lstrip and urllib.urlopen are Python 2 APIs.
import htmllib,os,string,urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
    def __init__(self):
        self.in_td = 0      # 1 while inside a <td> element
        self.in_tr = 0      # 1 while inside a <tr> element
        self.matrix = []    # finished rows
        self.row = []       # cells of the row being read
        self.reset()

    def handle_starttag(self,tag,attrs):
        if tag == "td":
            self.in_td = 1
        elif tag == "tr":
            self.in_tr = 1

    def handle_data(self,data):
        if self.in_td == 1:
            data = string.lstrip(data)
            # Empty cells are skipped here, which makes the rows sparse.
            if data != "":
                self.row.append(data)

    def handle_endtag(self,tag):
        if tag == "td":
            self.in_td = 0
        elif tag == "tr":
            self.in_tr = 0
            if self.row != []:
                self.matrix.append(self.row)
                self.row=[]

parser = foo()
socket = urllib.urlopen("http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
parser.feed(socket.read())
socket.close()
parser.close()
for row in parser.matrix:
    print row

A partial output of the above code is:
['5:12 C', '5:52 W']
['5:34 C']
['5:50 P']
['6:01 P', '6:10 G', '6:09 S', '6:59 U']
['6:10 P', '6:26 G', '6:23 C']
['6:23 P', '6:42 G', '6:35 W']
['6:34 P', '6:54 G', '6:47 S']
['6:46 P', '6:59 C']
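
To illustrate why equal-length rows matter, the sketch below (Python 3) pulls columns out of a padded list of rows. The positions of the `-1` placeholders are invented for the example; the real gaps in the timetable cannot be recovered from the sparse output above. The helper name `column()` is also just illustrative.

```python
# Hypothetical padded rows: -1 marks cells assumed to be empty.
matrix = [
    ["5:12 C", -1, "5:52 W"],
    [-1, "5:34 C", -1],
    ["6:10 P", "6:26 G", "6:23 C"],
]


def column(rows, index):
    """Return one column of a rectangular list of lists."""
    return [row[index] for row in rows]


# zip(*rows) transposes the whole matrix when every row has equal length.
columns = [list(col) for col in zip(*matrix)]

print(column(matrix, 1))
print(columns[0])
```

With ragged rows, `row[index]` would either raise an IndexError or return a value that belongs to a different column, which is the mismatch described in the question.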

Any tips, suggestions, or comments would be greatly appreciated,

--
Sean
p.s. If I already answered my own question, that's great, but it would be nice
to have this in the group's archive for people with similar problems in the
future.

Solution


"Sean Cody" <sean@-[NOSPAMPLEASE]-tfh.ca> wrote in message
news:kwfob.10197


f7.552358@localhost...

> I could try:
> def handle_entityref(self,entity):
>     if self.in_td == 1:
>         if entity == "nbsp":
>             self.row.append(-1)
>
> But that seems ugly... (comments?).



Does this work? For me, that comes first.

tjr


> > I could try:
> > def handle_entityref(self,entity):
> >     if self.in_td == 1:
> >         if entity == "nbsp":
> >             self.row.append(-1)
> >
> > But that seems ugly... (comments?).
>
> Does this work? For me, that comes first.


Actually, yes, it does.

I wonder if there is a better way, as I'm just stumbling through the
HTMLParser class.
The best thing about Python is that stumbling through getting things done is
not as painful as it would be in other languages.
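
One possible alternative, sketched for Python 3: buffer the text of each cell and decide at `</td>` whether to store the text or a placeholder. This assumes an empty cell holds nothing but whitespace or `&nbsp;`. The names `TableParser`, `EMPTY`, and `parse_table()` are illustrative and not part of the thread.

```python
from html.parser import HTMLParser

EMPTY = -1  # assumed placeholder for a blank cell


class TableParser(HTMLParser):
    """Build a list of rows from <tr>/<td> markup, one entry per cell."""

    def __init__(self):
        # The default convert_charrefs=True turns &nbsp; into "\xa0",
        # which str.strip() treats as whitespace.
        super().__init__()
        self.matrix = []
        self.row = None    # list of cells while inside <tr>, else None
        self.cell = None   # list of text chunks while inside <td>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td":
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            text = "".join(self.cell).strip()
            if self.row is not None:
                # Every cell yields exactly one entry, empty or not.
                self.row.append(text if text else EMPTY)
            self.cell = None
        elif tag == "tr" and self.row is not None:
            if self.row:
                self.matrix.append(self.row)
            self.row = None


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.matrix
```

Because the decision is made per cell rather than per entity, a cell with no content at all also gets a placeholder, and an `&nbsp;` between words in a populated cell does not add a spurious entry.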

I use a lot of member variables. Is there a way to not have to reference
members by self.member? Back in the day in Pascal you could do stuff like
"with self do begin do_stuff(member_variable); end;" which was extremely
useful for large 'records.'
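
As far as I know, Python has no counterpart to Pascal's `with ... do`; attribute access always goes through an explicit object. A common workaround is to bind the attributes to short local names inside a method. The sketch below uses a made-up `Timetable` class to show the idea and its one pitfall.

```python
class Timetable:
    """Illustrative class; not part of the original post."""

    def __init__(self):
        self.matrix = []
        self.row = []

    def add_cell(self, value):
        self.row.append(value)

    def end_row(self):
        # Bind attributes to short local names for the rest of the method.
        row, matrix = self.row, self.matrix
        if row:
            # Mutating through the local name changes the shared list.
            matrix.append(row)
        # Rebinding needs the explicit attribute: "row = []" would only
        # change the local name, not self.row.
        self.row = []


table = Timetable()
table.add_cell("5:12 C")
table.add_cell("5:52 W")
table.end_row()
print(table.matrix)
```

Local aliases work for reading and for mutating objects in place, but any assignment that replaces the attribute still has to be written as `self.name = ...`.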

--
Sean

