Extracting source code from html file using python3.1 urllib.request

Question

I'm trying to obtain data using regular expressions from an HTML file, by implementing the following code:

import re
import urllib.request

def extract_words(wdict, urlname):
    uf = urllib.request.urlopen(urlname)
    text = uf.read()
    print(text)
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)

which returns the error:

File "extract.py", line 33, in extract_words
    match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
File "/usr/lib/python3.1/re.py", line 192, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

Upon experimenting further in IDLE, I noticed that uf.read() does return the HTML source code the first time I invoke it, but from then on it returns an empty bytes object, b''. Is there any way to get around this?

Answer

uf.read() will only read the contents once. Then you have to close it and reopen it to read it again. This is true for any kind of stream. This is however not the problem.

The problem is that reading from any kind of binary source, such as a file or a webpage, will return the data as a bytes type, unless you specify an encoding. But your regexp is not specified as a bytes type, it's specified as a unicode str.

The re module will quite reasonably refuse to use unicode patterns on byte data, and the other way around.
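A minimal sketch of that rule, using a tiny bytes literal in place of the downloaded page:

```python
import re

data = b"<td>word</td>"  # the kind of value uf.read() returns: bytes

# A bytes pattern searching bytes data works as expected.
print(re.findall(rb"<td>(\w+)</td>", data))   # [b'word']

# A str (unicode) pattern on the same bytes data is refused
# with the TypeError shown in the question's traceback.
try:
    re.findall(r"<td>(\w+)</td>", data)
except TypeError as exc:
    print(exc)
```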

The solution is to make the regexp pattern a bytes string, which you do by putting a b in front of it. Hence:

match = re.findall(b"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)

Should work. Another option is to decode the text so it also is a unicode str:

# getparam() is the Python 2 spelling; in Python 3 the header
# charset is read with get_content_charset().
encoding = uf.headers.get_content_charset()
text = text.decode(encoding)
match = re.findall("<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>", text)
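Putting the decode route together: the sketch below is my own arrangement, not the original answer's code. The helper name extract_rows and the "utf-8" fallback (for responses whose headers omit a charset) are assumptions; the row pattern is the one from the question, and the bytes literal stands in for uf.read() so the sketch runs without a network connection.

```python
import re

# Pattern from the question, kept as a str because we decode first.
ROW_PATTERN = r"<tr>\s*<td>([\w\s.;'(),-/]+)</td>\s+<td>([\w\s.,;'()-/]+)</td>\s*</tr>"

def extract_rows(data, charset=None):
    """Decode the raw response bytes, then match with the str pattern."""
    text = data.decode(charset or "utf-8")  # assumed fallback when no charset header
    return re.findall(ROW_PATTERN, text)

# Stand-in for uf.read().
page = b"<tr>\n<td>word</td>\n<td>its meaning</td>\n</tr>"
print(extract_rows(page))   # [('word', 'its meaning')]
```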

(Also, to extract data from HTML, I would say that lxml is a better option).
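lxml itself is a third-party package; purely to illustrate the same structured-parsing idea with the standard library, here is a sketch using html.parser (the RowExtractor class is my own invention, not lxml's API):

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # list of rows, each a list of cell strings
        self._row = None     # cells of the row currently being parsed
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data

parser = RowExtractor()
parser.feed("<table><tr><td>word</td><td>meaning</td></tr></table>")
print(parser.rows)   # [['word', 'meaning']]
```

Unlike the regex, a real parser is not thrown off by attribute noise or whitespace inside the tags.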
