TypeError: expected string or buffer while using regular expression in Python
Question
I wrote this code to remove the tags that match this pattern:
<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>
Code:
import mechanize
import urllib2
from bs4 import BeautifulSoup
import re

med = 'paracetamol'
listiterator = []
listiterator[:] = range(2,16)

br = mechanize.Browser()
br.set_handle_robots(False)
r=br.open("http://www.drugs.com/search-wildcard-phonetic.html")
br.select_form(nr=0)
br.form['searchterm'] = med
br.submit()
url = br.response().geturl()
print url

mainurl = urllib2.urlopen(url).read()
subpages = re.findall("<a href=\"(.*?).html\">[^>]*>", mainurl)
for sub in subpages:
    if sub.startswith("http:"):
        soup = BeautifulSoup(urllib2.urlopen(sub).read())
        m = soup.find_all("div", {"class":"contentBox"})
        head = m[0].find_all(["h2","p"])
        for i in head:
            m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>",i).group()
            if not m:
                print i
        break
I get this error:
m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>",i).group()
File "/usr/lib/python2.7/re.py", line 137, in match
return _compile(pattern, flags).match(string)
TypeError: expected string or buffer
Recommended answer
You get that error because the variable i is of type <class 'bs4.element.Tag'>, and re.match needs a string or buffer. Secondly, if the pattern doesn't match, re.match returns None, so calling .group() on the result raises an AttributeError ('NoneType' object has no attribute 'group').
Here's a quick and dirty "solution" that I don't recommend:
m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>", str(i))
if not m:
    print i
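Both failure modes can be reproduced without any network access. The following is a minimal sketch, not the asker's code: FakeTag is a hypothetical stand-in for bs4.element.Tag (an object that is not a string but renders to markup through str()), and is_see_also is a helper name invented for illustration. It is written to run on both Python 2.7 and Python 3.

```python
import re

PATTERN = r'<p><b>See also:</b> <a href="(.*?)">(.*)</a>(.*)</p>'


class FakeTag(object):
    """Hypothetical stand-in for bs4.element.Tag: not a str, but str() yields markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


def is_see_also(tag):
    """Return True if the tag's markup matches PATTERN (hypothetical helper)."""
    # str() turns the object into text that re.match can accept.
    match = re.match(PATTERN, str(tag))
    # re.match returns None on no match, so test it before using .group().
    return match is not None


see_also = FakeTag('<p><b>See also:</b> <a href="/x.html">X</a> and more</p>')
plain = FakeTag('<p>Plain paragraph</p>')

# Failure 1: passing a non-string object raises TypeError.
try:
    re.match(PATTERN, see_also)
    type_error_raised = False
except TypeError:
    type_error_raised = True

# Failure 2: calling .group() on a failed match raises AttributeError.
try:
    re.match(PATTERN, str(plain)).group()
    attribute_error_raised = False
except AttributeError:
    attribute_error_raised = True

print(type_error_raised)
print(attribute_error_raised)
print(is_see_also(see_also))
print(is_see_also(plain))
```

Checking the match object against None before touching .group() is what keeps the second failure from surfacing once the first one is fixed.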
A better solution would be to rewrite this without trying to parse the HTML yourself, letting BeautifulSoup do its job. For example, instead of your regex pattern, skip the items that contain the text See also: and an anchor tag:
if i.find(text='See also:') and i.find('a'):
    continue
print i
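The same filter can be tried on a small inline document. This is only a sketch under a few assumptions: the HTML below is made up to resemble one subpage, the bs4 package is installed, the built-in "html.parser" is used as the parser, and string= is used in place of text= (it is the newer spelling of the same argument, available from Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for one drugs.com subpage (an assumption).
HTML = """
<div class="contentBox">
<h2>Paracetamol</h2>
<p><b>See also:</b> <a href="/tylenol.html">Tylenol</a> and others</p>
<p>Paracetamol is a pain reliever.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
box = soup.find("div", {"class": "contentBox"})

kept = []
for tag in box.find_all(["h2", "p"]):
    # Skip any item containing the exact text "See also:" plus a link.
    if tag.find(string="See also:") and tag.find("a"):
        continue
    kept.append(tag.get_text())

print(kept)
```

Because the check works on the parsed tree rather than on serialized markup, it does not depend on attribute order or whitespace the way the regex does.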