TypeError: expected string or buffer while using regular expression in python

Problem description

I wrote this code to remove the tags that match this pattern:

<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>

Code:

import mechanize
import urllib2
from bs4 import BeautifulSoup
import re
med = 'paracetamol'
listiterator = []
listiterator[:] = range(2,16)
br = mechanize.Browser()
br.set_handle_robots(False)
r=br.open("http://www.drugs.com/search-wildcard-phonetic.html")
br.select_form(nr=0)
br.form['searchterm'] = med
br.submit()
url = br.response().geturl()
print url
mainurl = urllib2.urlopen(url).read()
subpages = re.findall("<a href=\"(.*?).html\">[^>]*>", mainurl)
for sub in subpages:
    if sub.startswith("http:"):
        soup = BeautifulSoup(urllib2.urlopen(sub).read())
        m = soup.find_all("div", {"class":"contentBox"})
        head = m[0].find_all(["h2","p"])
        for i in head:
            m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>", i).group()
            if not m:
                print i         
        break

I get this error:

m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>",i).group()
  File "/usr/lib/python2.7/re.py", line 137, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer

Recommended answer

You get that error because the variable i is of type <class 'bs4.element.Tag'>, while re.match needs a string or buffer. Second, if the pattern doesn't match, re.match returns None, so calling .group() on the result raises AttributeError ('NoneType' object has no attribute 'group').
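The two failure modes can be reproduced without any network access or BeautifulSoup. The following is only a minimal sketch: NotAString is a hypothetical stand-in for any non-string object such as a bs4 Tag, and error_name is a helper invented here for illustration. On Python 3 the TypeError message is worded differently ("expected string or bytes-like object"), but the exception type is the same.

```python
import re

PATTERN = r'<p><b>See also:</b> <a href="(.*?)">(.*)</a>(.*)</p>'


class NotAString(object):
    """Hypothetical stand-in for any non-string object, such as a bs4 Tag."""


def error_name(func):
    """Call func and return the name of the exception it raises, or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Failure 1: the second argument to re.match is not a string.
first = error_name(lambda: re.match(PATTERN, NotAString()))

# Failure 2: the argument is a string, but the pattern does not match,
# so re.match returns None, and None has no .group attribute.
second = error_name(lambda: re.match(PATTERN, "<p>Plain paragraph</p>").group())

print(first)   # TypeError
print(second)  # AttributeError
```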

Here's a quick and dirty "solution" that I don't recommend:

m = re.match("<p><b>See also:</b> <a href=\"(.*?)\">(.*)</a>(.*)</p>", str(i))
if not m:
    print i

A better solution is to stop parsing the HTML yourself and let BeautifulSoup do its job. For example, instead of your regex pattern, skip the items that contain the text See also: and an anchor tag:

if i.find(text='See also:') and i.find('a'):
    continue
print i
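To try the filter in isolation, here is a small self-contained sketch that runs it against an inline HTML fragment instead of the live site. The markup, the variable names, and the choice of the "html.parser" backend are assumptions made for illustration. It uses string=, which is the name newer Beautiful Soup releases (4.4.0 and later) give to the text= argument shown above.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described in the question.
html = """
<div class="contentBox">
<h2>Overview</h2>
<p>Paracetamol is a pain reliever.</p>
<p><b>See also:</b> <a href="/other.html">Other drug</a> information</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", {"class": "contentBox"})

kept = []
for tag in box.find_all(["h2", "p"]):
    # Skip any element that contains the exact text "See also:" and a link.
    if tag.find(string="See also:") and tag.find("a"):
        continue
    kept.append(tag.get_text())

print(kept)
```

Because the check works on the parsed tree rather than on serialized markup, it does not depend on attribute order or whitespace inside the tags, which the regex version does.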
