lxml incorrectly parsing the Doctype while looking for links


Question



I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys is now getting a:

TypeError: string indices must be integers.

I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser.

The problem function is:

def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.

    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                        open(path).read(),
                        parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])

                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)

    return urlslist
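For context, the regex in the second branch can be exercised on its own. A minimal sketch, where the template value and URL are made-up examples rather than the real template contents:

```python
import re

# Hypothetical template expression of the kind the '{{' branch handles;
# the regex pulls single-quoted http:// URLs out of the expression.
template_value = "{{ link('http://example.com/page') }}"
urls = re.findall("'(http://(?:.*?))'", template_value)
print(urls)  # ['http://example.com/page']
```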

So for this one guy, the line if link["href"].startswith('http://'): gives the TypeError, because BS4 thinks the HTML Doctype is a link.

Can anyone explain what the problem here might be, because nobody else can recreate it?

I can't see how this could happen when using SoupStrainer like this. I assume it's somehow related to a system setup issue.

I can't see anything particularly special about our Doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">

<head>

Solution

SoupStrainer will not filter out the document type; it filters which elements remain in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over all elements in the document, so the first element you encounter is the Doctype object.
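The reported error message follows directly from this: a bs4 Doctype node is a subclass of Python's string type, so subscripting it with "href" is plain string indexing with a str key. A minimal stdlib sketch of the failure mode (the doctype text below is illustrative, not taken from the asker's templates):

```python
# A Doctype node behaves like a plain string, so link["href"] becomes
# string indexing with a str key, which raises the reported TypeError.
doctype_like = 'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"'
try:
    doctype_like["href"]
except TypeError as exc:
    print(exc)  # e.g. "string indices must be integers"
```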

Use .find_all() on the 'strained' document:

document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):

or filter out the DocType object:

from bs4.element import Doctype

for link in BeautifulSoup(
        open(path).read(),
        parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype): continue 
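As a stdlib-only illustration (not the asker's bs4 pipeline) that the doctype really is a node separate from any tag, Python's own html.parser reports it through a distinct handle_decl event, while target="_blank" anchors arrive as start tags:

```python
from html.parser import HTMLParser

# Illustrative sketch: collect doctype declarations and target="_blank"
# hrefs separately, showing the doctype is not an element at all.
class DoctypeSpy(HTMLParser):
    def __init__(self):
        super().__init__()
        self.decls = []
        self.links = []

    def handle_decl(self, decl):
        self.decls.append(decl)  # e.g. 'DOCTYPE html'

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("target") == "_blank" and "href" in attrs:
            self.links.append(attrs["href"])

spy = DoctypeSpy()
spy.feed('<!DOCTYPE html><html><body>'
         '<a href="http://example.com/" target="_blank">x</a>'
         '</body></html>')
print(spy.decls)   # ['DOCTYPE html']
print(spy.links)   # ['http://example.com/']
```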
