lxml incorrectly parsing the Doctype while looking for links
Question
I've got a BeautifulSoup4 (4.2.1) parser which collects all href attributes from our template files, and until now it has worked perfectly. But with lxml installed, one of our guys is now getting a:

TypeError: string indices must be integers

I managed to replicate this on my Linux Mint VM, and the only difference appears to be lxml, so I assume the issue occurs when bs4 uses that HTML parser.

The problem function is:
import os
import re

from bs4 import BeautifulSoup, SoupStrainer
# Path is assumed here to be a str-like path class (e.g. unipath.Path);
# a pathlib.Path would need str(path).endswith(".html") instead.

def collecttemplateurls(templatedir, urlslist):
    """
    Uses BeautifulSoup to extract all the external URLs from the templates dir.
    @return: list of URLs
    """
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(
                    open(path).read(),
                    parse_only=SoupStrainer(target="_blank")
                ):
                    if link["href"].startswith('http://'):
                        urlslist.append(link['href'])
                    elif link["href"].startswith('{{'):
                        for l in re.findall("'(http://(?:.*?))'", link["href"]):
                            urlslist.append(l)
    return urlslist
So for this one guy, the line if link["href"].startswith('http://') raises the TypeError, because BS4 treats the HTML Doctype as a link.

Can anyone explain what the problem might be, given that nobody else can recreate it? I can't see how this could happen when using SoupStrainer like this. I assume it's somehow related to a system setup issue.

I can't see anything particularly special about our Doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
<head>
SoupStrainer will not filter out the document type; it filters which elements remain in the document, but the doctype is retained because it is part of the 'container' for the filtered elements. You are looping over all elements in the document, so the first element you encounter is the Doctype object.

Use .find_all() on the 'strained' document:
document = BeautifulSoup(open(path).read(), parse_only=SoupStrainer(target="_blank"))
for link in document.find_all(target="_blank"):
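The .find_all() approach can be sketched as a minimal, self-contained example; the markup, URLs, and variable names below are made up for illustration, and bs4's bundled html.parser is used so the sketch doesn't depend on lxml being installed:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, standing in for a template file on disk.
html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body>
<a href="http://example.com/a" target="_blank">external</a>
<a href="/local" target="_blank">internal</a>
<a href="http://example.com/b">no target</a>
</body></html>"""

document = BeautifulSoup(html, 'html.parser',
                         parse_only=SoupStrainer(target="_blank"))

# .find_all() only ever returns matching Tag objects, so the doctype
# (however a given parser handles it) cannot leak into the loop and
# link["href"] is always safe here.
urls = [link["href"] for link in document.find_all(target="_blank")
        if link["href"].startswith("http://")]
print(urls)  # -> ['http://example.com/a']
```

The key point is that iterating the soup directly walks every parsed node, while find_all() restricts the loop to tags that matched the filter.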
or filter out the Doctype object:

from bs4 import Doctype
for link in BeautifulSoup(
    open(path).read(),
    parse_only=SoupStrainer(target="_blank")
):
    if isinstance(link, Doctype):
        continue
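As a quick sanity check of the isinstance approach, here is a hedged sketch using bs4's bundled html.parser (the markup and URLs are invented; exactly which non-tag nodes appear first can vary by parser, which is why the question only reproduced with lxml):

```python
from bs4 import BeautifulSoup, Doctype

# Without a SoupStrainer, the doctype is kept as the first node of the tree.
soup = BeautifulSoup(
    '<!DOCTYPE html><a href="http://example.com/a" target="_blank">x</a>',
    'html.parser')

# Iterating over the top level yields the Doctype first, just like the
# failing loop in the question; skipping it makes link["href"] safe.
hrefs = []
for el in soup.children:
    if isinstance(el, Doctype):
        continue
    hrefs.append(el["href"])
print(hrefs)  # -> ['http://example.com/a']
```

The isinstance guard simply discards the Doctype node before any subscripting happens, so the loop only indexes real tags.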