BS4 breaks HTML while trying to repair it
Problem description
BS4 (Beautiful Soup 4) corrects faulty HTML. Usually this is not a problem. I tried parsing, altering, and saving the HTML of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html. In this case the repair changes how the page is rendered: afterwards, many lines of the page are no longer centered but left-aligned. Since I have to work with the broken HTML of that page, I cannot simply fix the HTML code myself.
How can I prevent bs4 from repairing the HTML, or somehow fix the "correction"?
(This minimal example only shows bs4 repairing broken HTML; I couldn't create a minimal example where bs4 repairs it incorrectly, as it does with the page mentioned above.)
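#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString

# Deliberately broken HTML: <center> is opened twice and never closed.
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

def is_string_only(t):
    # True only for plain text nodes, not for subclasses such as Comment.
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'
print(str(soup))  # the printed markup includes tags added by the repair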
Recommended answer
Try the simplified_scrapy library.
from simplified_scrapy import SimplifiedDoc
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)
More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
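If adding a third-party dependency is not an option, one possible alternative is to avoid building a repaired tree at all and instead re-emit the markup token by token with the standard library's html.parser.HTMLParser. The sketch below is only an illustration under stated assumptions: the names PassThroughParser, rewrite, and edit_text are hypothetical and come from neither bs4 nor simplified_scrapy, only text nodes can be edited, and only the common token types are handled (marked sections such as CDATA are not re-emitted). End tags are written back in lowercase, and an entity reference missing its semicolon would gain one, so the round trip is not guaranteed to be byte-exact for every page.

```python
from html.parser import HTMLParser


class PassThroughParser(HTMLParser):
    """Re-emit each token as it was read; never add or close tags."""

    def __init__(self, edit_text=None):
        # convert_charrefs=False keeps entities such as &amp; as separate tokens.
        super().__init__(convert_charrefs=False)
        # Hypothetical hook: a callable applied to every text node.
        self.edit_text = edit_text or (lambda text: text)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so end tags are normalized.
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(self.edit_text(data))

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.parts.append(f"<?{data}>")


def rewrite(html, edit_text=None):
    """Return html with edit_text applied to text nodes, tags left untouched."""
    parser = PassThroughParser(edit_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


broken_html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

unchanged = rewrite(broken_html)
edited = rewrite(broken_html, lambda text: text.replace("Test", "Demo"))
print(edited)
```

Because nothing is ever inserted, the two unclosed <center> tags stay exactly as they are, so the browser keeps rendering the page the way it did before. The trade-off is that there is no tree to query: any change has to be expressed as a transformation of individual tokens.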