BS4 breaks HTML trying to repair it

Views: 42; Posted: 2021/4/15 19:15:57; Tags: python, html, parsing, web-scraping, beautifulsoup

Question

BS4 corrects faulty HTML. Usually this is not a problem. I tried parsing, altering, and saving the HTML of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html. In this case the repair changes how the page renders: afterwards, many lines of the page are no longer centered but left-aligned. Since I have to work with the broken HTML of that page, I cannot simply repair the HTML code.

How can I prevent bs4 from repairing the html or fix the "correction" somehow?

(This minimal example just shows bs4 repairing broken HTML; I couldn't create a minimal example where bs4 repairs it incorrectly, as it does with the page mentioned above.)

#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString


html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'

print(str(soup))

Recommended answer

Try this library.

from simplified_scrapy import SimplifiedDoc

html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)
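
If adding a dependency is not an option, another way to avoid the repair step is to skip tree building entirely. The sketch below is not part of the original answer. It uses the standard-library `html.parser.HTMLParser` as a plain tokenizer and writes every token back out as it was read, so unclosed tags stay unclosed. The names `PassThroughRewriter`, `rewrite`, and `text_filter` are hypothetical, chosen only for this illustration. It is not byte-exact for every document: end tags are re-emitted in lowercase, and unusual constructs such as CDATA sections are not handled.

```python
from html.parser import HTMLParser


class PassThroughRewriter(HTMLParser):
    """Hypothetical helper: re-emits each token as read, without building a tree."""

    def __init__(self, text_filter=None):
        # convert_charrefs=False keeps entities such as &amp; as separate tokens
        super().__init__(convert_charrefs=False)
        self.text_filter = text_filter or (lambda text: text)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> are also copied as written
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # the parser lowercases tag names, so </CENTER> comes back as </center>
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # the only place where content is altered
        self.out.append(self.text_filter(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def rewrite(source, text_filter=None):
    """Return source with text_filter applied to text nodes and markup left alone."""
    parser = PassThroughRewriter(text_filter)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

# The second <center> is never closed in the input and is not closed in the output.
print(rewrite(html, lambda text: text.replace("Test", "Demo")))
```

Because nothing is inserted or reordered, the browser's rendering of the broken markup should stay as it was. The trade-off is that there is no tree to navigate, so this only suits edits that can be made token by token. The filter also runs on text inside `<script>` and `<style>` elements.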
