BS4 breaks HTML trying to repair it

Question

BS4 (Beautiful Soup 4) corrects faulty HTML. Usually this is not a problem. I tried parsing, altering and saving the HTML of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html. In this case the repair changes how the page renders: afterwards, many lines of the page are no longer centered but left-aligned. Since I have to work with the broken HTML of that page as it is, I cannot simply fix the HTML source myself.

How can I prevent bs4 from repairing the HTML, or otherwise work around the "correction"?

(This minimal example only shows bs4 repairing broken HTML; I couldn't create a minimal example in which bs4 repairs it in the wrong way, as it does with the page mentioned above.)

#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString


html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'

print(str(soup))

Recommended answer

Try using the simplified_scrapy library.

from simplified_scrapy import SimplifiedDoc

html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)

More examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
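
If adding another dependency is not an option, a different way to avoid the repair step is to skip tree building altogether. The following is a minimal sketch using only the standard library's `html.parser.HTMLParser`. It is not part of the original answer. The names `PassThroughRewriter`, `rewrite` and `text_hook` are hypothetical and introduced here for illustration. The idea is to re-emit each token as it is read, so no tags are added, closed or moved. Known limits of this sketch: end tags are written back in lowercase, entity references missing their trailing semicolon gain one, and `text_hook` is also applied to the contents of `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class PassThroughRewriter(HTMLParser):
    """Re-emit HTML token by token without adding or closing any tags."""

    def __init__(self, text_hook=None):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.out = []
        # text_hook (hypothetical name) may alter text nodes; default: no change.
        self.text_hook = text_hook or (lambda s: s)

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are also copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only end tags present in the source are emitted (in lowercase).
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(self.text_hook(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def rewrite(html, text_hook=None):
    """Return html with text nodes passed through text_hook, structure untouched."""
    parser = PassThroughRewriter(text_hook)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''

# Change one word in the text; the unclosed <center> tags stay unclosed.
print(rewrite(html, lambda s: s.replace("Test", "Demo")))
```

This trades bs4's convenient tree navigation for fidelity to the source, so it fits edits that can be expressed per tag or per text node rather than queries over the document structure.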
