How to prettify HTML so tag attributes will remain in one single line?

Problem description

I have this little piece of code:

text = """<html><head></head><body>
    <h1 style="
    text-align: center;
">Main site</h1>
    <div>
        <p style="
    color: blue;
    text-align: center;
">text1
        </p>
        <p style="
    color: blueviolet;
    text-align: center;
">text2
        </p>
    </div>
    <div>
        <p style="text-align:center">
            <img src="./foo/test.jpg" alt="Testing static images" style="
">
        </p>
    </div>
</body></html>
"""

import sys
import re
import bs4


def prettify(soup, indent_width=4):
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())

soup = bs4.BeautifulSoup(text, "html.parser")
print(prettify(soup))

The output of the above snippet right now is:

<html>
    <head>
    </head>
    <body>
        <h1 style="
                text-align: center;
">
            Main site
        </h1>
        <div>
            <p style="
                color: blue;
                text-align: center;
">
                text1
            </p>
            <p style="
                color: blueviolet;
                text-align: center;
">
                text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img alt="Testing static images" src="./foo/test.jpg" style="
"/>
            </p>
        </div>
    </body>
</html>

I'd like to figure out how to format the output so it becomes this instead:

<html>
    <head>
    </head>
    <body>
        <h1 style="text-align: center;">
            Main site
        </h1>
        <div>
            <p style="color: blue;text-align: center;">
                text1
            </p>
            <p style="color: blueviolet;text-align: center;">
                text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img alt="Testing static images" src="./foo/test.jpg" style=""/>
            </p>
        </div>
    </body>
</html>

In other words, I'd like to keep HTML tags such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> on one single line if possible. When I say "if possible" I mean without altering the values of the attributes themselves (value1, value2, ..., valuen).

Is this possible to achieve with beautifulsoup4? As far as I can tell from the docs, you can use a custom formatter, but I don't know how to write one that meets the requirements described above.
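
For reference, the prettify helper in the question relies on BeautifulSoup's prettify() indenting by one space per nesting level: the regex captures each line's leading whitespace and repeats it indent_width times. A minimal stdlib-only sketch of that step, where the name reindent and the hand-written sample string are assumptions made for illustration and stand in for real soup.prettify() output:

```python
import re


def reindent(pretty_html, indent_width=4):
    # Capture the leading whitespace of every line and repeat it
    # indent_width times. Assumes one space per nesting level and
    # no blank lines in the input.
    leading_ws = re.compile(r'^(\s*)', re.MULTILINE)
    return leading_ws.sub(r'\1' * indent_width, pretty_html)


# Hand-written stand-in for soup.prettify() output (one space per level).
sample = "<div>\n <p>\n  text1\n </p>\n</div>"
print(reindent(sample))
```

Because \s also matches newlines, this pattern is only safe when the text has no blank lines; it widens the indentation but does nothing about line breaks inside attribute values.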

EDIT:

@alecxe's solution is quite simple, but unfortunately it fails in some more complex cases, such as the one below:

test1 = """
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
    <div id="sessionsGrid" data-columns="[
        { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
        { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
        { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
        { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
        { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
        { field: 'note', title:'Note'}
        ]">
</div>
</div>
"""

from bs4 import BeautifulSoup
import re


def prettify(soup, indent_width=4, single_lines=True):
    if single_lines:
        for tag in soup():
            for attr in tag.attrs:
                print(tag.attrs[attr], tag.attrs[attr].__class__)
                tag.attrs[attr] = " ".join(
                    tag.attrs[attr].replace("\n", " ").split())

    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())


def html_beautify(text):
    soup = BeautifulSoup(text, "html.parser")
    return prettify(soup)

print(html_beautify(test1))

TRACEBACK:

dialer-capmaign-console <class 'str'>
['fill-vertically'] <class 'list'>
Traceback (most recent call last):
  File "d:\mcve\x.py", line 35, in <module>
    print(html_beautify(test1))
  File "d:\mcve\x.py", line 33, in html_beautify
    return prettify(soup)
  File "d:\mcve\x.py", line 25, in prettify
    tag.attrs[attr].replace("\n", " ").split())
AttributeError: 'list' object has no attribute 'replace'

Solution

BeautifulSoup tries to preserve the newlines and multiple spaces found in the attribute values of the input HTML.

One workaround here would be to iterate over the element attributes and clean them up prior to prettifying - removing the newlines and replacing multiple consecutive spaces with a single space:

for tag in soup():
    for attr in tag.attrs:
        tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())

print(soup.prettify())

Prints:

<html>
 <head>
 </head>
 <body>
  <h1 style="text-align: center;">
   Main site
  </h1>
  <div>
   <p style="color: blue; text-align: center;">
    text1
   </p>
   <p style="color: blueviolet; text-align: center;">
    text2
   </p>
  </div>
  <div>
   <p style="text-align:center">
    <img alt="Testing static images" src="./foo/test.jpg" style=""/>
   </p>
  </div>
 </body>
</html>
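
The clean-up expression can be tried on a plain string without BeautifulSoup. The sketch below is stdlib-only; the helper name collapse_whitespace and the sample value are assumptions made for illustration. It also shows that the .replace("\n", " ") call is not strictly needed, since str.split() with no arguments already splits on any run of whitespace, newlines included:

```python
def collapse_whitespace(value):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace,
    # so joining the pieces with one space yields a single-line value.
    return " ".join(value.split())


# A multi-line style value like the ones in the question's HTML.
style = "\n    color: blue;\n    text-align: center;\n"
print(repr(collapse_whitespace(style)))
```

Note that this also collapses whitespace that may be significant inside a value, so it changes the attribute text, not just its layout.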


Update (to address the multi-valued attributes like class):

You just need a slight modification that adds special handling for the case where an attribute value is a list:

for tag in soup():
    tag.attrs = {
        attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value] 
              if isinstance(value, list)
              else " ".join(value.replace("\n", " ").split())
        for attr, value in tag.attrs.items()
    }
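
The same dispatch can be exercised on an ordinary dict, since tag.attrs is a dict whose values are either strings or, for multi-valued attributes such as class, lists of strings. This is a minimal stdlib-only sketch; the function name normalize_attrs and the sample dict are assumptions made for illustration, not part of BeautifulSoup:

```python
def normalize_attrs(attrs):
    """Return a copy of an attribute dict with whitespace collapsed in every value."""

    def collapse(text):
        # Split on any whitespace run and rejoin with single spaces.
        return " ".join(text.split())

    return {
        name: [collapse(item) for item in value]
        if isinstance(value, list)  # multi-valued attribute, e.g. class
        else collapse(value)
        for name, value in attrs.items()
    }


# Hand-written stand-in for tag.attrs, shaped like the failing example.
attrs = {
    "id": "dialer-capmaign-console",
    "class": ["fill-vertically"],
    "data-columns": "[\n        { field: 'note', title:'Note'}\n        ]",
}
print(normalize_attrs(attrs))
```

With BeautifulSoup, this would presumably be applied as tag.attrs = normalize_attrs(tag.attrs) for each tag in soup() before calling prettify().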
