Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

Question

I'm trying to write a program that takes an HTML file and makes it more email-friendly. Right now all of the conversion is done by hand, because none of the online converters does exactly what we need.

This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful, so I offered to write a program in my spare time to help automate the process.

I don't know much about HTML or CSS, so I'm mostly relying on my brother (who does) to describe what changes this program needs to make. Please bear with me if I ask a stupid question; this is totally new territory for me.

Most of the changes are pretty basic: if you see tag/attribute X, convert it to tag/attribute Y. But I've run into trouble with HTML tags that contain a style attribute. For example:

<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />

Whenever possible, I want to convert the CSS properties in the style attribute into HTML attributes (or convert the style attribute into something more email-friendly). After the conversion, the tag should look like this:

<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>

I realize that not every CSS property has an HTML equivalent, so for now I only want to handle the ones that do. I whipped up a Python script that does this conversion:

from bs4 import BeautifulSoup
import re

class Styler(object):

    img_attributes = {'float' : 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

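    def handle_image(self):
        # Find every <img> tag that has a non-empty style attribute
        tags = self.soup.find_all("img", style=re.compile('.'))
        print(tags)
        for tag in tags:
            old_attributes = tag['style']
            # Split on ':' and ';' and drop the 'px' unit, e.g.
            # "width:150px;height:50px" -> ['width', '150', 'height', '50']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag['style']
            print(tokens)

            # Tokens alternate between a property name and its value
            for j in range(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]

                tag[tokens[j]] = tokens[j+1]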

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
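    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')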
    s = Styler(soup)
    s.format_factory()

This script handles my particular example just fine, but it isn't very robust, and I realize it will break easily on real-world input. How can I make it more robust? As far as I can tell, Beautiful Soup has no way to change or extract the individual pieces of a style attribute, and I guess that is what I'm looking to do.

Recommended answer

For this type of thing, I'd recommend combining an HTML parser (like BeautifulSoup or lxml) with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than you would trying to come up with regular expressions that match any CSS you might find in the wild.
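To see why hand-rolled regular expressions get fragile, here is a minimal sketch that applies the same split pattern as the question's script to a slightly more realistic style string. The helper name naive_tokens and the sample style value are made up for illustration; they are not part of the original script.

```python
import re

def naive_tokens(style):
    """Tokenize a style string with the split pattern from the question's script."""
    return [s for s in re.split(r'[:;]+|px', style) if s]

# Hypothetical style value: spaces after the separators and a URL containing a colon
style = "width: 150px; background: url(http://example.com/bg.png)"
tokens = naive_tokens(style)
print(tokens)

# The colon in "http:" is treated as a separator, so the tokens no longer
# alternate cleanly between property name and value.
print(len(tokens) % 2 == 0)
```

With this input the split yields five tokens instead of an even number, and the values keep their leading spaces, so the name/value pairing loop in the question's script would run off the end of the list.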

For example, with cssutils:

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
'150px'
>>> s.height
'50px'
>>> s.keys()
['width', 'height', 'float']
>>> s.cssText
'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
'height: 50px;\nfloat: right'

So, using this, you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that show up in the cssText attribute, though. I think cssutils is designed more for formatting standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
