Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

Question

I'm trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually because none of the online converters do exactly what we need.

This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful so I offered to try to write a program in my spare time to help make the process more automated.

I don't know much about HTML or CSS so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make, so please bear with me if I ask a stupid question. This is totally new territory for me.

Most of the changes are pretty basic -- if you see tag/attribute X then convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag containing a style attribute. For example:

<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />

Whenever possible I want to convert the style attributes into HTML attributes (or convert the style attribute to something more email friendly). So after the conversion it should look like this:

<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>

Now I realize that not all CSS style attributes have an HTML equivalent, so right now I only want to focus on the ones that do. I whipped up a Python script that would do this conversion:

from bs4 import BeautifulSoup
import re

class Styler(object):

    img_attributes = {'float' : 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        # Find every <img> tag that has a non-empty style attribute.
        tags = self.soup.find_all("img", style=re.compile('.'))
        print(tags)
        for tag in tags:
            old_attributes = tag['style']
            # Split on ':' and ';' and drop the 'px' unit, discarding empty strings.
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag['style']
            print(tokens)

            # Tokens alternate between a property name and its value.
            for j in range(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]

                tag[tokens[j]] = tokens[j+1]

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    soup = BeautifulSoup(html, "html.parser")
    s = Styler(soup)
    s.format_factory()

Now this script will handle my particular example just fine, but it's not very robust and I realize that when put up against real world examples it will easily break. My question is, how can I make this more robust? As far as I can tell Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute. I guess that's what I'm looking to do.

Recommended answer

For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.

For example:

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'

So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
