Python如何搜索和纠正HTML标签和属性? [英] Python how to search and correct html tags and attributes?

查看:82
本文介绍了Python如何搜索和纠正HTML标签和属性?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须修正< img> 标记的所有结束标记,如下面的文字所示。它不应使用> 关闭< img> ,而应该以 / >

I have to fix all the closing tags of the <img> tag as shown in the text below. Instead of closing the <img> with a >, it should close with />.

有没有简单的方法可以搜寻所有< img> 在本文中修正>

Is there any easy way to search for all the <img> in this text and fix the > ?

(如果用<$ c $关闭

(If it is closed with a /> already then there is no action required).

其他问题,如果没有宽度或高度到指定的< img> ,解决问题的最佳方法是什么?

Other question, if there is no "width" or "height" to the <img> specified, what is the best way to solve the issue?

下载所有图像并获取宽度和高度的相应属性,然后将它们添加回字符串中?

Download all the images and get the corresponding attributes of width and height, then add them back to the string?

正确的< img> 标记是一个以 />< / code>结束并且有效宽度& ;身高。

The correct <img> tag is the one that closes with /> and have the valid width & height.

<a href="http://www.cultofmac.com/daily-deals749-mac-mini-1199-3-0ghz-imac-new-mac-pros/52674"><img align="left" hspace="5" width="150" src="http://s3.dlnws.com/images/products/images/749000/749208-large" alt="" title=""></a>
Apple today unleashed a number of goodies, including giving iMacs and Mac Pros more oomph with new processors and increased storage options. We have those deals today, along with many more items for the Mac lover. Along with the refreshed line of iMacs and Mac Pros, we’ll also look at a number of software deals [...]
<p><a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/da"><img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/0/di" border="0" ismap></a><br>
<a href="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/da"><img src="http://feedads.g.doubleclick.net/~a/DL_-gOGSR1JMzKDbErt1EG3re3I/1/di" border="0" ismap></a></p><img src="http://feeds.feedburner.com/~r/cultofmac/bFow/~4/Mq5iLOaT50k" height="1" width="1">

我真的需要 width height ,因为它将用作其他解析器的输入。该解析器表示< img 标签必须以 /> 关闭。我没有使用输出在网页上查看。请提出一个简单的解决方案来实现这个目的!

I really need to have width and height in the output, because it will be used as the input to other parser. And that parser says that the <img tag MUST close with a />. I am not using the output to view on the web page. Please suggest a simple solution to achieve this!

推荐答案

为了简单起见,我会将解析潜在的刺激性问题X)HTML到一个专用的库:

For the sake of simplicity, I would outsource the potentially irritating issues around parsing (X)HTML to a dedicated library:

下面是一个简单的例子: lxml.html

Here is a simple example with lxml.html:

import lxml.html

page = """<html>...</html>"""
page = lxml.html.document_fromstring(page)
lxml.html.tostring(page)

lxml.html 有一个非常方便的模块 clean ,旨在清除恶意代码。这很简单:

lxml.html has a really handy module clean, designed to remove malicious code. It's simple as well:

from lxml.html.clean import clean_html
clean_html(page)

这篇关于Python如何搜索和纠正HTML标签和属性?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆