Regular Expressions to parse template tags in XML
Question
I need to parse some XML to pull out embedded template tags for further parsing. I can't seem to bend Python's regular expressions to do what I want, though.
In English: when a template tag is contained anywhere in the row, remove all the XML for that specific row and leave only the template tag in its place.
I put together a test case to demonstrate. Here's the original XML:
<!-- regex_trial.xml -->
<w:tbl>
<w:tr>
<w:tc><w:t>Header 1</w:t></w:tc>
<w:tc><w:t>Header 2</w:t></w:tc>
<w:tc><w:t>Header 3</w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{% for i in items %}</w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{{ i.field1 }}</w:t></w:tc>
<w:tc><w:t>{{ i.field2 }}</w:t></w:tc>
<w:tc><w:t>{{ i.field3 }}</w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{% endfor %}</w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
</w:tr>
</w:tbl>
Here's the desired result:
<!-- regex_desired_result.xml -->
<w:tbl>
<w:tr>
<w:tc><w:t>Header 1</w:t></w:tc>
<w:tc><w:t>Header 2</w:t></w:tc>
<w:tc><w:t>Header 3</w:t></w:tc>
</w:tr>
{% for i in items %}
<w:tr>
<w:tc><w:t>{{ i.field1 }}</w:t></w:tc>
<w:tc><w:t>{{ i.field2 }}</w:t></w:tc>
<w:tc><w:t>{{ i.field3 }}</w:t></w:tc>
</w:tr>
{% endfor %}
</w:tbl>
Here is some Python code I am using to test:
#!/usr/bin/env python
import re
f = open( 'regex_trial.xml', 'r' )
orig_xml = f.read()
f.close()
p = re.compile( '<w:tr.*?(?P<tag>{%.*?%}).*?</w:tr>', re.DOTALL )
new_xml = p.sub( '\g<tag>', orig_xml, 0 )
print new_xml
The actual result of this regex is:
<!-- regex_trial.xml -->
<w:tbl>
{% for i in items %}
{% endfor %}
</w:tbl>
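One way to read this result: `.*?` is lazy, but with `re.DOTALL` nothing stops it from crossing a `</w:tr>` boundary, so a match that starts at the header row's `<w:tr` keeps going until it reaches the first `{% ... %}` in a later row. The sketch below is not from the original post. It assumes Python 3 and rows that are never nested, and it swaps in a tempered pattern, `(?:(?!</w:tr>).)*?`, so that each match stays inside a single row. The sample input is abbreviated from the test case above.

```python
import re

# Abbreviated version of the sample input from the question.
ORIG_XML = """<w:tbl>
<w:tr>
<w:tc><w:t>Header 1</w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{% for i in items %}</w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{{ i.field1 }}</w:t></w:tc>
</w:tr>
<w:tr>
<w:tc><w:t>{% endfor %}</w:t></w:tc>
<w:tc><w:t></w:t></w:tc>
</w:tr>
</w:tbl>
"""

# Consume characters lazily, but refuse to step over a closing </w:tr>,
# so a match cannot start in one row and finish in another.
INSIDE_ROW = r"(?:(?!</w:tr>).)*?"

ROW_WITH_TAG = re.compile(
    r"<w:tr\b" + INSIDE_ROW + r"(?P<tag>\{%.*?%\})" + INSIDE_ROW + r"</w:tr>",
    re.DOTALL,
)

# Replace every row that contains a {% ... %} tag with the tag alone.
new_xml = ROW_WITH_TAG.sub(r"\g<tag>", ORIG_XML)
print(new_xml)
```

With this pattern the header row and the `{{ ... }}` row are left alone, because neither contains `{%` before its own `</w:tr>`. It is still a regex over XML, so it inherits the usual fragility (comments, CDATA, nested tables).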
Any help is greatly appreciated! If we can figure this out, we will be able to generate MS Word docx files on the fly from Django-powered sites. Thanks!
Update: here is the final code I used:
from xml.etree import ElementTree
import cStringIO as StringIO
TEMPLATE_TAG = 'template_text'
tree = ElementTree.parse( 'regex_trial.xml' )
rows = tree.getiterator('tr')
for row in rows:
    for cell in row.getiterator('t'):
        if cell.text and cell.text.find( '{%' ) >= 0:
            template_tag = cell.text
            row.clear()
            row.tag = TEMPLATE_TAG
            row.text = template_tag
            break
output = StringIO.StringIO()
tree.write( output )
xml = output.getvalue()
xml = xml.replace('<%s>' % TEMPLATE_TAG, '')
xml = xml.replace('</%s>' % TEMPLATE_TAG, '')
print xml
Thanks for all the help!
Recommended answer
Please don't use regular expressions for this problem.
I'm serious, parsing XML with a regex is hard, and it makes your code 50x less maintainable by anyone else.
lxml is the de facto tool that Pythonistas use to parse XML... take a look at this article on Stack Overflow for sample usage. Or consider this answer, which should have been the accepted answer.
I hacked this up as a quick demo... it searches for <w:tc> elements with non-empty <w:t> children and prints "Good" next to each element.
import lxml.etree as ET
from lxml.etree import XMLParser
def worthy(elem):
    for child in elem.iterchildren():
        if (child.tag == 't') and (child.text is not None):
            return True
    return False
def dump(elem):
    for child in elem.iterchildren():
        print "Good", child.tag, child.text
parser = XMLParser(ns_clean=True, recover=True)
etree = ET.parse('regex_trial.xml', parser)
for thing in etree.findall("//"):
    if thing.tag == 'tc' and worthy(thing):
        dump(thing)
Output:
Good t Header 1
Good t Header 2
Good t Header 3
Good t {% for i in items %}
Good t {{ i.field1 }}
Good t {{ i.field2 }}
Good t {{ i.field3 }}
Good t {% endfor %}
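The demo above only prints the matching cells. As a rough sketch of the full transformation the question asks for, here is a Python 3 version that uses the standard library's `xml.etree.ElementTree` instead of lxml and borrows the placeholder-tag trick from the asker's update. Several things here are assumptions rather than part of the original posts: the fragment declares the `w` prefix (the test file as shown does not, but a real docx `document.xml` does), and the names `collapse_template_rows`, `PLACEHOLDER`, and `SAMPLE` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the w: prefix is bound to in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PLACEHOLDER = "template_text"  # hypothetical tag name, stripped after serializing

SAMPLE = """<w:tbl xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:tr><w:tc><w:t>Header 1</w:t></w:tc></w:tr>
<w:tr><w:tc><w:t>{% for i in items %}</w:t></w:tc><w:tc><w:t></w:t></w:tc></w:tr>
<w:tr><w:tc><w:t>{{ i.field1 }}</w:t></w:tc></w:tr>
<w:tr><w:tc><w:t>{% endfor %}</w:t></w:tc><w:tc><w:t></w:t></w:tc></w:tr>
</w:tbl>"""


def collapse_template_rows(xml_text):
    """Replace each <w:tr> containing a {% ... %} tag with the tag text alone."""
    ET.register_namespace("w", W_NS)  # keep the w: prefix when serializing
    root = ET.fromstring(xml_text)
    # Materialize the row list first, since rows are modified inside the loop.
    for row in list(root.iter(f"{{{W_NS}}}tr")):
        for text_elem in row.iter(f"{{{W_NS}}}t"):
            if text_elem.text and "{%" in text_elem.text:
                tag, tail = text_elem.text, row.tail
                row.clear()  # drops children, attributes, text and tail
                row.tag = PLACEHOLDER
                row.text, row.tail = tag, tail
                break
    out = ET.tostring(root, encoding="unicode")
    # Strip the placeholder wrapper so only the bare template tag remains.
    return out.replace(f"<{PLACEHOLDER}>", "").replace(f"</{PLACEHOLDER}>", "")


result = collapse_template_rows(SAMPLE)
print(result)
```

Renaming the row and stripping the wrapper afterwards avoids having to track parent elements, which the standard-library tree does not expose. The serializer escapes characters such as `<` and `&`, so a template tag containing them would need unescaping afterwards.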