How to parse a .txt file into .xml?

Question

This is my txt file:

In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 3.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size:   Low:    66859   High:   0
Total Process time: 3.516000
Out File Size:  Low:    217268  High:   0

I am trying to parse this into an XML format like this:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <filedata>
        <InFileName>File 1.m1</InFileName>
        <OutFileName>File 1.m2</OutFileName>
        <InFileSize>22636</InFileSize>
        <OutFileSize>77619</OutFileSize>
        <ProcessTime>1.859000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 2.m1</InFileName>
        <OutFileName>File 2.m2</OutFileName>
        <InFileSize>20673</InFileSize>
        <OutFileSize>94485</OutFileSize>
        <ProcessTime>3.094000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 3.m1</InFileName>
        <OutFileName>File 3.m2</OutFileName>
        <InFileSize>66859</InFileSize>
        <OutFileSize>217268</OutFileSize>
        <ProcessTime>3.516000</ProcessTime>
    </filedata>
</root>

Here is the code with which I am trying to achieve that (I am using Python 2):

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size:   Low:
                       |Total Process time:
                       |Out File Size:  Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, uggly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

But I am getting empty values. Is it possible to parse this txt file into XML?

Recommended answer

A correction to your regex: it should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

rather than the pattern you have. Your pattern is compiled with re.VERBOSE, which makes the regex engine ignore unescaped whitespace inside the pattern, so In File Name: is read as InFileName: and never matches a line of the file. That is why the filedata elements come out empty.
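To see the corrected pattern at work, here is a minimal sketch that applies it to one sample line from the question. As an aside, it also checks how re.VERBOSE treats the spaces in a pattern. The shortened second pattern and the variable names are illustrative and not part of the original post.

```python
import re

line = "In File Size:   Low:    22636   High:   0"

# The corrected pattern from the answer, applied without re.VERBOSE.
fixed = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)
m = fixed.search(line)
title = m.group('title')           # 'In File Size:   Low'
value = m.group('value').strip()   # '22636   High:   0'

# Under re.VERBOSE, unescaped whitespace in the pattern is ignored, so this
# shortened pattern is compiled as 'InFileSize:Low:(.*)' and finds nothing.
verbose = re.compile(r'In File Size:   Low: (.*)', re.VERBOSE)
verbose_match = verbose.search(line)  # None
```

Note that the captured value still contains the High field, so some further splitting is needed to isolate 22636.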

Suggestion

You can do it without using a regex. xml.dom.minidom can be used to prettify your XML string.

I've added inline comments for better understanding!

Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.
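As a small illustration of those parameters, the sketch below pretty-prints a one-record document with a two-space indent. The sample XML string is made up for the example.

```python
import xml.dom.minidom as minidom

# A compact, single-line document (illustrative sample data).
compact = "<root><filedata><InFileName>File 1.m1</InFileName></filedata></root>"

doc = minidom.parseString(compact)

# indent is repeated once per nesting level; newl ends every line.
# Without an encoding argument the result is a text string.
pretty = doc.toprettyxml(indent="  ", newl="\n")
print(pretty)
```

Passing encoding='utf-8' as well makes toprettyxml return an encoded byte string and adds the encoding to the XML declaration.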

Edit

import itertools as it
[line[0] for line in it.groupby(lines)]

You can use groupby from the itertools module to collapse consecutive duplicates in the list lines into a single entry.
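A minimal sketch of that behaviour on a made-up list: groupby yields one (key, group) pair per run of equal consecutive items, so keeping only the keys removes repeated blank lines while leaving non-adjacent duplicates alone.

```python
import itertools as it

# Illustrative input: two blank lines in a row, and "b" twice in a row.
lines = ["a", "", "", "b", "b", "", "a"]

# Keep one key per run of identical consecutive items.
collapsed = [key for key, _group in it.groupby(lines)]
print(collapsed)  # ['a', '', 'b', '', 'a']
```

The final "a" survives because it is not adjacent to the first one; groupby only merges neighbours.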

So

import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file, with runs of identical
# consecutive lines (e.g. repeated blank lines) collapsed into one
for line, _ in it.groupby(lines):
    # an empty line marks the break between subelements
    if not line:
        # add the next subelement and use it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ":" to get the tag name
        parts = line.split(":")
        # format the tag name by removing its spaces
        el = ET.SubElement(celldata, parts[0].replace(" ", ""))
        value = ' '.join(parts[1:]).strip()

        # get the file name from the file path
        if 'File Name' in line:
            value = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # split() drops the empty strings between repeated spaces
            splist = line.split()
            value = splist[splist.index('Low:') + 1]
            # the High value would be splist[splist.index('High:') + 1]
        el.text = value

# prettify the XML; with an encoding given, toprettyxml returns bytes
formatted_xml = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()

# display for debugging
print(formatted_xml.decode('utf-8'))

# write the formatted XML to a file (binary mode, since it is bytes)
with open("Performance.xml", "wb") as f:
    f.write(formatted_xml)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>
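The output above names the timing element TotalProcesstime and keeps the input order, whereas the question asked for ProcessTime placed after the two sizes. One possible way to get that layout is sketched below, under the assumption that the labels are exactly those shown in the question. TAG_NAMES, ORDER, parse_records and build_xml are hypothetical names introduced for this example, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the labels in the text file to the desired tags.
TAG_NAMES = {
    "In File Name": "InFileName",
    "Out File Name": "OutFileName",
    "In File Size": "InFileSize",
    "Out File Size": "OutFileSize",
    "Total Process time": "ProcessTime",
}
# Element order requested in the question.
ORDER = ["InFileName", "OutFileName", "InFileSize", "OutFileSize", "ProcessTime"]


def parse_records(text):
    """Return one dict per blank-line-separated record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if it has any data.
            if current:
                records.append(current)
                current = {}
            continue
        label, _, rest = line.partition(":")
        tag = TAG_NAMES.get(label.strip())
        if tag is None:
            continue  # skip lines with an unknown label
        if tag.endswith("FileName"):
            value = rest.split("\\")[-1].strip()      # last path component
        elif tag.endswith("FileSize"):
            fields = rest.split()
            value = fields[fields.index("Low:") + 1]  # number after "Low:"
        else:
            value = rest.strip()
        current[tag] = value
    if current:
        records.append(current)
    return records


def build_xml(records):
    """Build a <root> element with one <filedata> child per record."""
    root = ET.Element("root")
    for record in records:
        filedata = ET.SubElement(root, "filedata")
        for tag in ORDER:
            if tag in record:
                ET.SubElement(filedata, tag).text = record[tag]
    return root


sample = r"""
In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0


In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0
"""

records = parse_records(sample)
root = build_xml(records)
xml_text = ET.tostring(root).decode("ascii")
print(xml_text)
```

Collecting each record into a dict first means that repeated or trailing blank lines do not produce empty filedata elements, and the element order is decided in one place.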

Hope it helps!
