Parse multiple xml files in Python

Problem description

I am stuck with a problem here. I want to parse multiple XML files that share the same structure. I was already able to get the locations of all files and save them into three different lists, since there are three different types of XML structure. Now I want to create three functions (one for each list), each of which loops through its list and parses the information I need. Somehow I am not able to do it. Could anybody here give me a hint how to do it?

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list = []

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team   = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match  = [x for x in all_xml_list if 'Match' in x]

The XML file looks like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

I want to extract everything that is within the "player-metadata" tag, the "player-stats" tag (stats-coverage and so on) and the "player-stats-soccer" tag.

Solution

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
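
The question also asks for the attributes of player-stats and player-stats-soccer. Here is a minimal sketch of how that could look. It is an illustration, not part of the original answer: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, the sample document is a trimmed copy of the XML from the question, and the helper names attrs and player_record are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the XML in the question. It is kept as bytes so
# that the parser does the decoding itself (\xc3\xa4 is "ä" in UTF-8).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" nickname="M\xc3\xa4xchen" />
        </player-metadata>
        <player-stats stats-coverage="standard" minutes-played="90" score="0">
          <player-stats-soccer imp:duels-won="1" imp:passes="32" dfl-distance="5579.80" />
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>
"""


def attrs(element):
    """Return an element's attributes as a plain dict, or {} if it is missing."""
    return dict(element.attrib) if element is not None else {}


def player_record(player):
    """Collect the attribute groups the question asks about (hypothetical helper)."""
    return {
        'metadata': attrs(player.find('./player-metadata')),
        'name': attrs(player.find('./player-metadata/name')),
        'stats': attrs(player.find('./player-stats')),
        'soccer': attrs(player.find('./player-stats/player-stats-soccer')),
    }


root = ET.fromstring(SAMPLE)
records = [player_record(p) for p in root.iter('player')]

print(records[0]['stats']['minutes-played'])
# Namespaced attributes are keyed by "{namespace-uri}local-name".
print(records[0]['soccer']['{url}duels-won'])
```

For real files, ET.parse(filename).getroot() would take the place of ET.fromstring(SAMPLE). Attributes with the imp: prefix come back keyed by the namespace URI in braces, so with the xmlns:imp="url" declaration from the question imp:duels-won is stored as {url}duels-won.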

XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.

<?xml version="1.0" encoding="UTF-8"?>

UTF-8 is a common encoding for XML files (it is also the default), but in reality it can be anything. You cannot predict it, and hard-coding your program to expect a certain encoding is very bad practice.

XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.

This is a good example of doing it wrong:

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

What happens here is this:

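  1. Python opens filename as a text file f, using the platform's default text encoding and not the one declared in the XML
  2. f.read() returns a string that was decoded with that default encoding
  3. etree.XML() parses that string and creates a DOM object tree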

Doesn't sound so wrong, does it? But if the XML is like this:

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

then the DOM you will end up with will be:

Player
    @nickname="MÃ¤xchen"

You have just destroyed the data. And as long as the XML contains no "extended" character like ä, you will not even notice that this approach is borked. This can easily slip into production unnoticed.
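
The damage can be reproduced without any XML at all. This is a small sketch, assuming the file is stored as UTF-8 and the platform default encoding is cp1252, which is common on Windows but differs between systems:

```python
# The bytes of the nickname as they sit in a UTF-8 encoded file.
raw = 'Mäxchen'.encode('utf-8')

# What a text-mode read produces when the default encoding is cp1252
# (an assumption for this example).
wrong = raw.decode('cp1252')

# What the XML declaration actually asked for.
right = raw.decode('utf-8')

print(wrong)  # MÃ¤xchen
print(right)  # Mäxchen
```

The single character ä is two bytes in UTF-8, and each byte is turned into its own character when decoded with the wrong table.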

There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
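
A quick way to convince yourself is to write the same document in two encodings and parse both by file name. This is only a sketch: it uses the standard library's xml.etree.ElementTree.parse as a stand-in for lxml's etree.parse (both accept a file name), and the helper write_sample and the temporary files are made up for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_sample(directory, encoding):
    """Write the same document in the given encoding and return its path."""
    text = ('<?xml version="1.0" encoding="%s"?>\n'
            '<Player nickname="M\u00e4xchen">...</Player>' % encoding)
    path = os.path.join(directory, 'player_%s.xml' % encoding)
    # Binary mode: we control the bytes on disk ourselves.
    with open(path, 'wb') as f:
        f.write(text.encode(encoding))
    return path


nicknames = []
with tempfile.TemporaryDirectory() as tmp:
    for encoding in ('UTF-8', 'ISO-8859-1'):
        # The parser gets the file name and reads the declaration itself.
        tree = ET.parse(write_sample(tmp, encoding))
        nicknames.append(tree.getroot().get('nickname'))

print(nicknames)  # ['Mäxchen', 'Mäxchen']
```

The two files contain different bytes for ä, yet the calling code is identical and never mentions an encoding.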
