Parsing XML with REGEX in Java


Question

Given the below XML snippet I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is: [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic and so cannot be expressed literally in the regex. The tag names TargetCenter and Trace are static and could be in the regex but if there is a way to avoid hardcoding that would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed. Its problem is that it erroneously includes {Trace:7.19879} in the results. Relying on newlines within the XML or any other apparent formatting is not an option.
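To make the problem concrete, here is a minimal sketch that runs the pattern above against a condensed, single-line version of the sample XML. The class name `OriginalRegexDemo` and the helper `matchAll` are hypothetical and exist only for this illustration. The pattern has no notion of which parent element it is inside, so any simple element whose text consists only of letters, digits, and dots is matched. TargetCenter is skipped only because its value contains a hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalRegexDemo {
    // The pattern from the question, unchanged.
    private static final Pattern PATTERN_1 =
            Pattern.compile("<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</");

    // Hypothetical helper: returns every match as "name:value".
    static List<String> matchAll(CharSequence cs) {
        List<String> out = new ArrayList<>();
        Matcher m = PATTERN_1.matcher(cs);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Condensed sample without line breaks, since formatting cannot be relied on.
        String xml = "<DataElements><EmpStatus>2.0</EmpStatus></DataElements>"
                + "<InteractionElements><TargetCenter>92f4-MPA</TargetCenter>"
                + "<Trace>7.19879</Trace></InteractionElements>";
        // Prints [EmpStatus:2.0, Trace:7.19879]
        System.out.println(matchAll(xml));
    }
}
```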

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to only include data elements and ignore the rest?

Recommended answer

This should work in Java, if you can assume that everything between the DataElements tags has the form <tag>value</tag>, i.e. no attributes and no nested elements.

// Step 1: isolate everything between the DataElements tags (DOTALL lets '.' span line breaks).
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
Matcher matcher = regex.matcher(subjectString);
// Step 2: match <tag>value</tag> pairs; \\1 requires the closing tag to repeat the opening name.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}
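For completeness, the same two-step approach can be sketched as a self-contained class. This is an illustration under assumptions, not the asker's actual code: the class name `DataElementsExtractor` and the method `extract` are hypothetical, and `Map.Entry` stands in for the `DataElement` type, whose definition is not shown in the question.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataElementsExtractor {
    // Step 1: capture the body of the DataElements block; DOTALL lets '.' cross line breaks.
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Step 2: match <tag>value</tag>; the backreference \1 ties the closing tag to the opening one.
    private static final Pattern PAIR =
            Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    // Map.Entry is an assumed stand-in for the question's DataElement class.
    static List<Map.Entry<String, String>> extract(CharSequence xml) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        Matcher block = BLOCK.matcher(xml);
        if (block.find()) {
            // Only the text inside DataElements is searched, so Trace is never seen.
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                result.add(new SimpleImmutableEntry<>(pair.group(1), pair.group(2)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<StandardDataObject xmlns=\"myns\">\n"
                + "  <DataElements>\n"
                + "    <EmpStatus>2.0</EmpStatus>\n"
                + "    <Expenditure>95465.00</Expenditure>\n"
                + "    <StaffType>11.A</StaffType>\n"
                + "    <Industry>13</Industry>\n"
                + "  </DataElements>\n"
                + "  <InteractionElements>\n"
                + "    <TargetCenter>92f4-MPA</TargetCenter>\n"
                + "    <Trace>7.19879</Trace>\n"
                + "  </InteractionElements>\n"
                + "</StandardDataObject>";
        // Prints [EmpStatus=2.0, Expenditure=95465.00, StaffType=11.A, Industry=13]
        System.out.println(extract(xml));
    }
}
```

Note that `[^<>]+` requires a non-empty value, so an empty element such as `<Industry></Industry>` would be skipped by this sketch; changing the second `+` to `*` would include it with an empty string.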
