Parsing XML with REGEX in Java

Question

Given the XML snippet below, I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control, so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is: [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic, so they cannot be written literally in the regex. The tag names TargetCenter and Trace are static and could appear in the regex, but a way to avoid hardcoding them would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed. Its problem is that it erroneously includes {Trace:7.19879} in the results, because nothing restricts it to the children of DataElements. Relying on newlines within the XML or any other apparent formatting is not an option.

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to include only the data elements and ignore the rest?

Recommended answer

This should work in Java, if you can assume that everything between the DataElements tags has the form <tag>value</tag>, i.e. no attributes and no nested elements.

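// Needs java.util.regex.Pattern and java.util.regex.Matcher.
// subjectString is the XML input; list is the result list from the question's code.
// First isolate the content of <DataElements>; DOTALL lets '.' match line breaks.
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
Matcher matcher = regex.matcher(subjectString);
// Then match each <name>value</name> pair; \\1 is a backreference to the opening tag name.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}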
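
For anyone who wants to try the two-step approach in isolation, here is a minimal self-contained sketch. The class name DataElementsDemo and the extract method are hypothetical, and it returns plain "name:value" strings because the question's DataElement class is not shown. It relies on the same assumption as above: children of DataElements have no attributes and no nested elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answer.
public class DataElementsDemo {
    // Step 1: isolate the body of <DataElements>; DOTALL lets '.' span line breaks.
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Step 2: match <name>value</name>; \\1 is a backreference to the opening tag name.
    private static final Pattern PAIR = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    // Returns "name:value" strings for each simple child of DataElements.
    public static List<String> extract(CharSequence xml) {
        List<String> pairs = new ArrayList<>();
        Matcher block = BLOCK.matcher(xml);
        if (block.find()) {
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                pairs.add(pair.group(1) + ":" + pair.group(2));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The question's XML, written on one line to show that formatting does not matter.
        String xml = "<?xml version=\"1.0\"?>"
                + "<StandardDataObject xmlns=\"myns\"><DataElements>"
                + "<EmpStatus>2.0</EmpStatus><Expenditure>95465.00</Expenditure>"
                + "<StaffType>11.A</StaffType><Industry>13</Industry>"
                + "</DataElements><InteractionElements>"
                + "<TargetCenter>92f4-MPA</TargetCenter><Trace>7.19879</Trace>"
                + "</InteractionElements></StandardDataObject>";
        // Prints [EmpStatus:2.0, Expenditure:95465.00, StaffType:11.A, Industry:13]
        System.out.println(extract(xml));
    }
}
```

Because the second pattern only ever sees the text captured between the DataElements tags, TargetCenter and Trace are excluded without being named in either regex.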
