Discard HTML tags within custom tags while getting text in XHTML using SAX Parser in Groovy

Problem description

I am trying to get the text between my custom tags. So far this has worked, but when there are special characters or HTML tags inside my custom tags, I am unable to get the full text. The sample XML looks like this:

<records>
      <car name='HSV Maloo' make='Holden' year='2006'>
        <ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
        <ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
      </car>
      <car name='P50' make='Peel' year='1962'>
        <ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
        <ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
      </car>
      <car name='Royale' make='Bugatti' year='1931'>
        <ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
        <ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
      </car>
    </records>

The output that I am getting is

[Australia, Isle of Man, France]
[., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]

As you can see, 'Accounting Terms' is missing; all I get is a dot. How do I correct this?

The SAX parser code:

import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.*

class SAXXMLParser extends DefaultHandler {
    def DefinedTermTitles = []
    def ClauseTitles = []
    def currentMessage
    def countryFlag = false

    void startElement(String ns, String localName, String qName, Attributes atts) {
        switch (qName) {
            case 'ae_clauseTitleBegin':
            //messages.add(currentMessage)
                countryFlag = true;
                break

            case 'ae_definedTermTitleBegin':
                //messages.add(currentMessage)
                countryFlag = true; 
                break           
         }      
    }   

    void characters(char[] chars, int offset, int length) {
        if (countryFlag) {
            currentMessage = new String(chars, offset, length)
            println(currentMessage)
        }
    }

    void endElement(String ns, String localName, String qName) {
        switch (qName) {        
            case 'ae_clauseTitleEnd':
                ClauseTitles.add(currentMessage)
                countryFlag = false;
                break
            case 'ae_definedTermTitleEnd':
                DefinedTermTitles.add(currentMessage)
                countryFlag = false; 
                break
         }
    }
}

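Solution

I'm not familiar with Groovy, so here is a solution in Java. I believe the translation is straightforward. The key change is to accumulate the text in a StringBuilder: the parser calls characters() separately for each run of text, and the original code kept only the last run.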

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxHandler extends DefaultHandler {
    ArrayList<String> DefinedTermTitles = new ArrayList<>();
    ArrayList<String> ClauseTitles = new ArrayList<>();
    String currentMessage;
    boolean countryFlag = false;
    StringBuilder message = new StringBuilder();

    public void startElement(String ns, String localName, String qName, Attributes atts) {
        switch (qName) {
            case "ae_clauseTitleBegin":
                countryFlag = true;
                break;

            case "ae_definedTermTitleBegin":
                countryFlag = true; 
                break;           
         }      
    }   

    public void characters(char[] chars, int offset, int length) {
        if (countryFlag) {
            message.append(new String(chars, offset, length));
        }
    }

    public void endElement(String ns, String localName, String qName) {
        switch (qName) {        
            case "ae_clauseTitleEnd":
                ClauseTitles.add(message.toString());
                countryFlag = false;
                message.setLength(0);
                break;

            case "ae_definedTermTitleEnd":
                DefinedTermTitles.add(message.toString());
                countryFlag = false; 
                message.setLength(0);
                break;
         }
    }

    public static void main(String[] argv) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            String path = "INPUT_PATH_HERE";
            InputStream xmlInput = new FileInputStream(path + "test.xml");
            SAXParser saxParser = factory.newSAXParser();
            SaxHandler handler   = new SaxHandler();
            saxParser.parse(xmlInput, handler);

            System.out.println(handler.DefinedTermTitles);
            System.out.println(handler.ClauseTitles);

        } catch (Exception err) {
            err.printStackTrace ();
        }
    }
}

Output

[Australia, Isle of Man, France]
[1.02 Accounting Terms., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
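
To see why the original handler ended up with only a dot, here is a minimal sketch that records every chunk passed to characters(). The class name CharactersChunkDemo and the helper collectChunks are hypothetical names chosen for illustration, and the XML is a shortened copy of the first clause from the question. The nested <u> element splits the text into separate callbacks, so a handler that overwrites its field on each call keeps only the last piece.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original answer.
final class CharactersChunkDemo {

    // Returns every chunk of text the parser passes to characters(), in order.
    static List<String> collectChunks(String xml) {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] chars, int offset, int length) {
                chunks.add(new String(chars, offset, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Shortened sample: one clause title containing a nested <u> element.
        String xml = "<car><ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>."
                + "<ae_clauseTitleEnd /></car>";
        List<String> chunks = collectChunks(xml);
        // One line per characters() call; the text arrives in several pieces.
        for (String chunk : chunks) {
            System.out.println("chunk:  '" + chunk + "'");
        }
        // Appending the pieces, as the StringBuilder does, restores the full text.
        System.out.println("joined: '" + String.join("", chunks) + "'");
    }
}
```

A parser is also allowed to split a single run of text across more than one characters() call, so appending to a buffer is the safer habit even when no nested tags are involved.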
