Parsing XML with jsoup (while avoiding <p> tags)


Question

This question is very similar in nature to this one, but for Java instead of Python.

<body.content>
  <block class="lead_paragraph">
    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
  </block>
  <block class="full_text">
    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>
  </block>

What I'm trying to do is extract the text of the sentence, without any of the XML formatting, using jsoup.

So I'm looking for:

LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.


Update

In fact, my situation is a bit different, because I've got some additional XML markup that I'd like to keep, i.e. the <PERSON> tags:

 <block class="full_text">
    <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p>
 </block></body.content></body></nitf>

The ideal output would be:

SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON>

My attempt so far:

BufferedReader br = new BufferedReader(new FileReader(filename));
try 
{
  StringBuilder sb = new StringBuilder();
  String line = br.readLine();

  while (line != null) 
  {
    sb.append(line);
    sb.append(System.lineSeparator());
    line = br.readLine();
  }
  String everything = sb.toString();

  Document doc = Jsoup.parse(everything);
  String link = doc.select("block.full_text").text();
  System.out.println(link);      
}
finally 
{
  br.close();
}

Recommended answer

You can do this in jsoup:

String html = "<body.content>\n"
        + "  <block class=\"lead_paragraph\">\n"
        + "    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
        + "  </block>\n"
        + "  <block class=\"full_text\">\n"
        + "    <p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p>\n"
        + "  </block>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").text();
System.out.println(link);

Output:

LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
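
The question builds the input string by hand with a BufferedReader. As a minimal sketch, jsoup can also read the file itself through `Jsoup.parse(File, String)`. The class name `FullTextFromFile`, the helper `fullText`, and the UTF-8 charset are assumptions made for illustration; they do not come from the original post.

```java
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FullTextFromFile {

    // Hypothetical helper: returns the plain text inside <block class="full_text">.
    static String fullText(File xmlFile) throws IOException {
        // jsoup reads and decodes the file; UTF-8 is an assumption about the input.
        Document doc = Jsoup.parse(xmlFile, "UTF-8");
        // text() strips all markup and normalizes whitespace.
        return doc.select("block.full_text").text();
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to the XML file as the first argument.
        System.out.println(fullText(new File(args[0])));
    }
}
```

This keeps the same selector as the answer and only changes how the document is loaded.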

Update:

String html = "<block class=\"full_text\">\n"
        + "    <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, <PERSON>Roni</PERSON>, <PERSON>Sandy</PERSON>, <PERSON>Jarret</PERSON>, <PERSON>Greg</PERSON>, <PERSON>Kate</PERSON>, and <PERSON>Auden Gray</PERSON></p></block></body.content></body></nitf>";
Document doc = Jsoup.parse(html);
String link = doc.select("block.full_text").html();
System.out.println(link);

Output:

<p>SCHEINMAN--
 <person>
  Alan
 </person>. Happy Birthday. Thirteen years, many tears. Loving memories of your smile, humor, and laughter comfort us. You are always in our hearts. Love, 
 <person>
  Roni
 </person>, 
 <person>
  Sandy
 </person>, 
 <person>
  Jarret
 </person>, 
 <person>
  Greg
 </person>, 
 <person>
  Kate
 </person>, and 
 <person>
  Auden Gray
 </person></p>
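
The output above still contains the wrapping <p> element, is pretty-printed across many lines, and has its tag names lowercased, because `Jsoup.parse(String)` uses the HTML parser. A possible variation, sketched here under assumptions rather than taken from the original answer, is to parse with jsoup's XML parser (which preserves tag case), turn off pretty-printing, and select the <p> element itself so that `html()` returns only its inner markup. The class name `KeepPersonTags`, the helper `innerMarkup`, and the shortened sample string are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class KeepPersonTags {

    // Hypothetical helper: inner markup of the <p> inside <block class="full_text">.
    static String innerMarkup(String xml) {
        // The XML parser keeps tag names such as <PERSON> in their original case.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // Avoid the indentation and line breaks added by pretty-printing.
        doc.outputSettings().prettyPrint(false);
        // Selecting the <p> itself means html() returns its children only, without <p>.
        return doc.select("block.full_text > p").html();
    }

    public static void main(String[] args) {
        // Shortened version of the sample from the question, including the stray </PERSON>.
        String xml = "<block class=\"full_text\">\n"
                + "    <p>SCHEINMAN</PERSON>--<PERSON>Alan</PERSON>. Happy Birthday. "
                + "Love, <PERSON>Roni</PERSON></p>\n"
                + "</block>";
        System.out.println(innerMarkup(xml));
    }
}
```

As in the answer's output, the unmatched </PERSON> after SCHEINMAN is dropped by the parser, so it will not appear in the result. If a block holds several <p> elements, `html()` on the selection joins their inner markup with newlines.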
