XPath or XQuery to exclude article sections which only contain lists

Question

I am trying to extract the sections of an article (Introduction, History, Overview, ...). I am looking for an XPath expression that selects every section that begins with a heading and contains some paragraphs. Sections that contain only a list should be discarded.

For example:

<h2>Intro</h2>
<p> It has paragraph and should be extracted </p>
.....
<h2>References </h2>
<ul>...It has just list and should be discarded </ul>
<h2>...</h2>
....

If this is not possible in XPath, XQuery would also work. I tried the following XQuery:

for $x in doc("test.xq")//h2
return
   <section>{$x/following-sibling::*[preceding-sibling::h2[1] is $x]}</section>

It selects the sections the way I want, but I could not add the condition that a section must not consist only of a ul.

Recommended answer

You mention in another question that you are using BaseX, which supports the XQuery 3.0 group by clause, so how about this:

for $x in doc("test.xq")//h2/following-sibling::*[not(self::h2)]
group by $hId := generate-id($x/preceding-sibling::h2[1])
return
  if ($x[not(self::ul)]) then
    <section>{($x/preceding-sibling::h2[1], $x)}</section>
  else ()

Here I first find all the non-h2 elements that we want to gather together (there may be a more efficient way to do this, depending on the structure of your XML). The group by clause then means that on each "iteration" the $x variable is the sequence of non-h2 elements between one h2 and the next. The if condition checks whether at least one element in that group is not a ul.
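To try the grouping without an external file, the same query can be run against an inline document. This is only a sketch: the body wrapper element and the third "History" section are assumptions added for illustration and are not part of the original question.

```xquery
(: Hypothetical sample input: the <body> wrapper and the "History"
   section are made up for this sketch. :)
let $doc :=
  <body>
    <h2>Intro</h2>
    <p>It has paragraph and should be extracted</p>
    <h2>References</h2>
    <ul><li>It has just list and should be discarded</li></ul>
    <h2>History</h2>
    <ul><li>A list ...</li></ul>
    <p>... followed by a paragraph, so the section is kept</p>
  </body>
(: Every non-h2 element that follows some h2 on the same level :)
for $x in $doc/h2/following-sibling::*[not(self::h2)]
(: One group per nearest preceding h2; after this clause $x is the
   whole sequence of elements belonging to that heading :)
group by $hId := generate-id($x/preceding-sibling::h2[1])
return
  (: Keep the group only if it contains something other than a ul :)
  if ($x[not(self::ul)]) then
    <section>{($x/preceding-sibling::h2[1], $x)}</section>
  else ()
```

With this input the query should return a section for "Intro" and one for "History" and skip "References". Without an explicit order by clause, the order in which the groups are returned may depend on the implementation.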
