Get the text in anchors within text nodes

Problem description

I am parsing product reviews on Amazon and I would like to get the complete text of a review, including the text inside links.

I am currently using jSoup and, as good as it is, it just ignores anchors. Of course I could get all the text from the anchors with a selector, but then I would lose the context in which that text appeared.

I think an example is the best way to explain what I mean.

Example structure:

<div class="container">
  <div style="a">Something...</div>
  <div style="b">...Nested spans and divs... </div>
  <div class="tiny">_____ </div>
  " From the makers of the incredible <a href="SOMELINK">SOMEPRODUCT</a> we have this other product that blablabla.... Amazing specs, but <a href="SOME_OTHER_LINK">this other product</a> is somehow better".
</div>

What I obtain: " From the makers of the incredible we have this other product that blablabla... Amazing specs, but is somehow better".

What I want: " From the makers of the incredible SOMEPRODUCT we have this other product that blablabla... Amazing specs, but this other product is somehow better".

My code using jSoup:

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
  String reviewText = container.ownText(); // THIS EXCLUDES TEXT FROM LINKS
  StdOut.println(reviewText);
}

I can't find a way of doing that because it doesn't look like jSoup treats text nodes as actual nodes, and therefore those anchors do not seem to be considered among the children of the next nodes.

I am also open to other ideas, such as using the :not selector to get them, but I can't believe that jSoup does not allow keeping the text from links; this need is far too common for them to have ignored it.

Recommended answer

"it doesn't look like jSoup treats text nodes as actual nodes,"

No. In JSoup, text nodes are actual nodes, as are elements.

The way you described the problem, you have a very specific requirement, and I agree there is no built-in that does exactly what you want in a single call. However, with a simple helper method the problem is solvable.

First let's review the problem. The parent div has the following children:

div div div #text a #text a #text
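
To see this shape concretely, here is a minimal sketch that lists the direct children of an element by node name. It deliberately uses the JDK's built-in XML DOM (org.w3c.dom) as a stand-in, not JSoup, so it runs without extra dependencies; the class name ChildNodeNames, the method childNodeNames, and the shortened sample markup are made up for illustration. In JSoup the analogous calls would be Element.childNodes() and Node.nodeName().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildNodeNames {
    // Hypothetical helper: parses a well-formed XML fragment with the JDK parser
    // (not JSoup) and returns the node names of the root element's direct
    // children, in document order.
    static List<String> childNodeNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Text nodes report "#text"; elements report their tag name.
                names.add(child.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the sample structure, written as well-formed XML.
        String xml = "<div class=\"container\"><div class=\"tiny\">_____</div>"
                + "From the makers of the incredible <a href=\"SOMELINK\">SOMEPRODUCT</a>"
                + " we have this other product</div>";
        System.out.println(childNodeNames(xml));
    }
}
```

For this shortened fragment the list should come out as div, #text, a, #text: the loose review text and the anchors are siblings at the same level, which is why a walk over siblings can collect both.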

And of course each of the div and a elements has other children, including text nodes. Based on your example, it sounds like you want to iterate through all children, ignoring any that are not text nodes. When you find the first text node, gather its text and the text of all the nodes that follow it.

Certainly doable, but I am not surprised there is no built-in method that does this.

Here is one implementation to solve the problem:

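   import java.util.List;
   import org.jsoup.nodes.Element;
   import org.jsoup.nodes.Node;
   import org.jsoup.nodes.TextNode;

   // Returns the text of elem starting at its first direct text node,
   // including the text of every sibling (text node or element) after it.
   public static String textPlus(Element elem)
   {
      List<TextNode> textNodes = elem.textNodes();
      if (textNodes.isEmpty())
         return "";

      StringBuilder result = new StringBuilder();
      // start at the first text node and walk the remaining siblings
      Node currentNode = textNodes.get(0);
      while (currentNode != null)
      {
         if (currentNode instanceof TextNode)
         {
            // plain text: append it as is
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
         }
         else if (currentNode instanceof Element)
         {
            // element such as <a>: append its deep text
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
         }
         currentNode = currentNode.nextSibling();
      }
      return result.toString();
   }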

Call the method like this:

Elements allContainers = doc.select(".container");
for (Element container : allContainers) {
  String reviewText = textPlus(container);
  StdOut.println(reviewText);
}

Given your sample HTML, this code returns:

" From the makers of the incredible SOMEPRODUCT we have this other product that blablabla.... Amazing specs, but this other product is somehow better."

Hope this helps.
