Extracting context from a set point in the middle of an HTML file

Problem description

I have some HTML, and I'm extracting a snippet at a certain point (an inline image), but I'd like to show some context around this image.

I'm using PHP, and I know that both Symfony and WordPress provide functions for dealing with what happens when you chop up text in the middle of some HTML (they close all open tags), but nothing for dealing with snippets in the other direction.

So, for:

 'Snippet of text and a <a href="#moo">link right her'

I can use the above-mentioned functions to fix it, but what about:

'nk right here</a> and then more text after the link.'

I've considered the possibility that even the tag-closing approach is probably the wrong way to go about this, and that I should instead be using XPath to parse the HTML. However, I can't find any examples or mentions of using XPath to create snippets like this.

Update:

So my current thinking is:

  1. Move up the parse tree until I get to the tag that encloses all the content (div class=post in my case). The last node that I have before this div is the starting point (most likely a p tag).

  2. From here, get the previous sibling (which should be a p tag again).

  3. Descend into this node and get its last child, saving the text content to a temporary string. Keep stepping back through these children until we have enough of a snippet.

This still isn't ideal, as I'm not sure how far I'll have to step down to get the text content.
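The steps above can be sketched with PHP's built-in DOM extension. This is only a rough illustration, not a tested solution: the sample markup and the helper names `previousInDocument` and `textBefore` are made up for the example, and it assumes PHP 7.1 or later. Instead of guessing how deep to descend, it walks backward in document order from the image and prepends every text node it meets until the string is long enough (measured in bytes with `strlen`).

```php
<?php
// Hypothetical sample markup; only the "post" class comes from the question.
$html = '<div class="post"><p>First paragraph.</p>'
      . '<p>Text <a href="#moo">link right here</a> before '
      . '<img src="x.png" alt=""> and after.</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// Return the node that precedes $node in document order, staying inside $root.
function previousInDocument(DOMNode $node, DOMNode $root): ?DOMNode
{
    if ($node->isSameNode($root)) {
        return null;
    }
    if ($node->previousSibling !== null) {
        // Step to the previous sibling, then descend to its deepest last child.
        $node = $node->previousSibling;
        while ($node->lastChild !== null) {
            $node = $node->lastChild;
        }
        return $node;
    }
    // No earlier sibling: move up, but stop once the parent is the root.
    $parent = $node->parentNode;
    return ($parent === null || $parent->isSameNode($root)) ? null : $parent;
}

// Collect text that precedes $start until at least $minBytes are gathered.
function textBefore(DOMNode $start, DOMNode $root, int $minBytes): string
{
    $text = '';
    $node = previousInDocument($start, $root);
    while ($node !== null && strlen($text) < $minBytes) {
        if ($node instanceof DOMText) {
            $text = $node->nodeValue . $text; // prepend to keep reading order
        }
        $node = previousInDocument($node, $root);
    }
    return $text;
}

$xpath = new DOMXPath($doc);
$root  = $xpath->query('//div[@class="post"]')->item(0);
$img   = $xpath->query('//div[@class="post"]//img')->item(0);

$snippet = textBefore($img, $root, 20);
echo $snippet, "\n";
```

Text nodes are added whole, so the result can run past the requested length, and text from adjacent paragraphs is joined without a separator. Trimming to a word boundary would be a separate step.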

Does anyone know of an implementation of this idea anywhere?

Recommended answer

This isn't a complete answer, but you can use an XPath query to get just the node(s) you're interested in, then use the nextSibling and previousSibling properties (in whatever form the extension supports) to get context for the node(s).
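As a minimal sketch of that suggestion, assuming PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) and a made-up sample document, the following finds the image with XPath, climbs to the direct child of the enclosing div, and reads the neighbouring siblings. The `$context` array and its keys are illustrative only.

```php
<?php
// Hypothetical sample markup; only the "post" class comes from the question.
$html = '<div class="post"><p>First paragraph.</p>'
      . '<p>Before <img src="cat.png" alt="cat"> after.</p>'
      . '<p>Last paragraph.</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$img   = $xpath->query('//div[@class="post"]//img')->item(0);

// Climb until the parent is the enclosing div, so $block is its direct child.
$block = $img;
while ($block->parentNode !== null
    && !($block->parentNode instanceof DOMElement
        && $block->parentNode->getAttribute('class') === 'post')) {
    $block = $block->parentNode;
}

// Either sibling may be null, or a whitespace-only text node in real markup.
$before = $block->previousSibling;
$after  = $block->nextSibling;

$context = [
    'before' => $before !== null ? trim($before->textContent) : '',
    'image'  => $doc->saveHTML($block),
    'after'  => $after !== null ? trim($after->textContent) : '',
];

print_r($context);
```

With real-world markup the siblings are often whitespace text nodes between block elements, so a loop that skips empty siblings would probably be needed before this is useful.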
