Get character offsets for elements in jsoup


Question

I need to map jsoup elements back to specific character offsets in the source HTML. In other words, if I have HTML that looks like this:

Hello <br/> World

I need to know that "Hello " starts at offset 0 and has a length of 6 characters, that <br/> starts at offset 6 and has a length of 5 characters, and so on.

I could not find a getter in the Element javadoc that returns this information. Can it be retrieved?

Recommended answer

I don't believe jsoup has this functionality. This question seems closer to lexical analysis than to HTML parsing.

I would write a grammar, then write a lexer against that grammar that tokenizes the HTML and supplies the offsets you're looking for.

First, parse the document with jsoup to verify that it is valid HTML.

Then, lexically analyze the document against a grammar. A grammar might look like this:

Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}

optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""

optional-closing-tag := "</" {literal} ">" | ""

literal := any string of characters not beginning with whitespace, or containing "<"

Store each token you find in an object that holds the token text, the index of its first character, and its length.
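The token-recording step could be sketched roughly as follows. This is a minimal illustration, not a full HTML lexer: it uses a single regular expression in place of the grammar above, and the names OffsetLexer, Token, and tokenize are hypothetical. It assumes the input contains no comments, scripts, or attribute values with ">" in them, and it silently skips a stray "<" that has no closing ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetLexer {
    /** A token together with its position in the source string (hypothetical helper type). */
    static final class Token {
        final String text;   // the raw token as it appears in the source
        final int offset;    // index of the token's first character
        final int length;    // number of characters in the token
        final boolean isTag; // true for "<...>" runs, false for plain text

        Token(String text, int offset, int length, boolean isTag) {
            this.text = text;
            this.offset = offset;
            this.length = length;
            this.isTag = isTag;
        }

        @Override
        public String toString() {
            return (isTag ? "TAG  " : "TEXT ") + "offset=" + offset
                    + " length=" + length + " \"" + text + "\"";
        }
    }

    // Simplifying assumption: a token is either a tag-like run "<...>"
    // or a run of characters that contains no "<".
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String text = m.group();
            // m.start() is the character offset of this token in the source.
            tokens.add(new Token(text, m.start(), text.length(), text.startsWith("<")));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Hello <br/> World")) {
            System.out.println(t);
        }
    }
}
```

For the sample input from the question, this yields "Hello " at offset 0 with length 6 and <br/> at offset 6 with length 5, matching the offsets asked for. Mapping these tokens back onto jsoup elements would still need a separate matching step.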
