Can I use jsoup to determine whether an HTML attribute is enclosed in single or double quotes (or none)?

Question

I'm using jsoup to parse HTML documents and perform some analysis on them.

After parsing, is there any way to determine whether a given attribute was enclosed in double quotes, single quotes, or no quotes?

In other words, is there any way I could distinguish the following:

Document foo = Jsoup.parse("<html><body><a name=\"value\"></body></html>");
Document bar = Jsoup.parse("<html><body><a name='value'></body></html>");
Document baz = Jsoup.parse("<html><body><a name=value></body></html>");

Ideally, Attribute would have booleans isDoubleQuoted(), isSingleQuoted(), and isUnquoted(), or similar.

It appears that Jsoup simply discards that information during parsing, which is quite sad, because I need to know for my analysis.

But maybe I'm missing something? :)

Note that I can't simply use a regex on the original string. The documents I'm analysing can be arbitrarily complex, and any given attribute (i.e., key/value pair) may appear more than once within the document. Thus, simply "grepping" for a key/value mapping (e.g., checking whether the string that I parse with jsoup contains name=value or name='value' or name="value") wouldn't tell me which occurrence belongs to which element. It is an unsatisfactory approximation, but one I will probably have to live with until there is a better solution.
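For reference, the "grep" approximation described above could look roughly like the following. This is only a sketch: `QuoteStyleGrep`, `QuoteStyle`, and `count` are hypothetical names, not part of jsoup, and the regex works on the raw text, so it will also match inside comments, scripts, or text content.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): counts how often a given
// key/value pair occurs with each quoting style in the raw HTML string.
class QuoteStyleGrep {

    enum QuoteStyle { DOUBLE, SINGLE, NONE }

    static Map<QuoteStyle, Integer> count(String html, String key, String value) {
        String k = Pattern.quote(key);
        String v = Pattern.quote(value);
        // Group 1: double-quoted, group 2: single-quoted, group 3: unquoted.
        // The key is matched case-insensitively, the value case-sensitively.
        // An unquoted value must be followed by whitespace or '>'.
        Pattern p = Pattern.compile(
            "\\s(?i:" + k + ")\\s*=\\s*(?:\"(" + v + ")\"|'(" + v + ")'|(" + v + ")(?=[\\s>]))");

        Map<QuoteStyle, Integer> counts = new EnumMap<>(QuoteStyle.class);
        for (QuoteStyle s : QuoteStyle.values()) {
            counts.put(s, 0);
        }

        Matcher m = p.matcher(html);
        while (m.find()) {
            QuoteStyle style = m.group(1) != null ? QuoteStyle.DOUBLE
                             : m.group(2) != null ? QuoteStyle.SINGLE
                             : QuoteStyle.NONE;
            counts.merge(style, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String html = "<html><body><a name=\"value\"><a name='value'><a name=value></body></html>";
        System.out.println(count(html, "name", "value"));
    }
}
```

The result is a per-document tally only. It cannot say which of several identical attributes in the parsed tree used which style, which is exactly the limitation noted above.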

Recommended answer

Just in case anyone's interested: I've had a closer look into jsoup and confirmed that the information about how any particular attribute's value was quoted is discarded during parsing. It is (necessarily) available during parsing, of course, but it is basically thrown away and not stored in the resulting DOM tree.
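To illustrate why the quoting style is necessarily known while parsing, here is a minimal hand-written scanner for a single start tag that keeps the style next to each value. It is a simplified sketch, not jsoup's tokenizer and not the code from any pull request; `QuotedAttributeScanner`, `Attr`, and `QuoteStyle` are made-up names, and entity decoding and most error handling are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not jsoup code): scans one start tag and
// records how each attribute value was quoted.
class QuotedAttributeScanner {

    enum QuoteStyle { DOUBLE, SINGLE, NONE }

    static final class Attr {
        final String key;
        final String value;
        final QuoteStyle style;

        Attr(String key, String value, QuoteStyle style) {
            this.key = key;
            this.value = value;
            this.style = style;
        }
    }

    // Expects a single start tag such as <a name='value' id=x>.
    static List<Attr> scan(String tag) {
        List<Attr> out = new ArrayList<>();
        int n = tag.length();
        int i = 1; // skip '<'
        // Skip the tag name.
        while (i < n && !Character.isWhitespace(tag.charAt(i)) && tag.charAt(i) != '>') i++;

        while (i < n) {
            while (i < n && Character.isWhitespace(tag.charAt(i))) i++;
            if (i >= n || tag.charAt(i) == '>' || tag.charAt(i) == '/') break;

            // Attribute name runs until '=', whitespace, '/', or '>'.
            int keyStart = i;
            while (i < n && "=> \t\n\r\f/".indexOf(tag.charAt(i)) < 0) i++;
            String key = tag.substring(keyStart, i);

            while (i < n && Character.isWhitespace(tag.charAt(i))) i++;
            if (i >= n || tag.charAt(i) != '=') {
                // Attribute without a value; recorded as NONE in this sketch.
                out.add(new Attr(key, "", QuoteStyle.NONE));
                continue;
            }
            i++; // skip '='
            while (i < n && Character.isWhitespace(tag.charAt(i))) i++;

            char c = i < n ? tag.charAt(i) : '>';
            if (c == '"' || c == '\'') {
                // Quoted value: the opening quote character is the style.
                int end = tag.indexOf(c, i + 1);
                if (end < 0) end = n; // unterminated quote: take the rest
                QuoteStyle style = (c == '"') ? QuoteStyle.DOUBLE : QuoteStyle.SINGLE;
                out.add(new Attr(key, tag.substring(i + 1, end), style));
                i = end + 1;
            } else {
                // Unquoted value: runs until whitespace or '>'.
                int start = i;
                while (i < n && !Character.isWhitespace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                out.add(new Attr(key, tag.substring(start, i), QuoteStyle.NONE));
            }
        }
        return out;
    }
}
```

The scanner has to look at the quote character to find where the value ends, so keeping it costs one extra field per attribute. A parser that stores only `key` and `value` in the DOM loses this information.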

I created a pull request to add this missing functionality to jsoup: https://github.com/jhy/jsoup/pull/1114.

Not sure how good the chances are of getting a PR into jsoup. The project currently has 40 pending pull requests (including mine), the oldest of which dates back to fall 2011 (seven years ago). On the other hand, some PRs get merged quickly. The most recent PR merge was about two months ago, and that PR was merged mere days after it was submitted. Let's see. Until there is a stable version of jsoup with this functionality added, I can at least use my own fork.
