Jsoup attribute removal on HTML tags

Question

I want to filter certain texts that may contain HTML. I use jsoup to whitelist and clean the tags, which works pretty well.

My only problem is that some of the tags can carry attributes, mostly style or class, but there could also be others (name, target, etc.). When cleaning, this is no problem because they get stripped nicely, but when checking against the whitelist, tags that would otherwise be allowed get rejected because of their attributes. The basic whitelist does not seem to cover style or class attributes, and I cannot be sure what else I will encounter.

Since I want to allow quite a wide range of tags, but remove most of them during cleaning, I don't want to add every attribute for every tag that I'm allowing. The simplest approach would be to strip all attributes from all tags, since I'm not interested in them anyway, and then check whether the stripped text with the plain tags is valid.

Is there a function that removes all attributes, or some simple loop that does it? Another option would be to tell the whitelist to ignore all attributes and match on the tags alone.

Recommended answer

The solution that finally worked for me is quite simple. I iterate through all elements, then through each element's attributes, and remove them from the element. That leaves me with a cleaned version in which I only have to validate the HTML tags themselves. I don't think this is the neatest way to solve the problem, but it does what I wanted.

**Edit**

I got upvoted many times for the old code even though it contained an absolute beginner's bug: you must never remove items from a collection while iterating over that same collection. The bug only triggered when more than one attribute was removed, however.

Updated code with the bug fix:

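import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fragment from inside a method that returns boolean.
// aText is the input HTML; theLegalWhitelist is the whitelist of allowed tags
// (org.jsoup.safety.Whitelist, named Safelist from jsoup 1.14.1 on).
Document doc = Jsoup.parseBodyFragment(aText);
for (Element e : doc.getAllElements()) {
    // Collect the attribute keys first. Removing attributes while iterating
    // over e.attributes() was the bug in the old code; copying the keys into
    // a separate list makes sure ALL attributes are removed.
    List<String> attToRemove = new ArrayList<>();
    for (Attribute a : e.attributes()) {
        attToRemove.add(a.getKey());
    }

    for (String att : attToRemove) {
        e.removeAttr(att);
    }
}

// Validate the attribute-free markup against the whitelist of tags.
return Jsoup.isValid(doc.body().html(), theLegalWhitelist);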
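
The pitfall described in the edit note can be reproduced without jsoup. The following is a minimal sketch that uses a plain `java.util.LinkedHashMap` as a stand-in for an element's attributes; it is not jsoup's `Attributes` class, and the class and method names (`AttributeRemovalDemo`, `removeWhileIteratingFails`, `removeAll`, `attrs`) are hypothetical. With this stand-in, removing during iteration goes unnoticed for a single entry and fails as soon as there is more than one, while copying the keys first always works.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AttributeRemovalDemo {

    // Buggy pattern: removes entries from the map while iterating over it.
    // Returns true if the loop failed with ConcurrentModificationException.
    static boolean removeWhileIteratingFails(Map<String, String> attrs) {
        try {
            for (String key : attrs.keySet()) {
                attrs.remove(key);
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Fixed pattern: snapshot the keys into a separate list, then remove.
    static void removeAll(Map<String, String> attrs) {
        List<String> keys = new ArrayList<>(attrs.keySet());
        for (String key : keys) {
            attrs.remove(key);
        }
    }

    // Helper that builds a stand-in attribute map with the given keys.
    static Map<String, String> attrs(String... keys) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String key : keys) {
            map.put(key, "x");
        }
        return map;
    }

    public static void main(String[] args) {
        // One entry: the loop ends before the modification is detected.
        System.out.println(removeWhileIteratingFails(attrs("class")));
        // Two entries: the second iteration step detects the modification.
        System.out.println(removeWhileIteratingFails(attrs("class", "style")));

        Map<String, String> three = attrs("class", "style", "target");
        removeAll(three);
        System.out.println(three.isEmpty());
    }
}
```

The snapshot list costs one extra allocation per element, which is negligible next to parsing the HTML, and it keeps the removal loop independent of how the underlying collection's iterator behaves.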
