jSoup: How to whitelist tags matching certain class patterns?

Problem description

I am using Whitelist as follows:

        try {
            // Fetch and parse the page, with a 5-second timeout
            Document doc = Jsoup.parse(urls[0], 5000);
            if (doc != null) {
                Whitelist wl = Whitelist.basicWithImages();
                // wl.preserveRelativeLinks(false);
                Cleaner cleaner = new Cleaner(wl);
                cleanedDoc = cleaner.clean(doc);
                if (cleanedDoc != null) {
                    whiteListedHtml = cleanedDoc.html();
                }
            }
        } catch (IOException e) {
            Log.d(TAG, "exception=" + e.getMessage());
        }

Now this is painfully close to what I would like to do, except that there are div tags whose class contains "nav" or "ad", and they fill the page with rubbish. I want to keep div tags in general, but not when the class happens to contain 'nav' or 'ad'.

This makes me think about subclassing Whitelist. Having read the docs at http://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html, I see addTag() and removeTag() (somehow removeTag() is not available, but that's another issue). What I really want is to remove a tag if and only if its class contains certain values in the string, such as 'ad' or 'nav'. The only method that looks hopeful is:

protected boolean isSafeTag(String tag)

Test if the supplied tag is allowed by this whitelist

Parameters:
    tag - test tag 
Returns:
    true if allowed 

So how can I extract the class value from this string in order to test it? Is there any way to do this check without subclassing Whitelist?

Recommended answer

One thing you can do is remove tags like <div class='ad'> manually. First, add the div tag to your whitelist (otherwise the cleaner will remove it):

 Whitelist wl = Whitelist
     .basicWithImages()
     .addTags("div");

After that, select all elements you want to remove and ... simply remove them ^^

doc
    .select("div[class=\"nav\"]")
    .forEach(e -> e.remove());

(you can also use wildcards - see the selector-syntax)
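As a rough sketch of what such wildcard selectors could look like, the snippet below compares a substring match on the class attribute (`[class*=...]`) with a match on whole class names (`div.nav`). The class name `ClassFilterSketch`, the two helper methods, and the sample HTML are made up for illustration and are not part of the original answer. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class ClassFilterSketch {

    // Removes every div whose class attribute contains "nav" or "ad"
    // anywhere in the string (substring match).
    static String stripByClassSubstring(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div[class*=nav], div[class*=ad]").remove();
        return doc.body().html();
    }

    // Removes only divs that carry "nav" or "ad" as a whole class name.
    static String stripByClassName(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.nav, div.ad").remove();
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Made-up sample markup
        String html = "<div class=\"top-nav\">menu</div>"
                + "<div class=\"ad\">buy</div>"
                + "<div class=\"header\">title</div>"
                + "<p>text</p>";
        System.out.println(stripByClassSubstring(html));
        System.out.println(stripByClassName(html));
    }
}
```

Note that the substring form is broad: `[class*=ad]` also matches classes such as "header" or "loading", so the whole-name form may be the safer choice when the unwanted class names are known exactly.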

Afterwards clean the document just like you did.

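Note: you can also do it the other way round: clean first, then remove. In that case the class attribute must be whitelisted for div as well (for example with addAttributes("div", "class")), because the cleaner otherwise strips it and the selector has nothing left to match.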
