Scraping of protected email


Question

I need to scrape email addresses from a website. They are visible in a browser, but when I try to scrape them with requests/BeautifulSoup I get this: "[email protected]"

I can do this with Selenium, but that takes more time. Is it possible to scrape these emails with requests/BeautifulSoup alone? Maybe a library for handling JavaScript is needed.

The email tag:

<span id="signature_email"><a class="__cf_email__" href="/cdn-cgi/l/email-protection" data-cfemail="30425f5e70584346515c5c531e535f5d">[email&#160;protected]</a><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></span></span> <span class="separator">|</span>

Recommended answer

From the cf attributes (data-cfemail, data-cfhash) in the HTML you supplied, I assume you are scraping a site served through Cloudflare. Cloudflare offers a feature that obfuscates email addresses listed on a page: the address is stored in the HTML in encoded form (a simple XOR with a one-byte key, as the script shows), and JavaScript decodes it in the browser. Hence you see the email addresses with Selenium, which runs the script, but not with requests, which only fetches the raw HTML.
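To make the scheme concrete, here is a minimal sketch of the encoding direction, inferred from the script in the question and not from any Cloudflare documentation. The function name encode_cfemail and the sample address are made up for illustration.

```python
def encode_cfemail(address, key):
    """Hypothetical encoder: the key byte first, then each UTF-8 byte XOR key, as hex."""
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in one byte")
    # XOR every byte of the address with the one-byte key.
    payload = bytes(b ^ key for b in address.encode("utf-8"))
    # Prepend the key so a decoder can recover it from the first two hex digits.
    return (bytes([key]) + payload).hex()


# Made-up address and key, for illustration only.
print(encode_cfemail("a@b.c", 0x10))  # prints 107150723e73
```

Because XOR is its own inverse, applying the same key to the payload again recovers the address, which is all the page's script does.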

Since the decoding method can be read directly from the JavaScript, you can write your own decoder in Python.

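In JavaScript (this variant reads the encoded string from the element's class name, whereas the markup in the question stores it in the data-cfemail attribute),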

(function () {
    try {
        var s, a, i, j, r, c, l = document.getElementById("__cf_email__");
        a = l.className;
        if (a) {
            s = '';
            r = parseInt(a.substr(0, 2), 16);
            for (j = 2; a.length - j; j += 2) {
                c = parseInt(a.substr(j, 2), 16) ^ r;
                s += String.fromCharCode(c);
            }
            s = document.createTextNode(s);
            l.parentNode.replaceChild(s, l);
        }
    } catch (e) {}
})();

In Python,

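def decodeEmail(e):
    # e is the hex string from the data-cfemail attribute.
    # Its first two hex digits are the XOR key.
    k = int(e[:2], 16)

    # Every following pair of hex digits is one character XORed with the key.
    de = ""
    for i in range(2, len(e), 2):
        de += chr(int(e[i:i+2], 16) ^ k)

    return de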
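To connect this back to scraping, the sketch below pulls every data-cfemail attribute out of a page and decodes it. It is only an illustration under a few assumptions: the names decode_cfemail, CfEmailParser and extract_emails are made up, the sample markup and address are invented, and it uses the standard library's html.parser so that it runs without extra packages. With BeautifulSoup you could read the same attribute from the tags it finds and pass the value to the decoder.

```python
from html.parser import HTMLParser


def decode_cfemail(encoded):
    """Decode a data-cfemail hex string; the first byte is the XOR key."""
    data = bytes.fromhex(encoded)
    key = data[0]
    # XOR the remaining bytes with the key and read the result as UTF-8,
    # mirroring the decodeURIComponent call in the page's script.
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


class CfEmailParser(HTMLParser):
    """Collects decoded addresses from any tag that carries data-cfemail."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-cfemail" and value:
                self.emails.append(decode_cfemail(value))


def extract_emails(html):
    parser = CfEmailParser()
    parser.feed(html)
    return parser.emails


# Invented sample: "a@b.c" encoded with key 0x10.
sample = '<a class="__cf_email__" data-cfemail="107150723e73">[email&#160;protected]</a>'
print(extract_emails(sample))  # prints ['a@b.c']
```

In a real run, the html argument would be the text of the response you already fetch with requests.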
