Scraping of protected email


Problem description

I need to scrape email addresses from a website. They are visible in a browser, but when I try to scrape them with requests and BeautifulSoup I get this instead: "[email protected]"

I can do this with Selenium, but that takes more time, so I would like to know whether it is possible to scrape these emails with requests and BeautifulSoup. Maybe I need a library that can handle JavaScript?

Email tag:

<span id="signature_email"><a class="__cf_email__" href="/cdn-cgi/l/email-protection" data-cfemail="30425f5e70584346515c5c531e535f5d">[email&#160;protected]</a><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></span></span> <span class="separator">|</span>

Recommended answer

Judging from the cf markers in the HTML you supplied (class __cf_email__, data-cfemail, data-cfhash), I assume you are scraping a site served through Cloudflare. Cloudflare offers a feature that obfuscates email addresses listed on a page (see here): the address is stored in encoded form in the HTML and decoded in the browser by JavaScript. Hence you see the email addresses with Selenium, which runs the script, but not with requests, which only fetches the raw HTML.

Since the decoding routine can easily be read off the JavaScript, you can write your own decoder in Python. The first two hex digits of the encoded value are a one-byte key, and every following pair of hex digits is one character of the address XORed with that key.

In JavaScript:

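(function () {
    try {
        // The placeholder link has class "__cf_email__"; the hex payload is in data-cfemail.
        var s, j, r, c, l = document.querySelector(".__cf_email__");
        var a = l.getAttribute("data-cfemail");
        if (a) {
            s = '';
            // The first byte is the XOR key.
            r = parseInt(a.substr(0, 2), 16);
            // Each following byte is one character XORed with the key.
            for (j = 2; j < a.length; j += 2) {
                c = parseInt(a.substr(j, 2), 16) ^ r;
                s += String.fromCharCode(c);
            }
            // Swap the placeholder element for the decoded address.
            s = document.createTextNode(s);
            l.parentNode.replaceChild(s, l);
        }
    } catch (e) {}
})();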

In Python:

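def decodeEmail(e):
    """Decode a data-cfemail hex string into the plain email address."""
    # The first byte is the XOR key.
    k = int(e[:2], 16)

    # Each following byte is one character XORed with the key.
    de = ""
    for i in range(2, len(e), 2):
        de += chr(int(e[i:i+2], 16) ^ k)

    return de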
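
To connect the decoder to the scraping step, here is a minimal sketch that pulls every data-cfemail attribute out of a page and decodes it. It deliberately uses only the standard library (html.parser) so that it runs without extra packages. The names decode_cfemail, CfEmailCollector, extract_emails and SAMPLE_HTML are hypothetical and chosen for illustration; the sample markup is the anchor tag from the question with the script removed. Following the script embedded in the question's HTML, the decoded bytes are interpreted as UTF-8, which is an assumption that matters only for non-ASCII addresses.

```python
from html.parser import HTMLParser


def decode_cfemail(hex_string):
    """XOR-decode a data-cfemail value; the first byte is the key."""
    raw = bytes.fromhex(hex_string)
    key = raw[0]
    # Assumption: the decoded bytes are UTF-8, as the page's own script implies.
    return bytes(b ^ key for b in raw[1:]).decode("utf-8")


class CfEmailCollector(HTMLParser):
    """Collects decoded addresses from every tag that has a data-cfemail attribute."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        value = dict(attrs).get("data-cfemail")
        if value:
            self.emails.append(decode_cfemail(value))


# Hypothetical sample: the <a> tag from the question, script omitted.
SAMPLE_HTML = (
    '<span id="signature_email"><a class="__cf_email__" '
    'href="/cdn-cgi/l/email-protection" '
    'data-cfemail="30425f5e70584346515c5c531e535f5d">'
    '[email&#160;protected]</a></span>'
)


def extract_emails(html):
    """Return the decoded addresses found in an HTML string."""
    parser = CfEmailCollector()
    parser.feed(html)
    return parser.emails


if __name__ == "__main__":
    print(extract_emails(SAMPLE_HTML))
```

In a real scraper the HTML string would be the text of the response fetched with requests. With BeautifulSoup the same lookup can be written as soup.select("[data-cfemail]"), reading tag["data-cfemail"] from each match and passing it to the decoder.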
