Javascript variable with html code regex email matching

Question

This Python script fails to output the email address example@email.com in this case.

This follows up on my previous post:

How can I use BeautifulSoup or Slimit on a site to output the email address from a javascript variable
http://stackoverflow.com/questions/27682751/how-can-i-use-beautifulsoup-or-slimit-on-a-site-to-output-the-email-address-from

#!/usr/bin/env python

from bs4 import BeautifulSoup
import re

soup = '''
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
'''


soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
  reg = '(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)'
  reg2 = 'mailto:.*'
  secondHalf= re.search(reg, script.text)
  firstHalf= re.search(reg2, script.text)
  secondHalfEmail = secondHalf.group()
  firstHalfEmail = firstHalf.group()
  firstHalfEmail = firstHalfEmail.replace('mailto:', '')
  firstHalfEmail = firstHalfEmail.replace('";', '')
  if firstHalfEmail == secondHalfEmail:
    email = secondHalfEmail
  else:
    if ('>') not in firstHalfEmail:
      if ('>') not in secondHalfEmail:
        if firstHalfEmail != secondHalfEmail:
          email = firstHalfEmail + secondHalfEmail
      else:
        email = firstHalfEmail
    else:
      email = secondHalfEmail

  print(email)

It would be nice if someone could help me.

Thank you.

Recommended answer

Your problem is that you can't find "mailto" in your text, because the first half, "mail", is not on the same line as the second half, "to". The two halves sit in separate string literals that the JavaScript only joins at run time. To solve the problem properly, you only need to know the value of ptr after this code has run.
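A minimal sketch of why the search fails, using only the re module on a shortened copy of the script from the question. The variable names (script_text, literals, joined) are illustrative, and joining every double-quoted literal is an assumption that only holds for input shaped like this one.

```python
import re

# Shortened copy of the script from the question: "mail" and "to:" sit in
# two different string literals on two different lines.
script_text = '''
ptr = "";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
'''

# Searching the raw source finds nothing, because the source text between
# "mail" and "to:" is the end of one literal and the start of the next.
raw_match = re.search(r'mailto:', script_text)

# Joining the contents of the double-quoted literals restores the string
# that the JavaScript would build at run time.
literals = re.findall(r'"([^"]*)"', script_text)
joined = ''.join(literals)
joined_match = re.search(r'mailto:([^>]+)>', joined)

print(raw_match)              # None
print(joined_match.group(1))  # example@email.com
```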

I know that this is a bad way to do it, but if you are sure that the structure is always like this:

soup = """
<script LANGUAGE="JavaScript"> function ...() 
{ var ptr; 
ptr = ""; 
ptr += "..."; 
ptr += "..."; 
ptr += "...";
document.all.something.innerHTML = ptr; 
}
</script> 
"""

you can use this:

soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
    # Match everything between "{ var ptr;" and "document".
    regex = r"{ var ptr;(.*)document"
    code = re.search(regex, script.text, flags=re.DOTALL).groups()[0]
    # This is dangerous: whatever the captured code contains is executed
    # as Python here. If it looks like your example, it runs fine and
    # leaves the final value in ptr.
    exec(code)
    print(ptr)
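One caveat with the exec call: at module level it creates ptr as a global, but inside a function in Python 3 it cannot reliably create a local variable. The sketch below passes an explicit dictionary and reads ptr back from it. The helper name run_extracted and the sample code string are hypothetical stand-ins for the text captured by the regex. This does not make exec safe; it still runs whatever the page contains.

```python
# Hypothetical stand-in for the text captured by the regex above.
code = '''
ptr = "";
ptr += "<a href=mail";
ptr += "to:example@email.com>email</a>";
'''


def run_extracted(source):
    # Collect the assignments in an explicit dict instead of relying on
    # exec() to create a local variable, which Python 3 does not support
    # inside a function.
    namespace = {}
    exec(source, namespace)  # still executes arbitrary code from the page
    return namespace["ptr"]


ptr = run_extracted(code)
print(ptr)
```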

Now you can use either BeautifulSoup or re to parse ptr. If you don't know how it's structured, you can use this:

    mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
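If executing the page's code is not acceptable, the same result can be approximated without exec by concatenating the string literals assigned to ptr. This is a sketch under assumptions, not the answer's method: the function name extract_email is made up for illustration, the `\s*` after the brace is added so the pattern also matches the layout in the question (where `{` and `var ptr;` are on separate lines), and literals containing escaped double quotes are not handled.

```python
import re

# Script body laid out as in the question.
script_text = '''
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
'''


def extract_email(text):
    # \s* lets "{" and "var ptr;" share a line or sit on separate lines.
    body = re.search(r'\{\s*var ptr;(.*?)document', text, flags=re.DOTALL)
    if body is None:
        return None
    # Collect the right-hand side of every `ptr = "..."` / `ptr += "..."`.
    parts = re.findall(r'ptr\s*\+?=\s*"([^"]*)"', body.group(1))
    ptr = ''.join(parts)
    mail = re.search(r'<a href=mailto:(.*?)>', ptr)
    return mail.group(1) if mail else None


print(extract_email(script_text))  # example@email.com
```

Returning None instead of raising keeps the loop over several script tags simple, since most scripts on a page will not contain the ptr block at all.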
