JavaScript variable with HTML code: regex email matching


Problem description




This Python script fails to output the email address example@email.com in this case.

This was my previous post.

How can I use BeautifulSoup or Slimit on a site to output the email address from a javascript variable

#!/usr/bin/env python

from bs4 import BeautifulSoup
import re

soup = '''
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
'''


soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
    reg = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)'
    reg2 = 'mailto:.*'
    secondHalf = re.search(reg, script.text)
    firstHalf = re.search(reg2, script.text)
    secondHalfEmail = secondHalf.group()
    firstHalfEmail = firstHalf.group()
    firstHalfEmail = firstHalfEmail.replace('mailto:', '')
    firstHalfEmail = firstHalfEmail.replace('";', '')
    if firstHalfEmail == secondHalfEmail:
        email = secondHalfEmail
    else:
        if '>' not in firstHalfEmail:
            if '>' not in secondHalfEmail:
                if firstHalfEmail != secondHalfEmail:
                    email = firstHalfEmail + secondHalfEmail
            else:
                email = firstHalfEmail
        else:
            email = secondHalfEmail

    print(email)

It would be nice if someone can help me.

Thank you

Solution

Your problem is that you can't find "mailto" in your text, because the first half, "mail", is not on the same line as the second half, "to". To solve your problem properly, you only have to know the value of ptr at the end of this program.
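
To see the split concretely, here is a minimal sketch that uses only the standard `re` module. `script_text` is a stand-in for `script.text` from the question, and the other variable names are illustrative rather than part of the original code. It assumes the script uses plain double-quoted literals, as in the sample.

```python
import re

# Stand-in for script.text from the question (assumed to match the sample).
script_text = '''
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
'''

# The raw text never contains "mailto:" because the word is split
# across two string literals on separate lines.
found_in_raw = re.search(r"mailto:.*", script_text)

# Joining the double-quoted literals reproduces the final value of ptr,
# where "mail" and "to:" end up adjacent.
joined = "".join(re.findall(r'"([^"]*)"', script_text))
found_in_joined = re.search(r"mailto:.*", joined)

print(found_in_raw)  # None
print(found_in_joined.group())
```

Any fix therefore has to work on the concatenated value of ptr rather than on the raw script text.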

I know that the following is a bad way to do it, but if you are sure that the structure is always like this:

soup = """
<script LANGUAGE="JavaScript"> function ...() 
{ var ptr; 
ptr = ""; 
ptr += "..."; 
ptr += "..."; 
ptr += "...";
document.all.something.innerHTML = ptr; 
}
</script> 
"""

You can use this:

soup = BeautifulSoup(soup)

for script in soup.find_all('script'):
    # This matches everything between "{ var ptr;"
    # and "document"
    regex = "{ var ptr;(.*)document"
    code = re.search(regex, script.text, flags=re.DOTALL).groups()[0]
    # This is actually dangerous because anything
    # in the code will be executed here, but if
    # it's like your example everything will
    # work fine and you can access the value of ptr
    exec(code)
    print(ptr)

Now you can use either BeautifulSoup or re to parse ptr. If you don't know how it's structured, you can use this:

    mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
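
Since `exec` runs whatever the page contains, a safer variation is to rebuild ptr without executing anything. The following is a minimal sketch and not the method used above. `extract_mailto` and `var_name` are hypothetical names, and it assumes every assignment to the variable uses a double-quoted literal with no escaped quotes.

```python
import re


def extract_mailto(script_text, var_name="ptr"):
    """Rebuild a JS string variable from its assignments and return
    the first mailto address found in it, or None."""
    # Match `ptr = "...";` and `ptr += "...";` (double-quoted literals only).
    assignment = re.compile(
        r'\b' + re.escape(var_name) + r'\s*\+?=\s*"([^"]*)"\s*;'
    )
    # Concatenate the literals in source order, as the JavaScript would.
    value = "".join(assignment.findall(script_text))
    # The href value may or may not be quoted in the generated HTML.
    match = re.search(r'href=["\']?mailto:([^"\'>\s]+)', value)
    return match.group(1) if match else None


sample = '''
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
'''

print(extract_mailto(sample))  # example@email.com
```

This trades generality for safety: it will miss single-quoted literals or concatenations with other variables, but nothing from the scraped page is ever executed.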
