JavaScript variable with HTML code: regex email matching
This Python script fails to output the email address example@email.com in this case.
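This was my previous post: "How to output an email address from a JavaScript variable on a website using BeautifulSoup or Slimit".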
#!/usr/bin/env python
from bs4 import BeautifulSoup
import re
soup = '''
<script LANGUAGE="JavaScript">
function something()
{
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
}
</script>
'''
soup = BeautifulSoup(soup)
for script in soup.find_all('script'):
    reg = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)'
    reg2 = r'mailto:.*'
    secondHalf = re.search(reg, script.text)
    firstHalf = re.search(reg2, script.text)
    secondHalfEmail = secondHalf.group()
    firstHalfEmail = firstHalf.group()
    firstHalfEmail = firstHalfEmail.replace('mailto:', '')
    firstHalfEmail = firstHalfEmail.replace('";', '')
    if firstHalfEmail == secondHalfEmail:
        email = secondHalfEmail
    else:
        if ('>') not in firstHalfEmail:
            if ('>') not in secondHalfEmail:
                if firstHalfEmail != secondHalfEmail:
                    email = firstHalfEmail + secondHalfEmail
            else:
                email = firstHalfEmail
        else:
            email = secondHalfEmail
    print(email)
It would be nice if someone can help me.
Thank you
Your problem is that you can't find "mailto" in your text, because the first half, "mail", is not on the same line as the second half, "to". To solve your problem properly, you only have to know the value of ptr at the end of this program.
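To see why the search comes up empty, here is a minimal sketch using only the standard re module. The name script_text is just an illustration (it is not in the original script) and holds the JavaScript from the question as plain text.

```python
import re

# JavaScript from the question, held as plain text (hypothetical variable name).
script_text = '''
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
'''

# "mail" ends one string literal and "to:" starts the next,
# so the contiguous text "mailto:" never occurs in the source.
first_half = re.search(r'mailto:.*', script_text)
print(first_half)  # None

# The address itself sits inside a single literal, so a plain email pattern matches.
second_half = re.search(r'\w+@\w+(?:\.\w+)+', script_text)
print(second_half.group())  # example@email.com
```

Because the first search returns None, calling .group() on it, as the script in the question does, raises an AttributeError before any email is printed.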
I know that this is a bad way to do it, but if you are sure that the structure is always like this:
soup = """
<script LANGUAGE="JavaScript"> function ...()
{ var ptr;
ptr = "";
ptr += "...";
ptr += "...";
ptr += "...";
document.all.something.innerHTML = ptr;
}
</script>
"""
You can use this:
soup = BeautifulSoup(soup, 'html.parser')
for script in soup.find_all('script'):
    # This matches everything between "{ var ptr;"
    # and "document"
    regex = r"\{ var ptr;(.*)document"
    code = re.search(regex, script.text, flags=re.DOTALL).group(1)
    # This is actually dangerous because anything
    # in the code will be executed here, but if
    # it's like your example everything will
    # work fine and you can access the value of ptr
    namespace = {}
    # Run the extracted assignments in their own namespace
    exec(code, namespace)
    ptr = namespace['ptr']
    print(ptr)
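If running exec on scraped text is a concern, one alternative is to pull out the string literals and join them yourself. This is only a sketch: it assumes every assignment to ptr is a double-quoted literal with no escaped quotes inside, as in the example, and the names script_text and rebuild_ptr are made up for illustration.

```python
import re

def rebuild_ptr(script_text):
    """Concatenate the double-quoted literals assigned to ptr, in order."""
    # Matches `ptr = "..."` and `ptr += "..."`; assumes no escaped quotes inside.
    pieces = re.findall(r'ptr\s*\+?=\s*"([^"]*)"', script_text)
    return ''.join(pieces)

# Script body from the question, held as plain text (hypothetical variable name).
script_text = '''
var ptr;
ptr = "";
ptr += "<table><td class=france></td></table>";
ptr += "<table><td class=france><a href=mail";
ptr += "to:example@email.com>email</a></td></table>";
document.all.something.innerHTML = ptr;
'''

ptr = rebuild_ptr(script_text)
print(ptr)
```

Nothing from the page is executed this way, at the cost of only handling the simple literal-concatenation pattern shown here.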
Now you can use either BeautifulSoup or re to parse ptr. If you don't know how it's structured, you can use this:
mail = re.search("<a href=mailto:(.*?)>", ptr).groups()[0]
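As a small self-contained check of that last step, the sketch below applies the same pattern to the value ptr should hold for the example in the question. The string is written out by hand here, which is an assumption for the sake of the demo. It also guards against the case where no link is found.

```python
import re

# Value ptr should hold after the concatenations in the question's script
# (typed out by hand for this demo).
ptr = ('<table><td class=france></td></table>'
       '<table><td class=france><a href=mailto:example@email.com>email</a></td></table>')

match = re.search(r'<a href=mailto:(.*?)>', ptr)
# re.search returns None when nothing matches, so check before calling group().
mail = match.group(1) if match else None
print(mail)  # example@email.com
```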