Extracting email addresses in an HTML block in Ruby/Rails
Question
I am creating a parser that guards against spamming and email harvesting in a block of text that comes from TinyMCE (so it may or may not contain HTML tags).
I've tried regexes, and so far this one has worked:
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The problem is that I need to ignore all email addresses that appear in mailto hrefs. For example:
<a href="mailto:test@mail.com">test@mail.com</a>
This should return only the second email address (the link text), not the one in the href.
For background on what I'm doing: I'm reversing the email addresses in the block, so the example above would end up looking like this:
<a href="mailto:test@mail.com">moc.liam@tset</a>
The problem with my current regex is that it also replaces the address in the href. Is there a way to do this with a single regex, or do I have to check for one and then the other? Can I do this with gsub alone, or do I have to use Nokogiri/Hpricot to parse the mailto links? Thanks in advance!
Here are my references:
so.com/questions/504860/extract-email-addresses-from-a-block-of-text
so.com/questions/1376149/regexp-for-extracting-a-mailto-address
I'm also testing using this:
edit
Here's my current helper code:
def email_obfuscator(text)
  # Wrap every address-like match in a span and reverse its characters.
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) do |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  end
end
This results in:
<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>
Recommended answer
Another option, if lookbehind doesn't work:
/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
This matches every email address, with or without a leading "mailto:". For each match you can then check whether the first captured group is "mailto:" and, if it is, skip that match.
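The check on the first captured group can be done inside a gsub block. The following is a minimal sketch of that idea, not a definitive implementation: it reuses the helper name email_obfuscator and the anti-spam span from the question, while the constant name EMAIL_WITH_OPTIONAL_MAILTO is made up for illustration. It assumes addresses to be skipped are always immediately preceded by "mailto:".

```ruby
# Hypothetical constant name. Group 1 is the optional "mailto:" prefix,
# group 2 is the email address itself.
EMAIL_WITH_OPTIONAL_MAILTO =
  /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

def email_obfuscator(text)
  text.gsub(EMAIL_WITH_OPTIONAL_MAILTO) do |match|
    prefix  = Regexp.last_match(1)
    address = Regexp.last_match(2)

    if prefix
      # The address follows "mailto:", so return the match unchanged.
      match
    else
      # A bare address: reverse it and wrap it in the anti-spam span.
      "<span class='anti-spam'>#{address.reverse}</span>"
    end
  end
end

html   = '<a href="mailto:test@mail.com">test@mail.com</a>'
result = email_obfuscator(html)
puts result
```

Returning the whole match from the block is what "skipping" means here: gsub still visits the mailto address but substitutes it with itself. An address placed in some other attribute without a "mailto:" prefix would still be reversed, so an HTML parser may be the safer route for markup that is less predictable.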