How to match text in a cell to regex and keep only the text which matches regex?
Problem description
What I am trying to do: There is a large Excel sheet with a lot of haphazard customer information. I want to sort the email addresses and other data into a set format in a new Excel file.
I can't quite figure out how to match the cell text (which will have some format like Address and Email squished together, and similar) against the regex and keep only the matching data in a list.
I would really appreciate some help. Thanks.
import sys, os, openpyxl

def sort_email_from_xl():
    sheet = sheet_select() #Opens the worksheet
    emailRegex = re.compile(r'''([a-zA-Z0-9._%+-]+@+[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4}))''',re.VERBOSE)
    customeremails = []
    for row in range(0, max_row):
        if cell.text == emailRegex:
            mail = cell.text
            customeremails.append(mail)
    return customeremails
    print(customeremails)
Recommended answer
This code should work (I could only test the regex part, though):
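import re
import sys, os, openpyxl

def sort_email_from_xl():
    sheet = sheet_select() #Opens the worksheet (the asker's own helper, not shown)
    # Raw string, so the backslashes reach the regex engine unchanged
    emailRegex = re.compile(r".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")
    customeremails = []
    # Note: max_row and cell are not defined in this snippet; the row loop still has to be corrected
    for row in range(0, max_row):
        if emailRegex.match(cell.text):
            # Keep only the captured email address, not the surrounding text
            mail = emailRegex.match(cell.text).groups()[0]
            cell.text = mail
            customeremails.append(mail)
    print(customeremails)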
There were many problems with your code. First, about the regex:
- the regex was not allowing text around your email address; I added that with .*? at the start and end
- you don't need the re.VERBOSE part, as you'd only need it if you want to add inline comments to your regex (see the docs)
- you allowed email addresses with many @ in between
- you matched the TLD separately, which is unneeded
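The points in the list above can be checked directly with the two patterns. The following is a minimal sketch: the sample string and the variable names are made up for illustration, and the last pattern, which uses re.search instead of wrapping the expression in .*?, is an alternative that does not appear in the answer.

```python
import re

# The asker's pattern: "@+" accepts repeated "@" signs, and there is no
# allowance for text before the address when used with match().
original = re.compile(r"([a-zA-Z0-9._%+-]+@+[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4}))")

# The answer's pattern: a single "@", wrapped in lazy ".*?" so that match(),
# which always starts at position 0, can skip over leading text.
answer = re.compile(r".*?([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")

# Alternative (an assumption, not from the answer): no ".*?" wrapper,
# because search() scans the whole string for the first match.
simple = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")

text = "12 Main St john.doe@example.com 555-1234"  # hypothetical cell content

print(original.match(text))                  # None: the text does not start with an address
print(original.search("a@@b.com").group(1))  # the doubled "@" is accepted
print(answer.match(text).group(1))           # only the captured address
print(simple.search(text).group(0))          # same address via search()
```

Comparing a string to a compiled pattern with ==, as in the question, is always False; a match has to be requested through a method such as match() or search().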
Now the email regex works for basic usage, but I'd definitely recommend taking a proven email regex from other answers on Stack Overflow.
Then: with emailRegex.match(cell.text) you can check whether cell.text matches your regex, and with emailRegex.match(cell.text).groups()[0] you extract only the matching part. You also had one return statement too many.
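The check-then-extract step can be wrapped in a small helper that calls match() only once and reuses the result. This is a sketch under assumptions: the function name extract_email and the sample strings are hypothetical, and the pattern is the one from the answer.

```python
import re

# Pattern from the answer, written as a raw string.
EMAIL_RE = re.compile(r".*?([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")

def extract_email(text):
    """Return the first email-like substring of text, or None if there is none."""
    # Spreadsheet cells can hold numbers, dates or nothing at all.
    if not isinstance(text, str):
        return None
    found = EMAIL_RE.match(text)  # a match object, or None when nothing matches
    return found.group(1) if found else None

print(extract_email("123 Main St jane@example.org"))
print(extract_email("no address here"))
```

Because . does not match a newline by default, a cell whose address sits on a second line would not be matched by this pattern with match(); re.search or the re.DOTALL flag are possible ways around that.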
For some reason the above code is giving me a NameError: name 'max_row' is not defined
You need to correct the loop over the rows, e.g. as documented here
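One possible way to loop over the rows is sketched below. It is not the code the answer links to: the function names collect_emails and emails_from_workbook are hypothetical, the pattern is a simplified variant used with findall, and the workbook is assumed to keep its data on the active sheet. In openpyxl a cell's content is read through its value (here via iter_rows(values_only=True)), not through a text attribute.

```python
import re

# Simplified pattern without the ".*?" wrapper, since findall() scans the whole string.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")

def collect_emails(rows):
    """Collect every email-like substring from an iterable of rows of cell values."""
    emails = []
    for row in rows:
        for value in row:
            # Skip empty cells, numbers and dates.
            if isinstance(value, str):
                emails.extend(EMAIL_RE.findall(value))
    return emails

def emails_from_workbook(path):
    """Read the active sheet of the workbook at path (requires openpyxl)."""
    import openpyxl  # third-party package; imported here so collect_emails works without it
    workbook = openpyxl.load_workbook(path, read_only=True)
    sheet = workbook.active
    # values_only=True yields plain tuples of cell values instead of cell objects.
    return collect_emails(sheet.iter_rows(values_only=True))

# Usage with in-memory rows standing in for a worksheet:
sample_rows = [("Acme", "1 Road bob@acme.com"), (None, 5), ("x@y.io and z@w.net",)]
print(collect_emails(sample_rows))
```

Keeping the row iteration separate from the regex work means the extraction can be tried on plain tuples before pointing it at a real file.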