How to match text in a cell to regex and keep only the text which matches regex?


Problem description

What I am trying to do: I have a large Excel sheet with a lot of haphazard customer information. I want to sort the email addresses and other data into a set format in a new Excel file.

I can't quite figure out how to match the cell text (which will look something like an address and an email squished together) against the regex, and how to keep only the text that matches the regex in a list.

I would really appreciate some help. Thanks.

import sys, os, openpyxl
def sort_email_from_xl():
    sheet = sheet_select()   #Opens the worksheet
    emailRegex = re.compile(r'''([a-zA-Z0-9._%+-]+@+[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4}))''',re.VERBOSE)
    customeremails = []
    for row in range(0, max_row):
        if cell.text == emailRegex:
            mail = cell.text
            customeremails.append(mail)
            return customeremails
    print(customeremails)

Recommended answer

This code should work (though I could only test the regex part):

import re
import sys, os, openpyxl

def sort_email_from_xl():
    sheet = sheet_select()   # opens the worksheet (helper from the question, not shown)
    # Lazy .*? on both sides allows other text around the address
    emailRegex = re.compile(r".*?([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")
    customeremails = []
    for row in range(0, max_row):   # max_row and cell are left as in the question
        if emailRegex.match(cell.text):
            mail = emailRegex.match(cell.text).groups()[0]   # only the captured address
            cell.text = mail
            customeremails.append(mail)
    print(customeremails)

There were several problems with your code. First, the regex:

  • the regex did not allow any text around the email address; I added that with .*? at the start and end
  • you don't need the re.VERBOSE flag, as it is only needed if you want to add inline comments to your regex (see the docs)
  • your @+ allowed email addresses with several @ signs in a row
  • you matched the TLD in a separate group, which is unnecessary
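The regex points above can be checked with a small sketch. The sample strings are made up for illustration, and the question's pattern is reproduced without re.VERBOSE, which does not change its behavior here because the pattern contains no whitespace or comments.

```python
import re

# Pattern from the question: note the `@+`, which accepts repeated @ signs.
question_regex = re.compile(r'([a-zA-Z0-9._%+-]+@+[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4}))')
# Pattern from the answer: a single @, with lazy .*? to allow surrounding text.
answer_regex = re.compile(r'.*?([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?')

text = "12 Main Street jane.doe@example.com"  # made-up sample cell text

# match() anchors at the start of the string, so the question's pattern
# finds nothing when an address precedes the email.
print(question_regex.match(text))

# The leading .*? lets the answer's pattern skip over the address part.
print(answer_regex.match(text).group(1))

# The question's pattern accepts a malformed address with two @ signs,
# while the answer's pattern rejects it.
print(question_regex.match("a@@b.com") is not None)
print(answer_regex.match("a@@b.com") is not None)
```

An alternative to wrapping the pattern in .*? is to call search() instead of match(), since search() scans the whole string for the first occurrence.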

The email regex now works for basic usage, but I'd definitely recommend taking a proven email regex from other answers on Stack Overflow.

Then: emailRegex.match(cell.text) checks whether cell.text matches the regex, and emailRegex.match(cell.text).groups()[0] extracts only the matching part. Your code also had one return statement too many: the return inside the loop ended the function right after the first address was appended.
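As a minimal sketch of the check-then-extract step, the helper below calls match() once and reuses the result. The name extract_email and the sample strings are hypothetical, introduced only for this illustration.

```python
import re

EMAIL_REGEX = re.compile(r'.*?([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?')

def extract_email(text):
    """Return the first email-like substring of text, or None if there is none."""
    match = EMAIL_REGEX.match(text)   # call match() once and reuse the result
    if match is None:
        return None
    return match.groups()[0]          # group 1 holds only the address

# Made-up cell contents: address and email squished together, no email, trailing text.
cells = [
    "Acme Ltd, 5 High St, sales@acme.com",
    "no address here",
    "bob@site.org (billing)",
]
emails = [e for e in map(extract_email, cells) if e is not None]
print(emails)
```

Collecting the results in a list and returning or printing it only after the loop avoids the early exit caused by a return inside the loop.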

For some reason the above code gives me a NameError: name 'max_row' is not defined

You need to correct how the code loops through the rows, e.g. as documented here
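One way the row loop could look is sketched below, under assumptions: the function collect_emails and the sample data are hypothetical, and the function takes plain tuples of cell values so that it runs without a workbook. That is the shape openpyxl yields from sheet.iter_rows(values_only=True). Note that openpyxl cells expose their content as cell.value, so cell.text from the original code is treated here as a placeholder.

```python
import re

# Same character classes as the answer's pattern, without the .*? wrappers,
# because search() already scans the whole string.
EMAIL_REGEX = re.compile(r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}')

def collect_emails(rows):
    """Collect the first email-like substring from every cell value in rows.

    rows is any iterable of tuples of cell values.
    """
    emails = []
    for row in rows:
        for value in row:
            if not isinstance(value, str):     # skip empty cells and numbers
                continue
            match = EMAIL_REGEX.search(value)  # first occurrence anywhere in the text
            if match:
                emails.append(match.group(0))
    return emails

# Stand-in data shaped like iter_rows(values_only=True) output (made up).
sample_rows = [
    ("Jane Doe", "12 Main Street jane.doe@example.com", 42),
    ("Bob", None, "bob@site.org"),
]
print(collect_emails(sample_rows))
```

With a real workbook, the call would be along the lines of collect_emails(sheet.iter_rows(values_only=True)), where sheet is a worksheet obtained from openpyxl.load_workbook(...); the file name and sheet selection depend on your data.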
