Regex to compare and extract alphabet characters using Python


Question

Hi, I have a dataset as shown below:

Format,Message,time
A,ab@1 yl@5 rd@20 pp@40,3
B,bc@1 gn@7 yl@20 ss@25 rd@50, 21
C,cc@1 yl@9 rd@20, 22

I would like to extract the numbers attached to yl and rd in Message (e.g. yl@5 ---> 5) and compare them with the number in the time column. For row 1, the time 3 is compared with 5 and 20. If the time is less than both numbers, the status g is assigned. If the time were 7 (at or above the yl number but below the rd number), y would be assigned, and likewise if it is 20 or above (at or above the rd number), r is assigned.

So the result would look like:

Format,Message,time,status
A,ab@1 yl@5 rd@20 pp@40,3,g
B,bc@1 gn@7 yl@20 ss@25 rd@50,21,y
C,cc@1 yl@9 rd@20,22,r

Recommended answer

Your question is really a number of questions. From the 'dataframe' tag, it appears you're doing this using pandas. The regular expression you're asking about could extract the numbers for 'yl' and 'rd' (I'm assuming they are always there). But a regular expression typically doesn't do math or numerical comparisons, so that's a separate, third part.

A regular expression to match the numerical value for 'yl' (assuming an integer, not a float):

r'yl@(\d+)'
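
As a minimal sketch of how that pattern might be applied to one of the sample messages (the variable names here are illustrative, not from the question):

```python
import re

message = 'ab@1 yl@5 rd@20 pp@40'

# The parentheses capture only the digits that follow 'yl@'.
match = re.search(r'yl@(\d+)', message)

# group(1) is a string, so convert it before any numeric comparison.
yl_value = int(match.group(1)) if match else None
print(yl_value)
```
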

You could extract both in a single expression, but that would either assume they always appear in the same order or turn into a complicated regular expression.
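
To illustrate the trade-off, here is a rough sketch. The first pattern only works when yl comes before rd. The second approach is one possible order-independent alternative (my own suggestion, not part of the original answer) that collects every tag@number pair into a dictionary; it assumes tags consist of lowercase letters only.

```python
import re

message = 'bc@1 gn@7 yl@20 ss@25 rd@50'

# Single expression: relies on yl appearing somewhere before rd.
ordered = re.search(r'yl@(\d+).*rd@(\d+)', message)
yl, rd = (int(g) for g in ordered.groups())

# Order-independent alternative: map every tag to its number.
# Assumes tags are made of lowercase letters only.
pairs = {tag: int(num) for tag, num in re.findall(r'([a-z]+)@(\d+)', message)}

print(yl, rd)
print(pairs['yl'], pairs['rd'])
```
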

To ensure only yl@5 gets matched, but something like xyl@5 does not, you can add some restrictions to the start (require whitespace or start of line) and to the end (require whitespace or end of line):

r'(^|\s)yl@(\d+)($|\s)'
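
A small sketch of the stricter pattern in use. Note that the extra groups shift the numbering, so the digits are now in group 2. The helper name yl_number is made up for this example.

```python
import re

pattern = re.compile(r'(^|\s)yl@(\d+)($|\s)')


def yl_number(message):
    """Return the number after a standalone 'yl@', or None if absent."""
    m = pattern.search(message)
    # Group 1 is the leading boundary, group 2 holds the digits.
    return int(m.group(2)) if m else None


plain = yl_number('ab@1 yl@5 rd@20')      # standalone tag
prefixed = yl_number('ab@1 xyl@5 rd@20')  # 'xyl' should be rejected
alone = yl_number('yl@5')                 # start and end of line
print(plain, prefixed, alone)
```
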

Or, if you have situations where yl is namespaced, like a:yl, you can allow for that as well:

r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)'
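
The following sketch checks a namespaced variant against three inputs, assuming the tag in the pattern is meant to read yl@ and that a namespace is a run of lowercase letters followed by a colon. With the optional namespace group added, the digits move to group 3.

```python
import re

pattern = re.compile(r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)')

messages = [
    'ab@1 yl@5 rd@20',    # plain tag
    'ab@1 a:yl@5 rd@20',  # namespaced tag
    'ab@1 xyl@5 rd@20',   # different tag that merely ends in 'yl'
]

results = []
for message in messages:
    m = pattern.search(message)
    # Group 3 holds the digits; group 2 is the optional namespace.
    results.append(int(m.group(3)) if m else None)

print(results)
```
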

However, all of this is just building more specific expressions using the regular expression language. A very good tool for writing regexes that I enjoy using (no affiliation) is RegexBuddy, but there are pretty good online tools as well, like https://regex101.com/.

Used in a code example that basically does what you described:

import re
from pandas import DataFrame

df = DataFrame({
    'Format': ['A', 'B', 'C'],
    'Message': ['ab@1 yl@5 rd@20 pp@40', 'bc@1 gn@7 yl@20 ss@25 rd@50', 'cc@1 yl@9 rd@20'],
    'time': [3, 21, 22]
})


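def determine_status(row):
    def find(tag, message):
        # Capture the digits that follow '<tag>@' in the message.
        match = re.search(rf"{tag}@(\d+)", message)
        if match:
            return match.group(1)
        else:
            raise ValueError(f'{tag} not in message.')

    # The captured values are strings, so convert them before comparing.
    yl = int(find('yl', row['Message']))
    rd = int(find('rd', row['Message']))

    time = int(row['time'])
    # Below both thresholds: 'g'.
    if time < yl < rd:
        return 'g'
    # At or above yl but still below rd: 'y'.
    if yl <= time < rd:
        return 'y'
    # Everything else, including time at or above rd: 'r'.
    return 'r'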


df['status'] = df.apply(determine_status, axis=1)

print(df)

The find function takes a tag and a message and uses a regular expression to return the numerical value for that tag in the message (as a string, which the caller converts with int).

The determine_status function does just that: it expects a row from a DataFrame, uses the Message and time columns to determine a status, and returns it.

df.apply is then used to create a new status column and fill it with the result of determine_status for every row in the DataFrame.
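
As an alternative to a row-wise df.apply, the same rules could be expressed column-wise. This is only a sketch of a different technique (pandas' Series.str.extract plus Series.mask), not the approach used above, and it assumes every message contains both tags; a missing tag would produce NaN and make astype(int) fail.

```python
import pandas as pd

df = pd.DataFrame({
    'Format': ['A', 'B', 'C'],
    'Message': ['ab@1 yl@5 rd@20 pp@40', 'bc@1 gn@7 yl@20 ss@25 rd@50', 'cc@1 yl@9 rd@20'],
    'time': [3, 21, 22]
})

# One capture group with expand=False yields a Series of strings.
yl = df['Message'].str.extract(r'(?:^|\s)yl@(\d+)', expand=False).astype(int)
rd = df['Message'].str.extract(r'(?:^|\s)rd@(\d+)', expand=False).astype(int)

# Start with 'r' everywhere, then overwrite rows that meet the other rules.
status = pd.Series('r', index=df.index)
status = status.mask((yl <= df['time']) & (df['time'] < rd), 'y')
status = status.mask((df['time'] < yl) & (yl < rd), 'g')

df['status'] = status
print(df)
```

For a three-row frame the difference is negligible; the column-wise form mainly matters for larger datasets, where avoiding a Python-level function call per row tends to be faster.
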

You didn't specify what version of Python you are using, but if it's a version before Python 3.6, you'll find that f-string expressions like f'{tag} not in message.' (and the rf"..." pattern in find) won't work. Instead you'd use something like '{tag} not in message.'.format(tag=tag).
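
For illustration, here is how the find helper might look without f-strings. The re.escape call is my own addition (an assumption that tags should be matched literally), not something the original code does.

```python
import re


def find(tag, message):
    """Return the digits following '<tag>@' as a string, without f-strings."""
    # str.format fills in {tag}; re.escape (an addition) keeps any
    # special characters in the tag from being read as regex syntax.
    pattern = r'{tag}@(\d+)'.format(tag=re.escape(tag))
    match = re.search(pattern, message)
    if match:
        return match.group(1)
    raise ValueError('{tag} not in message.'.format(tag=tag))


found = find('yl', 'ab@1 yl@5 rd@20 pp@40')
print(found)
```
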
