Applying regex to a pandas dataframe
Problem description
I'm having trouble applying a regex function to a column in a Python dataframe. Here is the head of my dataframe:
               Name   Season          School   G    MP  FGA  3P  3PA    3P%
74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625
84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455
176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000
177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300
243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448
I thought I had a pretty good grasp of applying functions to Dataframes, so maybe my Regex skills are lacking.
Here is what I put together:
import re
def split_it(year):
    return re.findall(r'(\d\d\d\d)', year)
df['Season2'] = df['Season'].apply(split_it(x))
TypeError: expected string or buffer
The output should be a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.
Thanks for any help in advance.
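Recommended answer
When I try (a variant of) your code I get NameError: name 'x' is not defined, which is accurate: x isn't defined. Writing apply(split_it(x)) calls split_it immediately and hands its result to apply, whereas apply expects the function itself.
You could use either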
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here). Your function will return a list, though:
>>> df["Season"].apply(split_it)
74 [1982]
84 [1982]
176 [1982]
177 [1983]
243 [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I'd use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
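If you do want to stay with a regex, the two ideas above (pass the function itself to apply, and return a scalar rather than a list) can be sketched as follows. This is a minimal illustration, not the asker's original code: the helper name first_year is hypothetical, the sample frame is rebuilt by hand from the Season column shown in the question, and Series.str.extract is swapped in as a vectorized regex alternative to the slicing and splitting shown above.

```python
import re

import pandas as pd

# Sample data copied by hand from the question's Season column.
df = pd.DataFrame(
    {"Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"]},
    index=[74, 84, 176, 177, 243],
)


def first_year(season):
    """Hypothetical helper: first four-digit run in `season` as an int, else None."""
    match = re.search(r"\d{4}", season)
    return int(match.group()) if match else None


# Pass the function object to apply; writing first_year(x) here would call it
# immediately with an undefined x instead of once per row.
by_apply = df["Season"].apply(first_year)

# Vectorized regex alternative: with a single capture group and expand=False,
# str.extract returns a Series of the captured strings.
by_extract = df["Season"].str.extract(r"(\d{4})", expand=False).astype(int)

print(by_apply.tolist())
print(by_extract.tolist())
```

Both approaches give one integer year per row. The str.extract version avoids a Python-level function call per element, which tends to matter on larger frames.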