How to extract the uppercase as well as some substring from a pandas dataframe using extract?
Question
This question is a follow-up to my previous question, How to extract only uppercase substring from pandas series?.
Instead of changing the old question, I decided to ask a new one.
My aim is to extract the aggregation method agg and the feature name feat from a column named item.
Here is the problem:
import numpy as np
import pandas as pd
df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
regexp = (r'(?P<agg>) ' # agg is the word in uppercase (all other substring is lowercased)
r'(?P<feat>), ' # 1. if there is no uppercase, whole string is feat
# 2. if there is uppercase the substring after example. is feat
# e.g. cat ==> cat
# cat.N_MOST_COMMON(example.ord)[2] ==> ord
)
df[['agg','feat']] = df['item'].str.extract(regexp,expand=True)
# I am not sure how to build up regexp here.
print(df)
"""
Required output
   item                               agg            feat
0  num                                               num
1  bool                                              bool
2  cat                                               cat
3  cat.COUNT(example)                 COUNT                # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON  ord
5  cat.FIRST(example.ord)             FIRST          ord
6  cat.FIRST(example.num)             FIRST          num
""";
Recommended answer
For feat, since you already got the answer for agg in your other StackOverflow question, I think you can use the following to extract two different series based on two different patterns, separated with |, and then fillna() one series with the other.
- ^([^A-Z]*$) should only return the full string, and only if the full string contains no uppercase letters.
- [^a-z].*example\.([a-z]+)\).*$ should only return the string after example. and before ). Since the first pattern already matches every string without uppercase letters, this one only comes into play when the string contains an uppercase letter.
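To see why two capture groups and a fillna() are involved, it may help to run the same pattern through Python's re module on a few of the sample strings. This is only an illustrative sketch; the variable names pattern and samples are my own.

```python
import re

# Same pattern as in the answer, written as a raw string.
pattern = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$')

samples = ['cat', 'cat.FIRST(example.ord)', 'cat.COUNT(example)']
for s in samples:
    m = pattern.search(s)
    # groups() has one slot per capture group; the alternative that
    # did not take part in the match is reported as None.
    print(s, '->', m.groups() if m else None)

# cat -> ('cat', None)
# cat.FIRST(example.ord) -> (None, 'ord')
# cat.COUNT(example) -> None
```

Each string fills at most one of the two groups, which is why str.extract returns two columns that then have to be merged with fillna().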
df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]:
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num
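Since the question asks for both columns, the feat extraction can be combined with an agg extraction. The sketch below is only an illustration: the agg pattern ([A-Z][A-Z_]*) is an assumption of mine (the first run of uppercase letters and underscores), not necessarily the pattern from the linked question.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})

# Assumption: agg is the first run of uppercase letters and underscores.
agg_pattern = r'([A-Z][A-Z_]*)'
# feat pattern from the answer, written as a raw string.
feat_pattern = r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$'

# With a single capture group and expand=False, extract returns a Series.
df['agg'] = df['item'].str.extract(agg_pattern, expand=False).fillna('')

# Two capture groups give two columns (0 and 1); merge them into one.
s = df['item'].str.extract(feat_pattern, expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')

print(df)
```

On the sample data this should leave agg empty for the first three rows and feat empty for cat.COUNT(example), matching the required output in the question.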
The feat extraction above gives you the output you are looking for on your sample data and satisfies your conditions. However:
- What if there are uppercase letters after example.? The current pattern would return '' for that row.
See example #2 below, with some of the data changed according to the point above:
df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]:
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
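If uppercase feature names after example. should be captured as well, one possible adjustment is to widen the second capture group from [a-z]+ to [A-Za-z]+. This is a sketch under that assumption, not something the question asked for, and the shortened data frame is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'cat.count(example.AAA)',
                            'cat.count(example.aaa)', 'cat.FIRST(example.ord)']})

# Variant of the answer's pattern: [A-Za-z]+ instead of [a-z]+
# in the second capture group (an assumption, see the text above).
pattern = r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$'

s = df['item'].str.extract(pattern, expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')

print(df)
```

With this change cat.count(example.AAA) should yield AAA. Note that an all-lowercase item such as cat.count(example.aaa) is still returned whole, because the first alternative matches any string without uppercase letters.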