How to extract the uppercase part as well as a substring from a pandas DataFrame using extract?


Problem description

This is a follow-up to my previous question, How to extract only uppercase substring from pandas series?

Instead of changing the old question, I decided to ask a new one.

My aim is to extract the aggregation method agg and the feature name feat from a column named item.

Here is the question:


import numpy as np
import pandas as pd


df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})


regexp = (r'(?P<agg>) '     # agg is the word in uppercase (all other substring is lowercased)
         r'(?P<feat>), '   # 1. if there is no uppercase, whole string is feat
                           # 2. if there is uppercase the substring after example. is feat
                           # e.g. cat ==> cat
                           # cat.N_MOST_COMMON(example.ord)[2] ==> ord
                  
        )

df[['agg','feat']] = df['item'].str.extract(regexp,expand=True)  # the column is named 'item', not 'col'

# I am not sure how to build up regexp here.


print(df)

"""
Required output


                                item   agg               feat
0                                num                     num
1                               bool                     bool
2                                cat                     cat
3                 cat.COUNT(example)   COUNT                           # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]   N_MOST_COMMON     ord
5             cat.FIRST(example.ord)   FIRST             ord
6             cat.FIRST(example.num)   FIRST             num
""";

Recommended answer

For feat, since you already got an answer for agg in your other StackOverflow question, I think you can use the following: extract two different series from two different patterns separated with |, and then fillna() one series with the other.

  1. ^([^A-Z]*$) should only return the full string if the full string contains no uppercase letters
  2. [^a-z].*example\.([a-z]+)\).*$ should only return the string after example. and before ), and only if there is uppercase in the string prior to example.
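
To see which of the two alternatives fires for a given string, here is a minimal sketch using Python's standard re module instead of pandas. The helper name feat_groups is hypothetical and exists only for illustration; the pattern is the one described above.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$')

def feat_groups(text):
    """Return (group 1, group 2) of the first match, or (None, None) if nothing matches."""
    m = PATTERN.search(text)
    return m.groups() if m else (None, None)

for text in ['cat', 'cat.COUNT(example)', 'cat.FIRST(example.ord)']:
    print(text, feat_groups(text))
```

Only one group is filled per match, which is why the two resulting columns can be merged with fillna(). As far as I know, str.extract takes the groups of the first match in each row in the same way.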


df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)  # raw string avoids invalid-escape warnings
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]: 
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)      
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num
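
For completeness, here is a sketch that fills both columns from the question. The agg pattern is an assumption on my part (the answer to the earlier question is not reproduced here): it treats agg as an uppercase letter followed by one or more uppercase letters or underscores.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})

# Assumed agg pattern: an uppercase letter followed by uppercase letters or underscores.
agg = df['item'].str.extract(r'([A-Z][A-Z_]+)', expand=False)

# feat pattern from the answer: column 0 is the all-lowercase case,
# column 1 is the part between "example." and ")".
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)

df['agg'] = agg.fillna('')
df['feat'] = s[0].fillna(s[1]).fillna('')
print(df)
```

Keeping the two extractions separate makes it easy to swap in a different agg pattern without touching the feat logic.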

The above gives the output you are looking for on your sample data and satisfies your conditions. However:

  1. What if there is uppercase after example.? The current pattern would return ''

See example #2 below, with some of the data changed to illustrate this point:

df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)  # raw string avoids invalid-escape warnings
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]: 
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)                        
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
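
If uppercase after example. should be captured rather than dropped, one possible adjustment is to widen the second character class to [A-Za-z]. This is only a sketch: whether 'AAA' is the desired feat for that row is an assumption, since the question does not say.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'cat.count(example.AAA)', 'cat.count(example.aaa)',
                            'cat.count(example)', 'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})

# Only change from the pattern above: [a-z]+ becomes [A-Za-z]+ in the second alternative.
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
print(df)
```

Rows without any uppercase still go through the first alternative, so 'cat.count(example.aaa)' and 'cat.count(example)' keep returning the whole string.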
