Python parse dataframe element


Question

I have a pandas dataframe column (Data Type) which I want to split into three columns.

target_table_df = LoadS_A[['Attribute Name',
                           'Data Type',
                           'Primary Key Indicator']]

Example input (target_table_df)

                 Attribute Name      Data Type Primary Key Indicator
0                       ACC_LIM  DECIMAL(18,4)                 False
1                        ACC_NO   NUMBER(11,0)                 False
2                   ACC_OPEN_DT           DATE                 False
3                          ACCB  DECIMAL(18,4)                 False
4                          ACDB  DECIMAL(18,4)                 False
5                     AGRMNT_ID   NUMBER(11,0)                  True
6                     BRNCH_NUM   NUMBER(11,0)                 False
7                      CLRD_BAL  DECIMAL(18,4)                 False
8              CR_INT_ACRD_GRSS  DECIMAL(18,4)                 False
9               CR_INT_ACRD_NET  DECIMAL(18,4)                 False

I aim to:

  • Reassign 'Data Type' to the text preceding the parenthesis

[..if parenthesis exists in 'Data Type']:

  • Create new column 'Precision' and assign to first comma separated value
  • Create new column 'Scale' and assign to second comma separated value

Intended output would therefore become:

  Data Type Precision Scale
0   decimal        18     4
1    number        11     0
2      date
3   decimal        18     4
4   decimal        18     4
5    number        11     0

I have tried in anger to achieve this, but I'm new to dataframes and can't work out whether I need to iterate over all rows or whether there is a way to apply this to all values in the dataframe.

Any help much appreciated

Answer

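Use target_table_df['Data Type'].str.extract(pattern). It applies the regular expression to every value in the column at once, so there is no need to iterate over the rows.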

You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'

([^\(]+) says to grab as many characters as possible that are not an open parenthesis, stopping at the first open parenthesis.

\(([^,]*), says to grab the first run of non-comma characters after the open parenthesis and stop at the comma.

,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.

(\(([^,]*),(.*)\))? says the whole parenthesised part may not be there at all; it is captured only when it is present.
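To see what each group captures, here is a minimal sketch that runs the same pattern through Python's standard re module on three of the sample values from the question. The helper name split_type is hypothetical and only used for illustration; it is not part of the original answer.

```python
import re

# Same pattern as above. Group 2 is the whole optional "(precision,scale)" part,
# groups 3 and 4 are the values inside it.
pattern = re.compile(r'([^\(]+)(\(([^,]*),(.*)\))?')


def split_type(text):
    """Return (type name, precision, scale); the last two are None when absent."""
    match = pattern.match(text)
    name, _whole_paren, precision, scale = match.groups()
    return name, precision, scale


for sample in ['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE']:
    print(sample, '->', split_type(sample))
```

Group 2 is discarded here for the same reason the pandas version drops column 1: it only repeats what groups 3 and 4 already hold.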

Solution

Everything together looks like this:

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
# s is the 'Data Type' column, lowercased to match the intended output
s = target_table_df['Data Type'].str.lower()
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]

# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)

I put .iloc[:, [0, 2, 3]] at the end because the pattern captures the whole parenthesised part in column 1, and I wanted to skip it. Leave it off and see. The first six rows of the result:

  Data Type Precision Scale
0   decimal        18     4
1    number        11     0
2      date       NaN   NaN
3   decimal        18     4
4   decimal        18     4
5    number        11     0
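As a variation on the approach above, the following sketch avoids the .iloc step by making the parenthesised part a non-capturing group and naming the remaining groups. It assumes a small stand-in Series rather than the asker's LoadS_A data. The group names, the lowercasing, and the pd.to_numeric conversion are additions for illustration and are not part of the original answer.

```python
import pandas as pd

# Small stand-in for target_table_df['Data Type'] from the question.
s = pd.Series(['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE'], name='Data Type')

# (?: ... )? makes the parenthesised part optional without capturing it,
# so only the three named groups become columns of the result.
pattern = r'(?P<dtype_name>[^(]+)(?:\((?P<precision>[^,]*),(?P<scale>.*)\))?'

parts = s.str.extract(pattern, expand=True)
parts.columns = ['Data Type', 'Precision', 'Scale']

# Lowercase the type name, as in the intended output of the question.
parts['Data Type'] = parts['Data Type'].str.lower()

# extract returns strings; convert the numeric parts, keeping NaN where absent.
for col in ['Precision', 'Scale']:
    parts[col] = pd.to_numeric(parts[col])

print(parts)
```

Converting to numbers is optional. It matters only if Precision and Scale are later used in comparisons or arithmetic, since str.extract always returns text.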
