如何从Pandas数据框中的字典中删除花括号,撇号和方括号(Python) [英] How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)

查看:97
本文介绍了如何从Pandas数据框中的字典中删除花括号,撇号和方括号(Python)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

csv文件中包含以下数据:

from StringIO import StringIO
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//PurpleCar': [219L], '//YellowCar': [381L], '//BlueCar': [90L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//PurpleCar': [817L], '//YellowCar': [21L], '//BlueCar': [31L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//PurpleCar': [80L], '//YellowCar': [2011L], '//BlueCar': [8888L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//PurpleCar': [32L], '//YellowCar': [15L], '//BlueCar': [4L], '//WhiteCar-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-17 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-18 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-19 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-20 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
"""

我将文件读入Pandas数据框中,如下所示:

df = pd.read_csv(StringIO(the_data), sep=',')

然后,我添加一些列标题,如下所示:

df.columns = ['Company',
                    'Date',
                    'Volume',
                    'Car1',
                    'Car2',
                    'Car3',
                    'Car4']

我看到数据正在通过以下方式传输:

ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]

但是,我想不使用以下任何一项数据:

a)字典开头的大括号("{")和字典结尾的大括号("}")

b)数值后的"L"

c)包围数值的方括号("[""]")

d)键周围的撇号

理想情况下,数据将按以下方式转换:

ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

我尝试过:

df['Car1'] = df['Car1'].str.strip(['{', '}', '[', 'L]'])

但是,它不起作用.这样会导致"Car1"列变为NaN值.

是否可以对数据帧进行转换,以使数据帧的每一行都读取如下?

ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

谢谢!

更新:

使用以下正则表达式:

df['Car1'] = df['Car1'].str.replace(r'\D+', '').astype('int')

结果:

ABC,2016-6-9 0:00,95, 115 , //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

我们丢失了'//PurpleCar',只留下了数值115.这是一个好的开始,但是如果我们也能看到'//PurpleCar'键,那就太好了.

有什么想法吗?


更新2:

基于piRSquared和HYRY的评论,我的目标是能够绘制数值结果.因此,我希望数据框如下所示:

   Company   Date            PurpleCar  YellowCar   BlueCar     WhiteCar      

0  ABC       2016-6-9 0:00   115        403         16          0
1  ABC       2016-6-10 0:00  219        381         90          0
2  ABC       2016-6-11 0:00  817        21          31          0
3  ABC       2016-6-12 0:00  80         2011        8888        0
4  ABC       2016-6-13 0:00  32         15          4           0
5  DEF       2016-6-16 0:00  32         15          4           0
6  DEF       2016-6-17 0:00  32         15          4           0
7  DEF       2016-6-18 0:00  32         15          4           0
8  DEF       2016-6-19 0:00  32         15          4           0
9  DEF       2016-6-20 0:00  32         15          4           0

*更新3:*

最初发布的数据有一个小错误.这是数据:

the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""

此数据与原始数据之间的区别是在大括号前("{")和大括号后("}")后的撇号(").

解决方案

该文件似乎实际上是转义的CSV,因此我们不需要为此部分进行自定义分析.

正如@Blckknght在注释中指出的那样,该文件不是有效的CSV.我会在回答中做一些假设.他们是

  1. 您无法控制数据,因此无法正确转义逗号.
  2. 前三列将不包含任何逗号.
  3. 第三列遵循python dict的语法.
  4. 列表中总是有一个值位于dict值中.

首先,一些进口商品

import ast
import pandas as pd

我们将用逗号分隔行,因为我们不需要处理任何类型的CSV转义(假设#1和#2).

rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")

fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])

fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])

前三列是固定的,我们将其保留不变.我们可以用ast.literal_eval解析的最后一列,因为它是dict(假设#3).如果格式比正则表达式更改,这将使IMO更具可读性和灵活性.另外,您还会更早地检测到格式更改.

cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
del fixed_columns["Cars_str"]

这部分相当您的其他问题.

我们准备了函数来处理字典的键和值,因此,如果我们对字典内容的假设失败,它们将失败.

def get_single_item(list_that_always_has_single_item):
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    assert car_str.startswith("//"), car_str
    return car_str[2:]

我们应用函数并构造pd.Series,使我们能够...

dynamic_columns = cars.apply(
    lambda x: pd.Series({
            extract_car_name(k): get_single_item(v) 
            for k, v in x.items()
    }))    

...将列添加到数据框

result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result

最后,我们得到了表格:

  Company            Date Value  BlackCar  BlueCar  NPO-GreenCar  PinkCar  \
0     ABC   2016-6-9 0:00    95       NaN     16.0           NaN      NaN   
1     ABC  2016-6-10 0:00     0       NaN     90.0           NaN      NaN   
2     ABC  2016-6-11 0:00     0       NaN     31.0           NaN      NaN   
3     ABC  2016-6-12 0:00     0       NaN   8888.0           NaN      NaN   
4     ABC  2016-6-13 0:00     0       NaN      4.0           NaN      NaN   
5     DEF  2016-6-16 0:00     0      15.0      NaN           0.0      4.0   
6     DEF  2016-6-17 0:00     0      15.0      NaN           0.0      4.0   
7     DEF  2016-6-18 0:00     0      15.0      NaN           0.0      4.0   
8     DEF  2016-6-19 0:00     0      15.0      NaN           0.0      4.0   
9     DEF  2016-6-20 0:00     0      15.0      NaN           0.0      4.0   

   PurpleCar  WhiteCar-XYZ  YellowCar  
0      115.0           0.0      403.0  
1      219.0           0.0      381.0  
2      817.0           0.0       21.0  
3       80.0           0.0     2011.0  
4       32.0           0.0       15.0  
5       32.0           NaN        NaN  
6       32.0           NaN        NaN  
7       32.0           NaN        NaN  
8       32.0           NaN        NaN  
9       32.0           NaN        NaN  

I have the following data in a csv file:

from StringIO import StringIO
import pandas as pd

the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//PurpleCar': [219L], '//YellowCar': [381L], '//BlueCar': [90L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//PurpleCar': [817L], '//YellowCar': [21L], '//BlueCar': [31L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//PurpleCar': [80L], '//YellowCar': [2011L], '//BlueCar': [8888L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//PurpleCar': [32L], '//YellowCar': [15L], '//BlueCar': [4L], '//WhiteCar-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-17 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-18 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-19 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-20 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
"""

I read the file into a Pandas data frame, as follows:

df = pd.read_csv(StringIO(the_data), sep=',')

Then, I add a few column headers, as follows:

df.columns = ['Company',
                    'Date',
                    'Volume',
                    'Car1',
                    'Car2',
                    'Car3',
                    'Car4']

I see that the data is coming through as follows:

ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]

But, I'd like to see the data without any of the following:

a) the curly braces ("{") at the beginning and the curly brace ("}") at the end of the dictionary

b) the "L" after the numerical values

c) the square brackets ("[" and "]") surrounding the numerical values

d) the apostrophes surrounding the keys

Ideally, the data would be transformed as follows:

ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

I tried this:

df['Car1'] = df['Car1'].str.strip(['{', '}', '[', 'L]'])

But, it doesn't work. It results in the 'Car1' column becoming NaN values.

Is it possible to transform the data frame such that each row of the data frame reads as follows?

ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

Thanks!

UPDATE:

Using the following regular expression:

df['Car1'] = df['Car1'].str.replace(r'\D+', '').astype('int')

Results in this:

ABC,2016-6-9 0:00,95, 115 , //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0

We lose '//PurpleCar' and are left with only the numeric value of 115. That's a good start, but it would be great if we can see the '//PurpleCar' key, too.

Any ideas?


UPDATE 2:

Based on the comments by piRSquared and HYRY, my goal is to be able to plot the numerical results. So, I would like to have the data frame look as follows:

   Company   Date            PurpleCar  YellowCar   BlueCar     WhiteCar      

0  ABC       2016-6-9 0:00   115        403         16          0
1  ABC       2016-6-10 0:00  219        381         90          0
2  ABC       2016-6-11 0:00  817        21          31          0
3  ABC       2016-6-12 0:00  80         2011        8888        0
4  ABC       2016-6-13 0:00  32         15          4           0
5  DEF       2016-6-16 0:00  32         15          4           0
6  DEF       2016-6-17 0:00  32         15          4           0
7  DEF       2016-6-18 0:00  32         15          4           0
8  DEF       2016-6-19 0:00  32         15          4           0
9  DEF       2016-6-20 0:00  32         15          4           0

* UPDATE 3: *

The data originally posted had a small mistake. Here is the data:

the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""

The difference between this data and the original data is the apostrophes (") before the opening curly brace ("{") and after the closing curly brace ("}").

解决方案

Edit: The file seems to be actually an escaped CSV so we don't need a custom parsing for this part.

As @Blckknght points out in the comment, the file is not a valid CSV. I'll make some assumptions in my answer. They are

  1. You don't control the data and thus can't properly escape the commas.
  2. The first three columns won't contain any comma.
  3. The third column follows the syntax of a python dict.
  4. There is always one value in the list which is in the dict values.

First, some imports

import ast
import pandas as pd

We'll just split the rows by commas as we don't need to deal with any sort of CSV escaping (assumptions #1 and #2).

rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")

fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])

fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])

The first three columns are fixed and we leave them as they are. The last column we can parse with ast.literal_eval because it's a dict (assumption #3). This is IMO more readable and more flexible if the format changes than regex. Also you'll detect the format change earlier.

cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
del fixed_columns["Cars_str"]

And this part answers rather your other question.

We prepare functions to process the keys and values of the dict so they fail if our assumptions about content of the dict fail.

def get_single_item(list_that_always_has_single_item):
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    assert car_str.startswith("//"), car_str
    return car_str[2:]

We apply the functions and construct pd.Series which allow us to...

dynamic_columns = cars.apply(
    lambda x: pd.Series({
            extract_car_name(k): get_single_item(v) 
            for k, v in x.items()
    }))    

...add the columns to the dataframe

result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result

Finally, we get the table:

  Company            Date Value  BlackCar  BlueCar  NPO-GreenCar  PinkCar  \
0     ABC   2016-6-9 0:00    95       NaN     16.0           NaN      NaN   
1     ABC  2016-6-10 0:00     0       NaN     90.0           NaN      NaN   
2     ABC  2016-6-11 0:00     0       NaN     31.0           NaN      NaN   
3     ABC  2016-6-12 0:00     0       NaN   8888.0           NaN      NaN   
4     ABC  2016-6-13 0:00     0       NaN      4.0           NaN      NaN   
5     DEF  2016-6-16 0:00     0      15.0      NaN           0.0      4.0   
6     DEF  2016-6-17 0:00     0      15.0      NaN           0.0      4.0   
7     DEF  2016-6-18 0:00     0      15.0      NaN           0.0      4.0   
8     DEF  2016-6-19 0:00     0      15.0      NaN           0.0      4.0   
9     DEF  2016-6-20 0:00     0      15.0      NaN           0.0      4.0   

   PurpleCar  WhiteCar-XYZ  YellowCar  
0      115.0           0.0      403.0  
1      219.0           0.0      381.0  
2      817.0           0.0       21.0  
3       80.0           0.0     2011.0  
4       32.0           0.0       15.0  
5       32.0           NaN        NaN  
6       32.0           NaN        NaN  
7       32.0           NaN        NaN  
8       32.0           NaN        NaN  
9       32.0           NaN        NaN  

这篇关于如何从Pandas数据框中的字典中删除花括号,撇号和方括号(Python)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆