如何从Pandas数据框中的字典中删除花括号,撇号和方括号(Python) [英] How to remove curly braces, apostrophes and square brackets from dictionaries in a Pandas dataframe (Python)
问题描述
csv文件中包含以下数据:
from StringIO import StringIO
import pandas as pd
the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//PurpleCar': [219L], '//YellowCar': [381L], '//BlueCar': [90L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//PurpleCar': [817L], '//YellowCar': [21L], '//BlueCar': [31L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//PurpleCar': [80L], '//YellowCar': [2011L], '//BlueCar': [8888L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//PurpleCar': [32L], '//YellowCar': [15L], '//BlueCar': [4L], '//WhiteCar-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-17 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-18 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-19 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-20 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
"""
我将文件读入Pandas数据框中,如下所示:
df = pd.read_csv(StringIO(the_data), sep=',')
然后,我添加一些列标题,如下所示:
df.columns = ['Company',
'Date',
'Volume',
'Car1',
'Car2',
'Car3',
'Car4']
我看到数据正在通过以下方式传输:
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]
但是,我想不使用以下任何一项数据:
a)字典开头的大括号("{"
)和字典结尾的大括号("}"
)
b)数值后的"L"
c)包围数值的方括号("["
和"]"
)
d)键周围的撇号
理想情况下,数据将按以下方式转换:
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
我尝试过:
df['Car1'] = df['Car1'].str.strip(['{', '}', '[', 'L]'])
但是,它不起作用.这样会导致"Car1"列变为NaN值.
是否可以对数据帧进行转换,以使数据帧的每一行都读取如下?
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
谢谢!
更新:
使用以下正则表达式:
df['Car1'] = df['Car1'].str.replace(r'\D+', '').astype('int')
结果:
ABC,2016-6-9 0:00,95, 115 , //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
我们丢失了'//PurpleCar',只留下了数值115.这是一个好的开始,但是如果我们也能看到'//PurpleCar'键,那就太好了.
有什么想法吗?
更新2:
基于piRSquared和HYRY的评论,我的目标是能够绘制数值结果.因此,我希望数据框如下所示:
Company Date PurpleCar YellowCar BlueCar WhiteCar
0 ABC 2016-6-9 0:00 115 403 16 0
1 ABC 2016-6-10 0:00 219 381 90 0
2 ABC 2016-6-11 0:00 817 21 31 0
3 ABC 2016-6-12 0:00 80 2011 8888 0
4 ABC 2016-6-13 0:00 32 15 4 0
5 DEF 2016-6-16 0:00 32 15 4 0
6 DEF 2016-6-17 0:00 32 15 4 0
7 DEF 2016-6-18 0:00 32 15 4 0
8 DEF 2016-6-19 0:00 32 15 4 0
9 DEF 2016-6-20 0:00 32 15 4 0
*更新3:*
最初发布的数据有一个小错误.这是数据:
the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""
此数据与原始数据之间的区别是在大括号前("{"
)和大括号后("}"
)后的撇号(")
.
该文件似乎实际上是转义的CSV,因此我们不需要为此部分进行自定义分析.>
正如@Blckknght在注释中指出的那样,该文件不是有效的CSV.我会在回答中做一些假设.他们是
- 您无法控制数据,因此无法正确转义逗号.
- 前三列将不包含任何逗号.
- 第三列遵循python dict的语法.
- 列表中总是有一个值位于dict值中.
首先,一些进口商品
import ast
import pandas as pd
我们将用逗号分隔行,因为我们不需要处理任何类型的CSV转义(假设#1和#2).
rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")
fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])
fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])
前三列是固定的,我们将其保留不变.我们可以用ast.literal_eval
解析的最后一列,因为它是dict
(假设#3).如果格式比正则表达式更改,这将使IMO更具可读性和灵活性.另外,您还会更早地检测到格式更改.
cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
del fixed_columns["Cars_str"]
这部分相当您的其他问题.>
我们准备了函数来处理字典的键和值,因此,如果我们对字典内容的假设失败,它们将失败.
def get_single_item(list_that_always_has_single_item):
v, = list_that_always_has_single_item
return v
def extract_car_name(car_str):
assert car_str.startswith("//"), car_str
return car_str[2:]
我们应用函数并构造pd.Series
,使我们能够...
dynamic_columns = cars.apply(
lambda x: pd.Series({
extract_car_name(k): get_single_item(v)
for k, v in x.items()
}))
...将列添加到数据框
result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result
最后,我们得到了表格:
Company Date Value BlackCar BlueCar NPO-GreenCar PinkCar \
0 ABC 2016-6-9 0:00 95 NaN 16.0 NaN NaN
1 ABC 2016-6-10 0:00 0 NaN 90.0 NaN NaN
2 ABC 2016-6-11 0:00 0 NaN 31.0 NaN NaN
3 ABC 2016-6-12 0:00 0 NaN 8888.0 NaN NaN
4 ABC 2016-6-13 0:00 0 NaN 4.0 NaN NaN
5 DEF 2016-6-16 0:00 0 15.0 NaN 0.0 4.0
6 DEF 2016-6-17 0:00 0 15.0 NaN 0.0 4.0
7 DEF 2016-6-18 0:00 0 15.0 NaN 0.0 4.0
8 DEF 2016-6-19 0:00 0 15.0 NaN 0.0 4.0
9 DEF 2016-6-20 0:00 0 15.0 NaN 0.0 4.0
PurpleCar WhiteCar-XYZ YellowCar
0 115.0 0.0 403.0
1 219.0 0.0 381.0
2 817.0 0.0 21.0
3 80.0 0.0 2011.0
4 32.0 0.0 15.0
5 32.0 NaN NaN
6 32.0 NaN NaN
7 32.0 NaN NaN
8 32.0 NaN NaN
9 32.0 NaN NaN
I have the following data in a csv file:
from StringIO import StringIO
import pandas as pd
the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-10 0:00,0,{'//PurpleCar': [219L], '//YellowCar': [381L], '//BlueCar': [90L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-11 0:00,0,{'//PurpleCar': [817L], '//YellowCar': [21L], '//BlueCar': [31L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-12 0:00,0,{'//PurpleCar': [80L], '//YellowCar': [2011L], '//BlueCar': [8888L], '//WhiteCar-XYZ': [0L]}
ABC,2016-6-13 0:00,0,{'//PurpleCar': [32L], '//YellowCar': [15L], '//BlueCar': [4L], '//WhiteCar-XYZ': [0L]}
DEF,2016-6-16 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-17 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-18 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-19 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
DEF,2016-6-20 0:00,0,{'//PurpleCar': [32L], '//BlackCar': [15L], '//PinkCar': [4L], '//NPO-GreenCar': [0L]}
"""
I read the file into a Pandas data frame, as follows:
df = pd.read_csv(StringIO(the_data), sep=',')
Then, I add a few column headers, as follows:
df.columns = ['Company',
'Date',
'Volume',
'Car1',
'Car2',
'Car3',
'Car4']
I see that the data is coming through as follows:
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L], '//YellowCar': [403L], '//BlueCar': [16L], '//WhiteCar-XYZ': [0L]
But, I'd like to see the data without any of the following:
a) the curly braces ("{"
) at the beginning and the curly brace ("}"
) at the end of the dictionary
b) the "L" after the numerical values
c) the square brackets ("["
and "]"
) surrounding the numerical values
d) the apostrophes surrounding the keys
Ideally, the data would be transformed as follows:
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
I tried this:
df['Car1'] = df['Car1'].str.strip(['{', '}', '[', 'L]'])
But, it doesn't work. It results in the 'Car1' column becoming NaN values.
Is it possible to transform the data frame such that each row of the data frame reads as follows?
ABC,2016-6-9 0:00,95,//PurpleCar: 115, //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
Thanks!
UPDATE:
Using the following regular expression:
df['Car1'] = df['Car1'].str.replace(r'\D+', '').astype('int')
Results in this:
ABC,2016-6-9 0:00,95, 115 , //YellowCar: 403, //BlueCar: 16, //WhiteCar-XYZ: 0
We lose '//PurpleCar' and are left with only the numeric value of 115. That's a good start, but it would be great if we can see the '//PurpleCar' key, too.
Any ideas?
UPDATE 2:
Based on the comments by piRSquared and HYRY, my goal is to be able to plot the numerical results. So, I would like to have the data frame look as follows:
Company Date PurpleCar YellowCar BlueCar WhiteCar
0 ABC 2016-6-9 0:00 115 403 16 0
1 ABC 2016-6-10 0:00 219 381 90 0
2 ABC 2016-6-11 0:00 817 21 31 0
3 ABC 2016-6-12 0:00 80 2011 8888 0
4 ABC 2016-6-13 0:00 32 15 4 0
5 DEF 2016-6-16 0:00 32 15 4 0
6 DEF 2016-6-17 0:00 32 15 4 0
7 DEF 2016-6-18 0:00 32 15 4 0
8 DEF 2016-6-19 0:00 32 15 4 0
9 DEF 2016-6-20 0:00 32 15 4 0
* UPDATE 3: *
The data originally posted had a small mistake. Here is the data:
the_data = """
ABC,2016-6-9 0:00,95,"{'//Purple': [115L], '//Yellow': [403L], '//Blue': [16L], '//White-XYZ': [0L]}"
ABC,2016-6-10 0:00,0,"{'//Purple': [219L], '//Yellow': [381L], '//Blue': [90L], '//White-XYZ': [0L]}"
ABC,2016-6-11 0:00,0,"{'//Purple': [817L], '//Yellow': [21L], '//Blue': [31L], '//White-XYZ': [0L]}"
ABC,2016-6-12 0:00,0,"{'//Purple': [80L], '//Yellow': [2011L], '//Blue': [8888L], '//White-XYZ': [0L]}"
ABC,2016-6-13 0:00,0,"{'//Purple': [32L], '//Yellow': [15L], '//Blue': [4L], '//White-XYZ': [0L]}"
DEF,2016-6-16 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [3L]}"
DEF,2016-6-17 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [0L]}"
DEF,2016-6-18 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [7L]}"
DEF,2016-6-19 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [14L]}"
DEF,2016-6-20 0:00,0,"{'//Purple': [32L], '//Black': [15L], '//Pink': [4L], '//NPO-Green': [21L]}"
"""
The difference between this data and the original data is the apostrophes (")
before the opening curly brace ("{"
) and after the closing curly brace ("}"
).
Edit: The file seems to be actually an escaped CSV so we don't need a custom parsing for this part.
As @Blckknght points out in the comment, the file is not a valid CSV. I'll make some assumptions in my answer. They are
- You don't control the data and thus can't properly escape the commas.
- The first three columns won't contain any comma.
- The third column follows the syntax of a python dict.
- There is always one value in the list which is in the dict values.
First, some imports
import ast
import pandas as pd
We'll just split the rows by commas as we don't need to deal with any sort of CSV escaping (assumptions #1 and #2).
rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")
fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])
fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])
The first three columns are fixed and we leave them as they are. The last column we can parse with ast.literal_eval
because it's a dict
(assumption #3). This is IMO more readable and more flexible if the format changes than regex. Also you'll detect the format change earlier.
cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
del fixed_columns["Cars_str"]
And this part answers rather your other question.
We prepare functions to process the keys and values of the dict so they fail if our assumptions about content of the dict fail.
def get_single_item(list_that_always_has_single_item):
v, = list_that_always_has_single_item
return v
def extract_car_name(car_str):
assert car_str.startswith("//"), car_str
return car_str[2:]
We apply the functions and construct pd.Series
which allow us to...
dynamic_columns = cars.apply(
lambda x: pd.Series({
extract_car_name(k): get_single_item(v)
for k, v in x.items()
}))
...add the columns to the dataframe
result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result
Finally, we get the table:
Company Date Value BlackCar BlueCar NPO-GreenCar PinkCar \
0 ABC 2016-6-9 0:00 95 NaN 16.0 NaN NaN
1 ABC 2016-6-10 0:00 0 NaN 90.0 NaN NaN
2 ABC 2016-6-11 0:00 0 NaN 31.0 NaN NaN
3 ABC 2016-6-12 0:00 0 NaN 8888.0 NaN NaN
4 ABC 2016-6-13 0:00 0 NaN 4.0 NaN NaN
5 DEF 2016-6-16 0:00 0 15.0 NaN 0.0 4.0
6 DEF 2016-6-17 0:00 0 15.0 NaN 0.0 4.0
7 DEF 2016-6-18 0:00 0 15.0 NaN 0.0 4.0
8 DEF 2016-6-19 0:00 0 15.0 NaN 0.0 4.0
9 DEF 2016-6-20 0:00 0 15.0 NaN 0.0 4.0
PurpleCar WhiteCar-XYZ YellowCar
0 115.0 0.0 403.0
1 219.0 0.0 381.0
2 817.0 0.0 21.0
3 80.0 0.0 2011.0
4 32.0 0.0 15.0
5 32.0 NaN NaN
6 32.0 NaN NaN
7 32.0 NaN NaN
8 32.0 NaN NaN
9 32.0 NaN NaN
这篇关于如何从Pandas数据框中的字典中删除花括号,撇号和方括号(Python)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!