List of lists to dictionary to pandas DataFrame

Question

I am trying to fit this data:

[['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

into a pandas DataFrame.

Not all labels are present in each record: for example, some records have 'Mileage' and others don't, and vice versa. There are 26 features in total, and very few records have all of them.

I would like to construct a pandas DataFrame that holds the features in columns; if a feature does not exist for a record, its value should be NaN.

So far I have:

colnames=['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year'...(all 26 features here)] 
df = pd.read_csv("./data/output.csv", sep=",", names=colnames, header=None)

The first few mandatory columns give the expected output, but once optional features come into play, a missing value causes every feature after it to land under the wrong column. Records are mapped correctly only if all features are present.

I forgot to mention that some features with a missing value also lack the ':' but are still present in the list. So there are two cases:

  • 'Mileage' appears in the record with no value and no ':'
  • 'Mileage' is missing from the record altogether

In both cases the value assigned should be NaN.

Recommended answer

Use a nested list comprehension to build a list of dictionaries and pass it to the DataFrame constructor; where a key is missing from a record, NaN is filled in:

import pandas as pd

L = [['Manufacturer: Hyundai',
  'Model: Tucson',
  'Mileage: 258000 km',
  'Registered: 07/2019'],
 ['Manufacturer: Mazda',
  'Model: 6',
  'Year: 2014',
  'Registered: 07/2019']]

# Split once on ': ' so that values carry no leading space
df = pd.DataFrame([dict(y.split(': ', 1) for y in x) for x in L])
print(df)
  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014
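
The question also mentions entries that carry no ':' at all (a bare label such as 'Mileage'). Passing those straight to dict() raises a ValueError, because every item has to split into exactly two parts. Below is a minimal sketch of one way to handle that, assuming a bare label should simply become NaN. The helper name to_record and the shortened sample list are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd


def to_record(items):
    """Turn ['Label: value', ...] into a dict; bare labels map to NaN."""
    record = {}
    for item in items:
        # partition never raises: sep is '' when no ':' is present
        label, sep, value = item.partition(':')
        value = value.strip()
        record[label.strip()] = value if sep and value else np.nan
    return record


# Hypothetical sample: 'Mileage' appears without ':' and without a value
L = [['Manufacturer: Hyundai', 'Model: Tucson', 'Mileage'],
     ['Manufacturer: Mazda', 'Model: 6', 'Year: 2014']]

df = pd.DataFrame([to_record(x) for x in L])
print(df)
```

Both cases from the question end up as NaN here: the bare 'Mileage' in the first record and the absent 'Mileage' in the second.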

If labels and values are separated by whitespace instead, use .split(maxsplit=1) to split on the first whitespace only:

L = [['Manufacturer Hyundai',
  'Model Tucson',
  'Mileage 258000 km',
  'Registered 07/2019'],
 ['Manufacturer Mazda',
  'Model 6',
  'Year 2014',
  'Registered 07/2019']]


df = pd.DataFrame([dict(y.split(maxsplit=1) for y in x) for x in L])
print (df)

  Manufacturer    Mileage   Model Registered  Year
0      Hyundai  258000 km  Tucson    07/2019   NaN
1        Mazda        NaN       6    07/2019  2014
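
As a quick illustration of why maxsplit=1 matters here, the sketch below uses strings shaped like the sample entries. Without a separator argument, str.split treats any run of whitespace as one separator, and maxsplit=1 leaves the rest of the string (including spaces inside the value) untouched.

```python
# maxsplit=1 splits once, on the first run of whitespace
print('Mileage 258000 km'.split(maxsplit=1))    # ['Mileage', '258000 km']
print('Mileage   258000 km'.split(maxsplit=1))  # ['Mileage', '258000 km']

# A bare label yields a single element, which dict() cannot use as a pair
parts = 'Mileage'.split(maxsplit=1)
print(parts)                                    # ['Mileage']
```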

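For labels made up of two words, such as 'Additional equipment', list them in words2 and split those entries after the second word:

L = [['Manufacturer  Hyundai',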
  'Model  Tucson',
  'Mileage  258000 km',
  'Registered  07/2019'],
 ['Manufacturer  Mazda',
  'Model  6',
  'Year  2014',
  'Registered  07/2019',
  'Additional equipment aaa']]

words2 = ['Additional equipment']

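L1 = []
for x in L:
    di = {}
    for y in x:
        for word in words2:
            # Two-word label: both of its words occur in the entry
            if set(word.split(maxsplit=2)[:2]) < set(y.split()):
                i, j, k = y.split(maxsplit=2)
                di['_'.join([i, j])] = k
                break
        else:
            # No two-word label matched: split on the first whitespace
            i, j = y.split(maxsplit=1)
            di[i] = j
    L1.append(di)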

df = pd.DataFrame(L1)
print (df)
  Additional_equipment Manufacturer    Mileage   Model Registered  Year
0                  NaN      Hyundai  258000 km  Tucson    07/2019   NaN
1                  aaa        Mazda        NaN       6    07/2019  2014
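
The question asks for one column per feature, which includes features that happen to be absent from every record in a batch. One possible sketch, assuming the full label list is known up front: DataFrame.reindex(columns=...) adds any missing columns filled with NaN and fixes the column order. The shortened colnames list below is a stand-in for the 26-item list in the question, and 'Color' is a made-up label used only for illustration.

```python
import pandas as pd

# Stand-in for the full list of 26 feature names; 'Color' is hypothetical
colnames = ['Manufacturer', 'Model', 'Mileage', 'Registered', 'Year', 'Color']

records = [{'Manufacturer': 'Hyundai', 'Model': 'Tucson', 'Mileage': '258000 km'},
           {'Manufacturer': 'Mazda', 'Model': '6', 'Year': '2014'}]

# reindex adds any column missing from the data (all NaN) and orders them
df = pd.DataFrame(records).reindex(columns=colnames)
print(df.columns.tolist())
```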
