Read dataframe split by nan rows and extract specific columns in Python


Problem description

I have an example Excel file, data2.xlsx (from here), with a sheet named Sheet1 as follows:

Preprocessing:

The columns 2018, 2019, 2020 and num have object dtype, and I need to convert them to float:

cols = ['2018', '2019', '2020', 'num']
# replace() returns a new object, so assign the result back to keep it
df[cols] = df[cols].replace('--', np.nan, regex=True).astype(float)
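As a minimal sketch of this conversion step, the snippet below uses a small made-up frame (the values are hypothetical, not taken from data2.xlsx) and compares replacing the '--' placeholder before casting with coercing through pd.to_numeric. Either approach should leave the placeholder cells as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real values come from data2.xlsx.
df = pd.DataFrame({
    "2018": ["1", "--", "3"],
    "num": ["21", "10", "--"],
})
cols = ["2018", "num"]

# Option A: swap the placeholder for NaN, then cast the strings to float.
converted = df[cols].replace("--", np.nan).astype(float)

# Option B: let pandas turn anything non-numeric into NaN, column by column.
coerced = df[cols].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(coerced)
```

Option B is a little more forgiving if other non-numeric placeholders turn up, since it does not need to know them in advance.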

I also need to extract the city names bj, sh, gz and sz from 2019-bj-price-quantity, 2019-sh-price-quantity, 2019-gz-price-quantity and 2019-sz-price-quantity:

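pattern = '|'.join(['2019-', '-price-quantity'])
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
df['city'] = df['city'].str.replace(pattern, '', regex=True)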
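To illustrate the city extraction on its own, here is a small sketch on a standalone Series built from the four titles quoted above. The str.extract variant and its regular expression are an assumed alternative, not part of the original question.

```python
import pandas as pd

# The four block titles mentioned in the question.
labels = pd.Series([
    "2019-bj-price-quantity",
    "2019-sh-price-quantity",
    "2019-gz-price-quantity",
    "2019-sz-price-quantity",
])

# Strip the known prefix and suffix; the "|" alternation requires regex=True.
pattern = "|".join(["2019-", "-price-quantity"])
by_replace = labels.str.replace(pattern, "", regex=True)

# Alternative: capture the city code between the year and the suffix.
by_extract = labels.str.extract(r"^\d{4}-([a-z]+)-price-quantity$", expand=False)

print(by_replace.tolist())
print(by_extract.tolist())
```

The capture-group version does not hard-code the year, which may help if the sheet later contains other years.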

Finally, I need to extract the price and quantity values of num for each city and reshape them into a new dataframe with one row per city and price and quantity as columns.

How could I do that in pandas? Thanks.

Update:

df = pd.read_excel('./data2.xlsx', sheet_name = 'Sheet1', header = None)
df.groupby(df.iloc[:, 0].isna().cumsum()).transform('first')

Out:

                         0       1       2       3    4
0   2019-bj-price-quantity  2018.0  2019.0  2020.0  num
1   2019-bj-price-quantity  2018.0  2019.0  2020.0  num
2   2019-bj-price-quantity  2018.0  2019.0  2020.0  num
3   2019-bj-price-quantity  2018.0  2019.0  2020.0  num
4   2019-sh-price-quantity  2018.0  2019.0  2020.0  num
5   2019-sh-price-quantity  2018.0  2019.0  2020.0  num
6   2019-sh-price-quantity  2018.0  2019.0  2020.0  num
7   2019-sh-price-quantity  2018.0  2019.0  2020.0  num
8   2019-sh-price-quantity  2018.0  2019.0  2020.0  num
9                      NaN     NaN     NaN     NaN  NaN
10  2019-gz-price-quantity  2018.0  2019.0  2020.0  num
11  2019-gz-price-quantity  2018.0  2019.0  2020.0  num
12  2019-gz-price-quantity  2018.0  2019.0  2020.0  num
13  2019-gz-price-quantity  2018.0  2019.0  2020.0  num
14  2019-gz-price-quantity  2018.0  2019.0  2020.0  num
15                     NaN     NaN     NaN     NaN  NaN
16  2019-sz-price-quantity  2018.0  2019.0  2020.0  num
17  2019-sz-price-quantity  2018.0  2019.0  2020.0  num
18  2019-sz-price-quantity  2018.0  2019.0  2020.0  num
19  2019-sz-price-quantity  2018.0  2019.0  2020.0  num
20  2019-sz-price-quantity  2018.0  2019.0  2020.0  num
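The grouping key used in the update works because isna().cumsum() increases by one at every blank row, so all rows between two blank rows share one group id. A minimal sketch on a made-up two-block frame (hypothetical values, only two columns) shows how the key can split the sheet into separate blocks:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sheet: two blocks separated by an all-NaN row.
raw = pd.DataFrame({
    0: ["2019-bj-price-quantity", "year", "price", np.nan,
        "2019-sh-price-quantity", "year", "price"],
    1: [np.nan, 2018, 5, np.nan, np.nan, 2018, 7],
})

# The counter increments at each NaN in the first column, giving a block id.
block_id = raw.iloc[:, 0].isna().cumsum()

# Split on the block id and drop the separator rows themselves.
blocks = [group.dropna(how="all") for _, group in raw.groupby(block_id)]
titles = [block.iloc[0, 0] for block in blocks]

print(block_id.tolist())
print(titles)
```

Each separator row belongs to the block that follows it, which is why dropna(how="all") is applied after splitting.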

Related question: Read dataframe split by nan rows and reshape them into multiple dataframes in Python

Solution

My code is based on jezrael's great answer. Better solutions or improvements are welcome:

import numpy as np
import pandas as pd

# add header=None for default column names
df = pd.read_excel('./data2.xlsx', sheet_name='Sheet1', header=None)

# use the second row as the column names
df.columns = df.iloc[1].rename(None)

# create a new column `city`: keep the first column only where the second
# column is missing (the title rows), then forward fill it
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())

# strip the year prefix and the suffix; regex=True is needed on pandas >= 2.0
pattern = '|'.join(['2019-', '-price-quantity'])
df['city'] = df['city'].str.replace(pattern, '', regex=True)
df['year'] = df['year'].str.replace(pattern, '', regex=True)

# convert float column names to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]

# keep only the price and quantity rows and the columns of interest
df = df[df.year.isin(['price', 'quantity'])]
df = df[['city', 'year', 'num']].copy()
df['num'] = df['num'].replace('--', np.nan, regex=True).astype(float)

# reshape: one row per city, price and quantity as columns
df = df.set_index(['city', 'year']).unstack().reset_index()
df.columns = df.columns.droplevel(0)
df.rename({'year': 'city'}, axis=1, inplace=True)
print(df)

Out:

year      price  quantity
0     bj   21.0      10.0
1     gz    6.0      15.0
2     sh   12.0       NaN
3     sz   13.0       NaN
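For readers without the Excel file, the same idea can be sketched end to end on an in-memory frame. The layout below (a title row, a header row, then price and quantity rows, with a blank row between blocks) and all numbers in it are assumptions made for illustration. It also swaps set_index/unstack for DataFrame.pivot, which is an alternative rather than the method used above.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical rows mimicking the assumed sheet layout (no header).
rows = [
    ["2019-bj-price-quantity", nan, nan, nan, nan],
    ["year", 2018, 2019, 2020, "num"],
    ["price", 1, 2, 3, 21],
    ["quantity", 4, 5, 6, 10],
    [nan, nan, nan, nan, nan],
    ["2019-sh-price-quantity", nan, nan, nan, nan],
    ["year", 2018, 2019, 2020, "num"],
    ["price", 7, 8, 9, 12],
    ["quantity", 1, 2, 3, "--"],
]
raw = pd.DataFrame(rows)

# Title rows have a value in column 0 but nothing in column 1.
is_title = raw[0].notna() & raw[1].isna()
city = (
    raw[0].where(is_title).ffill()
    .str.extract(r"^\d{4}-([a-z]+)-", expand=False)
)

# Keep only the price/quantity rows and the columns of interest.
data = raw.assign(city=city)
data = data[data[0].isin(["price", "quantity"])]
data = data.rename(columns={0: "item", 4: "num"})[["city", "item", "num"]]
data = data.assign(num=pd.to_numeric(data["num"], errors="coerce"))

# One row per city, with price and quantity as columns.
result = data.pivot(index="city", columns="item", values="num").reset_index()
result.columns.name = None
print(result)
```

Clearing result.columns.name removes the leftover axis label, so the first column is plainly named city.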
