Faster way to make pandas Multiindex dataframe than append


Problem description


I am looking for a faster way to load data from my json object into a multiindex dataframe.

My JSON is like:

    {
        "1990-1991": {
            "Cleveland": {
                "salary": "$14,403,000",
                "players": {
                    "Hot Rod Williams": "$3,785,000",
                    "Danny Ferry": "$2,640,000",
                    "Mark Price": "$1,400,000",
                    "Brad Daugherty": "$1,320,000",
                    "Larry Nance": "$1,260,000",
                    "Chucky Brown": "$630,000",
                    "Steve Kerr": "$548,000",
                    "Derrick Chievous": "$525,000",
                    "Winston Bennett": "$525,000",
                    "John Morton": "$350,000",
                    "Milos Babic": "$200,000",
                    "Gerald Paddio": "$120,000",
                    "Darnell Valentine": "$100,000",
                    "Henry James": "$75,000"
                },
                "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/"
            },

I am making the dataframe like:

    df = pd.DataFrame(columns=["year", "team", "player", "salary"])
    
    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            for player in nbaSalaryData[year][team]['players']:
                df = df.append({
                        "year": year,
                        "team": team,
                        "player": player,
                        "salary": nbaSalaryData[year][team]['players'][player]
                    }, ignore_index=True)
    
    df = df.set_index(['year', 'team', 'player']).sort_index()
    df

Which results in:

                                              salary 
    year       team     player
    1990-1991  Atlanta  Doc Rivers          $895,000
                        Dominique Wilkins   $2,065,000
                        Gary Leonard        $200,000
                        John Battle         $590,000
                        Kevin Willis        $685,000
    ... ... ... ...
    2020-2021   Washington  Robin Lopez     $7,300,000
                        Rui Hachimura       $4,692,840
                        Russell Westbrook   $41,358,814
                        Thomas Bryant       $8,333,333
                        Troy Brown          $3,372,840

This is the form I want - year, team, and player as indexes and salary as a column. I know using append is slow but I cannot figure out an alternative. I tried to make it using tuples (with a slightly different configuration - no players and salary) but it ended up not working.

    tuples = []
    index = None

    for year in nbaSalaryData.keys():
        for team in nbaSalaryData[year]:
            t = nbaSalaryData[year][team]
            tuples.append((year, team))

    index = pd.MultiIndex.from_tuples(tuples, names=["year", "team"])
    df = index.to_frame()
    df

Which outputs:

                             year   team
    year    team        
    1990-1991   Cleveland   1990-1991   Cleveland
                New York    1990-1991   New York
                Detroit     1990-1991   Detroit
                LA Lakers   1990-1991   LA Lakers
                Atlanta     1990-1991   Atlanta  

I'm not that familiar with pandas but realize there must be a faster way than append().

Solution

You can adapt the answer to a very similar question as follows:

    import json
    import pandas as pd

    z = json.loads(json_data)  # json_data: the raw JSON string

    # One dict comprehension flattens the nested dict
    # into {(year, team, player): salary}
    out = pd.Series({
        (i, j, m): z[i][j][k][m]
        for i in z
        for j in z[i]
        for k in ['players']
        for m in z[i][j][k]
    }).to_frame('salary').rename_axis('year team player'.split())

    # out:

                                               salary
    year      team      player                       
    1990-1991 Cleveland Hot Rod Williams   $3,785,000
                        Danny Ferry        $2,640,000
                        Mark Price         $1,400,000
                        Brad Daugherty     $1,320,000
                        Larry Nance        $1,260,000
                        Chucky Brown         $630,000
                        Steve Kerr           $548,000
                        Derrick Chievous     $525,000
                        Winston Bennett      $525,000
                        John Morton          $350,000
                        Milos Babic          $200,000
                        Gerald Paddio        $120,000
                        Darnell Valentine    $100,000
                        Henry James           $75,000
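As a side note, `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop in the question no longer runs on current versions. Another common pattern, which is not the method from the answer above, is to collect plain tuples in a list and build the DataFrame once at the end. The sketch below assumes the same nesting as in the question and uses a trimmed sample of its data; the names `rows`, `teams` and `info` are only illustrative.

```python
import pandas as pd

# Trimmed sample with the same nesting as the JSON in the question.
nbaSalaryData = {
    "1990-1991": {
        "Cleveland": {
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            }
        },
        "Atlanta": {"players": {"Doc Rivers": "$895,000"}},
    }
}

# Collect one tuple per player; appending to a list is cheap,
# unlike growing a DataFrame row by row.
rows = [
    (year, team, player, salary)
    for year, teams in nbaSalaryData.items()
    for team, info in teams.items()
    for player, salary in info["players"].items()
]

# Build the frame once, then move the three key columns into the index.
df = (
    pd.DataFrame(rows, columns=["year", "team", "player", "salary"])
    .set_index(["year", "team", "player"])
    .sort_index()
)
```

Either way, the expensive part (creating the DataFrame) happens once instead of once per player.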

Also, if you intend to do some numerical analysis with those salaries, you probably want them as numbers, not strings. If so, also consider:

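    # regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
    out['salary'] = pd.to_numeric(out['salary'].str.replace(r'\D', '', regex=True))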
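To check the conversion in isolation, here is a minimal sketch on three salary strings taken from the question. The small `out` frame is built by hand purely for illustration. Passing `regex=True` makes `\D` match every non-digit character; on pandas 2.0 and later the default is a literal match, which would leave the strings unchanged.

```python
import pandas as pd

# Hand-built stand-in for the real `out` frame (three rows from the question).
out = pd.DataFrame(
    {"salary": ["$3,785,000", "$630,000", "$75,000"]},
    index=pd.MultiIndex.from_tuples(
        [
            ("1990-1991", "Cleveland", "Hot Rod Williams"),
            ("1990-1991", "Cleveland", "Chucky Brown"),
            ("1990-1991", "Cleveland", "Henry James"),
        ],
        names=["year", "team", "player"],
    ),
)

# Strip every non-digit character ("$" and ","), then convert to integers.
out["salary"] = pd.to_numeric(out["salary"].str.replace(r"\D", "", regex=True))

# Numeric salaries can now be aggregated.
total = int(out["salary"].sum())
```

Note that this only suits whole-dollar amounts like the ones shown; a value with a decimal point would lose it.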

PS: Explanation:

The for lines are just one big comprehension to flatten your nested dict. To understand how it works, try first:

    [
        (i, j)
        for i in z
        for j in z[i]
    ]

The third for would normally iterate over all keys of z[i][j], which are ['salary', 'players', 'url']. We are only interested in 'players', so we loop over the one-element list ['players'] instead.
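The effect of that one-element loop can be seen on a trimmed copy of the question's data. This is only a sketch: `team_keys`, `with_k` and `direct` are illustrative names, and the second comprehension is an equivalent spelling rather than part of the original answer.

```python
# Trimmed sample: one season, one team, two players.
z = {
    "1990-1991": {
        "Cleveland": {
            "salary": "$14,403,000",
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            },
            "url": "https://hoopshype.com/salaries/cleveland_cavaliers/1990-1991/",
        }
    }
}

# All keys at the team level; only 'players' holds per-player salaries.
team_keys = list(z["1990-1991"]["Cleveland"])

# Looping over ['players'] pins k to that single key...
with_k = {
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ["players"]
    for m in z[i][j][k]
}

# ...which is the same as indexing 'players' directly.
direct = {
    (i, j, m): v
    for i in z
    for j in z[i]
    for m, v in z[i][j]["players"].items()
}
```

Both comprehensions produce the same dict, so the choice between them is a matter of taste.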

The final bit is that we want a dict rather than a list, hence the curly braces and the key: value form. Try the expression without the surrounding pd.Series() and you'll see exactly what's going on.
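Putting the pieces together on a small sample (values copied from the question, trimmed to three players) might look like the sketch below. The intermediate name `flat` is an assumption for illustration, and the trailing `sort_index()` is added here only to match the sorted layout the question asked for.

```python
import pandas as pd

# Trimmed sample with the same nesting as the question's JSON.
z = {
    "1990-1991": {
        "Cleveland": {
            "players": {
                "Hot Rod Williams": "$3,785,000",
                "Danny Ferry": "$2,640,000",
            }
        },
        "Atlanta": {"players": {"Doc Rivers": "$895,000"}},
    }
}

# Step 1: the bare comprehension gives a flat dict keyed by 3-tuples.
flat = {
    (i, j, m): z[i][j][k][m]
    for i in z
    for j in z[i]
    for k in ["players"]
    for m in z[i][j][k]
}

# Step 2: a Series built from tuple keys gets a MultiIndex automatically;
# to_frame names the column and rename_axis names the index levels.
out = (
    pd.Series(flat)
    .to_frame("salary")
    .rename_axis(["year", "team", "player"])
    .sort_index()
)
```

Printing `flat` before wrapping it in `pd.Series()` shows the tuple keys that later become the three index levels.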
