Sort CSV using a key computed from two columns, grab first n largest values

Question

Python amateur here... let's say I have a snippet of an example CSV file:

Country, Year, GDP, Population
Country1,2002,44545,24352
Country2,2004,14325,75677
Country3,2005,23132412,1345234
Country4,,2312421,12412

I need to sort the file by descending GDP per capita (GDP/Population) in a certain year, say, 2002, then grab the first 10 rows with the largest GDP per capita values.

So far, after I import the csv to a 'data' variable, I grab all the 2002 data without missing fields using:

data_2 = []
for row in data:
    if row[1] == '2002' and row[2] != ' ' and row[3] != ' ':
        data_2.append(row)

I need to find some way to sort data_2 by row[2]/row[3] descending, preferably without using a class, and then grab each entire row tied to each of the largest 10 values to then write to another csv. If someone could point me in the right direction I would be forever grateful as I've tried countless googles...

Recommended answer

This approach lets you get the top 10 for each year in a single scan of the file...

It is possible to do this without pandas by using the heapq module. The following is untested, but it should give you a base from which to consult the appropriate documentation and adapt to your purposes:

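import csv
import heapq
from itertools import islice

N = 10    # number of rows to keep per year
top = {}  # year -> min-heap of [gdp_per_capita, country, year, gdp, population]
with open('yourfile', newline='') as fin:
    csvin = csv.reader(fin)
    # Skip the header, drop rows with an empty GDP or population, and
    # prepend GDP per capita so that rows compare by that value first.
    rows_with_gdp = ([float(row[2]) / float(row[3])] + row
                     for row in islice(csvin, 1, None) if row[2] and row[3])
    for row in rows_with_gdp:
        heap = top.setdefault(row[2], [])  # row[2] is now the year
        if len(heap) < N:
            heapq.heappush(heap, row)
        else:
            heapq.heappushpop(heap, row)  # push, then drop the smallest

for year, vals in top.items():
    # row[1:] strips the prepended GDP per capita before printing
    print(year, [row[1:] for row in sorted(vals, reverse=True)])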
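
If only one year is needed, as in the question, a shorter variant might be to filter the rows for that year and let heapq.nlargest do the selection with a key function, then write the result with csv.writer. The following is a minimal sketch under some assumptions: the helper name top_n_by_gdp_per_capita, the sample data, and the in-memory io.StringIO streams (standing in for real files) are all hypothetical, and populations are assumed to be non-zero.

```python
import csv
import heapq
import io


def top_n_by_gdp_per_capita(rows, year, n=10):
    """Return the n rows for `year` with the largest GDP / Population."""
    def is_usable(row):
        # Keep rows for the requested year whose GDP and population are non-blank.
        return (len(row) >= 4 and row[1].strip() == year
                and row[2].strip() and row[3].strip())

    candidates = (row for row in rows if is_usable(row))
    # nlargest returns the rows already sorted in descending key order.
    return heapq.nlargest(n, candidates,
                          key=lambda row: float(row[2]) / float(row[3]))


# Hypothetical sample data in the same column layout as the question's file.
sample = """Country,Year,GDP,Population
A,2002,100,10
B,2002,90,3
C,2004,500,5
D,2002,,7
E,2002,80,40
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # keep the header so it can be written back out
top_rows = top_n_by_gdp_per_capita(reader, '2002', n=2)

out = io.StringIO()  # stand-in for open('some_output.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(top_rows)
```

Because the key function computes the ratio on the fly, the original rows are written out unchanged, with no extra column to strip afterwards.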
