Group and combine files with common prefixes

Problem description

I wrote a few functions that download all the election CSV files by precinct. The downloaded file names look like this:

Hzwpukgh_2008Parliamentary-Majoritarian
Hzwpukgh_2008Parliamentary-PartyList
Hzwpukgh_2008Presidential
...
Truc_2008Presidential

For a given election and a given precinct, each file contains the following:

"Election"," Map Level"," Precinct ID"," Precinct Name","Overall Results","#1 - Mikheil Saakashvili","#2 - Levan Gachechiladze","#3 - Shalva Natelashvili","#4 - Arkadi (Badri) Patarkatsishvili","#5 - Davit Gamkrelidze","#6 - Giorgi (Gia) Maisashvili","#7 - Irina Sarishvili-Chanturia","Total Voter Turnout (#)","Total Voter Turnout (%)","Average votes per minute (08:00-12:00)","Average votes per minute (12:00-17:00)","Average votes per minute (17:00-20:00)"
"2008 Presidential","Precinct","1","39-1","Mikheil Saakashvili","74.48","18.45","1.74","5.92","3.71","0.58","0.12","862","58.24","1.19","1.45","1.05"
"2008 Presidential","Precinct","10","39-10","Mikheil Saakashvili","61.62","24.75","3.03","5.56","5.05","0","0","198","75","0.25","0.34","0.2"

I would like to gather the CSVs from different years for a given precinct, say Hzwpukgh, into a single CSV that looks like this:

                       2010 Presidential   2017 Presidential ...  
Tprolps Zhhrhzocpsp                67.68                 NaN
Levan Gachechiladze                20.96                 NaN
...
Npvynp Thynclshzocpsp                NaN               64.15
Davit Bakradze                       NaN               13.86
...

As a first step, though, I want to merge the CSVs into one. How can I merge the files that share the same name before the underscore?

The merged file would look like this:

"Election"," Map Level"," Precinct ID"," Precinct Name","Overall Results","#1 - Mikheil Saakashvili","#2 - Levan Gachechiladze","#3 - Shalva Natelashvili","#4 - Arkadi (Badri) Patarkatsishvili","#5 - Davit Gamkrelidze","#6 - Giorgi (Gia) Maisashvili","#7 - Irina Sarishvili-Chanturia","Total Voter Turnout (#)","Total Voter Turnout (%)","Average votes per minute (08:00-12:00)","Average votes per minute (12:00-17:00)","Average votes per minute (17:00-20:00)"
"2008 Presidential","Precinct","1","39-1","Mikheil Saakashvili","74.48","18.45","1.74","5.92","3.71","0.58","0.12","862","58.24","1.19","1.45","1.05"
"2008 Presidential","Precinct","10","39-10","Mikheil Saakashvili","61.62","24.75","3.03","5.56","5.05","0","0","198","75","0.25","0.34","0.2"
...
"2008 Parliamentary-Majoritarian","Precinct","1","39-1","Mikheil Saakashvili","74.48","18.45","1.74","5.92","3.71","0.58","0.12","862","58.24","1.19","1.45","1.05"
"2008 Parliamentary-Majoritarian","Precinct","10","39-10","Mikheil Saakashvili","61.62","24.75","3.03","5.56","5.05","0","0","198","75","0.25","0.34","0.2"

From that I could build the dataframe shown above. If you know of any other approach, I would be glad to hear it :)

I tried the following:

import glob
import random
import os
import pandas

def find_filesets(path="."):
    csv_files = {}
    for name in glob.glob("{}/*_*.csv".format(path)):
        # there's almost certainly a better way to do this
        key = os.path.splitext(os.path.basename(name))[0].split('_')[0]
        csv_files.setdefault(key, []).append(name)

    for key,filelist in csv_files.items(): 
        print(key, filelist)
        # do something with filelist
        create_merged_csv(key, filelist)

def create_merged_csv(key, filelist):
    with open('{}-aggregate.csv'.format(key), 'w+b') as outfile:
        for filename in filelist:
            df = pandas.read_csv(filename)
            print(df)
            df.to_csv(outfile, index=False)

find_filesets('./Results')

But it returned:

01 ['./Results\\01_2016Parliamentary-Majoritarian.csv', './Results\\01_2016Parliamentary-MajoritarianRunoff.csv', './Results\\01_2016Parliamentary-PartyList.csv']
   "Election"," Map Level"," Precinct ID"," Precinct Name","Overall Results","#1 - Initiative Group","#2 - United National Movement","#3 - Free Democrats","#4 - Alliance of Patriots","#5 - Democratic Movement","#6 - Republican party","#7 - Georgia for Peace","#8 - State for the People","#9 - Georgian Idea","#10 - National Forum","#11 - For United Georgia","#12 - Georgia","#13 - Ours - People's Party","#14 - Progressive Democratic Movement","#14 - Georgian Group","#14 - Labour","#14 - Communist Party - Stalin","#14 - Socialist Workers Party","#14 - United Communist Party","#14 - Industrialists - Our Homeland","#14 - Merab Kostava Society","#14 - Leftist Alliance","#14 - In the Name of the Lord","#14 - Georgian Dream","Invalid Ballots (%)","More Ballots Than Votes (#)","More Votes Than Ballots (#)","Total Voter Turnout (#)","Total Voter Turnout (%)","Average votes per minute (08:00-12:00)","Average votes per minute (12:00-17:00)","Average votes per minute (17:00-20:00)"
0   "2016 Parliamentary - Majoritarian","Precinct"...
1   "2016 Parliamentary - Majoritarian","Precinct"...
2   "2016 Parliamentary - Majoritarian","Precinct"...
3   "2016 Parliamentary - Majoritarian","Precinct"...
...
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-3b33d1e84680> in <module>
      4 import pandas
      5 
----> 6 find_filesets('./Results')

<ipython-input-13-533474b39654> in find_filesets(path)
      9         print(key, filelist)
     10         # do something with filelist
---> 11         create_merged_csv(key, filelist)


<ipython-input-13-533474b39654> in create_merged_csv(key, filelist)
     22             df = pandas.read_csv(filename, sep='delimiter')
     23             print(df)
---> 24             df.to_csv(outfile, index=False, header=None)


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   3018                                  doublequote=doublequote,
   3019                                  escapechar=escapechar, decimal=decimal)
-> 3020         formatter.save()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in save(self)
    170                 self.writer = UnicodeWriter(f, **writer_kwargs)
    171 
--> 172             self._save()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save(self)
    286                 break
    287 
--> 288             self._save_chunk(start_i, end_i)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save_chunk(self, start_i, end_i)
    313 
    314         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 315                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

TypeError: a bytes-like object is required, not 'str'

Recommended answer

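to_csv() accepts either a file path or a file object, but in the pandas version shown in the traceback the file object must be writable as text. You opened the output file in binary mode ('w+b'), so when pandas writes str rows to it you get "TypeError: a bytes-like object is required, not 'str'".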
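If you would rather keep a single open handle, a text-mode handle is one option. The following is a minimal sketch, not the asker's code: `write_frames_to_handle` and the sample frames are hypothetical, and it assumes every frame has the same columns in the same order, which is not true of election files with different candidates.

```python
import os
import tempfile

import pandas as pd


def write_frames_to_handle(frames, path):
    """Write several frames into one CSV through a single text-mode handle."""
    # newline='' lets the CSV writer control line endings itself.
    with open(path, "w", newline="") as outfile:
        for i, df in enumerate(frames):
            # Emit the header row only for the first frame.
            df.to_csv(outfile, index=False, header=(i == 0))


# Hypothetical sample data, for illustration only.
frames = [
    pd.DataFrame({"Election": ["2008 Presidential"], "Total Voter Turnout (#)": [862]}),
    pd.DataFrame({"Election": ["2008 Parliamentary"], "Total Voter Turnout (#)": [198]}),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo-aggregate.csv")
    write_frames_to_handle(frames, path)
    combined = pd.read_csv(path)
```

Because rows are written positionally, this only makes sense when the column layout is identical across files; otherwise concatenating the frames in memory first is the safer route.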

The simplest fix is to pass the path and let to_csv() open the file itself:

def create_merged_csv(key, filelist):
    outfile = '{}-aggregate.csv'.format(key)
    for filename in filelist:
        df = pandas.read_csv(filename)
        print(df)
        df.to_csv(outfile, index=False)

However, this is probably not what you want: each to_csv() call overwrites the output file, so only the data from the last file in the list survives. You want to merge/append the data frames first and then write the final file once.

Here is an example, assuming that appending the data frames is what you want.

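import pandas

def create_merged_csv(key, filelist):
    outfile = '{}-aggregate.csv'.format(key)
    # Read every file in the group first.
    frames = [pandas.read_csv(filename) for filename in filelist]
    # Stack the frames in one call. DataFrame.append was removed in
    # pandas 2.0; pandas.concat is its replacement. Columns missing
    # from a given file are filled with NaN.
    df = pandas.concat(frames, ignore_index=True, sort=False)
    print(df)
    # Write the combined frame once.
    df.to_csv(outfile, index=False)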
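As a possible next step toward the candidate-by-election table described in the question, here is a rough sketch. It rests on several assumptions: `candidate_table` is a hypothetical helper, candidate columns are taken to be those whose names start with "#", and shares are averaged across precincts with an unweighted mean, which may not be the aggregation you actually want. The second sample frame is made up for illustration.

```python
import pandas as pd


def candidate_table(merged):
    """Return a candidates x elections table of mean vote shares."""
    # Assumption: per-candidate columns look like "#1 - Mikheil Saakashvili".
    share_cols = [c for c in merged.columns if str(c).startswith("#")]
    shares = merged[share_cols].apply(pd.to_numeric, errors="coerce")
    # Assumption: a plain (unweighted) mean across precincts per election.
    by_election = shares.groupby(merged["Election"]).mean()
    # Drop the "#n - " ballot-number prefix from the candidate names.
    by_election.columns = by_election.columns.str.replace(
        r"^#\d+\s*-\s*", "", regex=True
    )
    # Candidates as rows, elections as columns.
    return by_election.T


frame_2008 = pd.DataFrame({
    "Election": ["2008 Presidential", "2008 Presidential"],
    "#1 - Mikheil Saakashvili": [74.48, 61.62],
    "#2 - Levan Gachechiladze": [18.45, 24.75],
})
# Hypothetical second election, values invented for the demo.
frame_other = pd.DataFrame({
    "Election": ["Other Election"],
    "#1 - Candidate X": [50.0],
})

merged = pd.concat([frame_2008, frame_other], ignore_index=True, sort=False)
table = candidate_table(merged)
print(table)
```

Candidates who did not run in an election come out as NaN, as in the target layout. If the same candidate appears under different ballot numbers in different elections, the stripped names will collide and the rows would need to be combined in an extra step.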
