Python-从多个Zip文件中提取CSV文件并合并数据 [英] Python - Extract CSV Files from Multiple Zip Files and Combine the Data

查看:140
本文介绍了Python-从多个Zip文件中提取CSV文件并合并数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Python脚本和熊猫来组合多个ZIP文件.我正在使用托管在GitHub存储库中的数据: https://github.com/statistikat/coronaDAT

I have a Python script and pandas to combine multiple ZIP files. I am using data hosted in a GitHub repository here: https://github.com/statistikat/coronaDAT

该脚本应将所有ZIP文件放在一个文件夹结构中,在ZIP文件中找到"Bezirke.csv"文件,然后将所有Bezirke.csv文件合并为一个大CSV文件.但是,该代码仅从该文件夹中抓取一个ZIP文件.

The script should take all ZIP files in a folder structure, find the "Bezirke.csv" file in the ZIP file, and combine all the Bezirke.csv files into one large CSV file. However, the code is only grabbing one ZIP file from the folder.

关于为什么它不从文件夹中其他ZIP文件获取数据的任何建议?

Any suggestions on why the it is not taking the data from the other ZIP files in the folder?

import glob
from zipfile import ZipFile

path = r'/Users/matt/test/' # use your path

#load all zip files in folder
all_files = glob.glob(path + "/*.zip")

li = []

for filename in all_files:

    zip_file = ZipFile(filename)
    df = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename), delimiter=';', header=0, index_col=['Timestamp'], parse_dates=['Timestamp'])
       for text_file in zip_file.infolist()
       if text_file.filename.endswith('Bezirke.csv')}

    li.append(df)

#print dataframe in console
print(df)

#prepare date to export to csv
frame = pd.concat(df, axis=0)

#export to csv
frame.to_csv( "combined_zip_Bezirke.csv", encoding='utf-8-sig')
print("Export to CSV Successful")```

推荐答案

我对您的代码进行了一些更改,并使用三个示例文件对其进行了测试:

I made some changes to your code and tested it with three sample files:

20200523_010000_orig_csv.zip
20200523_020000_orig_csv.zip
20200523_030000_orig_csv.zip

修改后的代码:

df_master = pd.DataFrame()
flag = False

for filename in all_files:    
    zip_file = ZipFile(filename)
    for text_file in zip_file.infolist():
        if text_file.filename.endswith('Bezirke.csv'):
            df = pd.read_csv(zip_file.open(text_file.filename), 
            delimiter=';', 
            header=0, 
            index_col=['Timestamp'], 
            parse_dates=['Timestamp']
            )
    if not flag:
        df_master = df
        flag = True
    else:
        df_master = pd.concat([df_master, df])

print(df_master.info())

输出:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 279 entries, 2020-05-23 02:00:00 to 2020-05-23 03:00:00
Data columns (total 4 columns):
Bezirk             279 non-null object
Anzahl             279 non-null int64
Anzahl_Inzidenz    279 non-null object
GKZ                279 non-null int64

文件数据现在被追加.不确定这是否是最佳的最佳解决方案.如果适用,请提出改进​​建议.

The files data is getting appended now. Not sure if this is the best solution in terms of optimisation. Kindly suggest changes for improvement, if applicable.

数据帧输出:

                                           Bezirk  Anzahl   Anzahl_Inzidenz  GKZ
Timestamp                                                                       
2020-05-23 02:00:00                     Amstetten     301  259,228000068898  305
2020-05-23 02:00:00                         Baden     157  107,384937381586  306
2020-05-23 02:00:00                       Bludenz     256  401,795523746743  801
2020-05-23 02:00:00                Braunau am Inn     130  124,511531683396  404
2020-05-23 02:00:00                       Bregenz     274  203,894837888721  802
2020-05-23 02:00:00           Bruck an der Leitha     105  102,931085187727  307
2020-05-23 02:00:00            Bruck-Mürzzuschlag      52  52,5337428271236  621
2020-05-23 02:00:00              Deutschlandsberg      49   80,564278785288  603
2020-05-23 02:00:00                      Dornbirn     153  171,830954279489  803
2020-05-23 02:00:00                      Eferding      53  159,850404150078  405
2020-05-23 02:00:00             Eisenstadt(Stadt)      18  122,976019676163  101
2020-05-23 02:00:00           Eisenstadt-Umgebung      23  53,5793323549281  103
2020-05-23 02:00:00                     Feldkirch     213  198,770051978835  804
2020-05-23 02:00:00                   Feldkirchen      15  50,1052209640244  210
2020-05-23 02:00:00                     Freistadt     120  180,123384518395  406
2020-05-23 02:00:00                   Gänserndorf      99  95,4805856142584  308
2020-05-23 02:00:00                         Gmünd      14  38,0714110896582  309
2020-05-23 02:00:00                       Gmunden      73  71,8284775314619  407
2020-05-23 02:00:00                   Graz(Stadt)     466  161,353988490544  601
2020-05-23 02:00:00                 Graz-Umgebung     246  159,471022948269  606
2020-05-23 02:00:00                  Grieskirchen      74  114,336923100694  408
2020-05-23 02:00:00                       Güssing      24  93,0340737295034  104
2020-05-23 02:00:00                       Hallein      69  114,287607248153  502
2020-05-23 02:00:00          Hartberg-Fürstenfeld     318  350,908167994527  622
2020-05-23 02:00:00                      Hermagor       5  27,4363476733977  203
2020-05-23 02:00:00                    Hollabrunn      37  72,7515828384915  310
2020-05-23 02:00:00                          Horn      58   186,55516243165  311
2020-05-23 02:00:00                          Imst     293  487,877980551485  702
2020-05-23 02:00:00                Innsbruck-Land     388  216,375377820408  703
2020-05-23 02:00:00               Innsbruck-Stadt     410  310,347437741276  701
...                                           ...     ...               ...  ...
2020-05-23 03:00:00        Sankt Johann im Pongau     434  538,641976840877  504
2020-05-23 03:00:00            Sankt Pölten(Land)     267  203,748359329691  319
2020-05-23 03:00:00           Sankt Pölten(Stadt)      90  163,505559189012  302
2020-05-23 03:00:00        Sankt Veit an der Glan      52  95,3166529190725  205
2020-05-23 03:00:00                     Schärding      58  101,209276353674  414
2020-05-23 03:00:00                      Scheibbs     103  248,774243412313  320
2020-05-23 03:00:00                        Schwaz     352   419,68213847126  709
2020-05-23 03:00:00           Spittal an der Drau      42  55,1970666701712  206
2020-05-23 03:00:00                  Steyr(Stadt)      62  162,333411881758  402
2020-05-23 03:00:00                    Steyr-Land     181  299,534976086849  415
2020-05-23 03:00:00              Südoststeiermark      62  72,1374800749299  623
2020-05-23 03:00:00                       Tamsweg      22  108,267716535433  505
2020-05-23 03:00:00                         Tulln     175  168,640564319511  321
2020-05-23 03:00:00               Urfahr-Umgebung     281  328,635752295187  416
2020-05-23 03:00:00                  Villach Land      48  74,2252737056968  207
2020-05-23 03:00:00                 Villach Stadt      21  33,7387336728628  202
2020-05-23 03:00:00                   Vöcklabruck     107  78,5303809824371  417
2020-05-23 03:00:00                     Voitsberg     106  207,189069799261  616
2020-05-23 03:00:00                   Völkermarkt      52  124,170208701466  208
2020-05-23 03:00:00        Waidhofen an der Thaya      73  281,983930778739  322
2020-05-23 03:00:00  Waidhofen an der Ybbs(Stadt)      25  222,005150519492  303
2020-05-23 03:00:00                          Weiz     197  218,057846208339  617
2020-05-23 03:00:00                   Wels(Stadt)      58  93,9621235439921  403
2020-05-23 03:00:00                     Wels-Land      79  108,080006566886  418
2020-05-23 03:00:00                   Wien(Stadt)    3048  160,633172963666  900
2020-05-23 03:00:00         Wiener Neustadt(Land)      86  110,269133617982  323
2020-05-23 03:00:00        Wiener Neustadt(Stadt)      56  123,683106212867  304
2020-05-23 03:00:00                     Wolfsberg      43  81,5536926753404  209
2020-05-23 03:00:00                   Zell am See     359  410,463972925385  506
2020-05-23 03:00:00                        Zwettl      73  172,895646819194  325

[279 rows x 4 columns]

最后,您可以将输出写入csv文件.我不确定您是否要像当前代码那样将文件名附加到输出的csv文件中,因为它每次都使用相同的名称"Bezirke.csv".

Finally, you can write the output to the csv file. I am not sure if you want to append the file name to the output csv file as your current code does because its going to the same name 'Bezirke.csv' everytime.

这篇关于Python-从多个Zip文件中提取CSV文件并合并数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆