Convert data from a .data file to a .csv file and put the data in columns using pandas


Question

I want to convert data from a .data file to a .csv file, with the data arranged in named columns and the values listed under them. However, the .data file has a specific format, and I don't know how to get its text into columns. Here is what the .data file looks like:

column1  
column2  
column3  
column4  
column5  
column6  
column7  
column8  
column9  
column10  
column11  
column12  
column13  
........
column36

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444  

1243;6543;5754;5678;4567;4567;4567;2573;7532;6332;6432;6542;5542;7883;7643;4684;4568;4573  
3567;5533;6532;6432;7643;8635;7654;6543;8753;7643;7543;7543;7543;6543;6444;7543;6444;6444

The file shown above starts with the names of 36 columns, one name per line. Below these are many data points, each with 36 values separated by semicolons. Each data point spans 2 lines (18 values per line), and data points are separated from each other by a blank line. The .csv file must look like this:

column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column13,column14,column15,column16,column17,column18,column19,column20,column21,column22,column23,column24,column25,column26,column27,column28,column29,column30,column31,column32,column33,column34,column35,column36
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444
1243,6543,5754,5678,4567,4567,4567,2573,7532,6332,6432,6542,5542,7883,7643,4684,4568,4573,3567,5533,6532,6432,7643,8635,7654,6543,8753,7643,7543,7543,7543,6543,6444,7543,6444,6444

The first line of the .csv file, as shown above, must contain the 36 column names separated by commas. The following lines must contain the data points, one per line, with the 36 values of each data point separated by commas.

Can the pandas library be used for this? In any case, this is my starting code:

with open("file.data") as fIn, open("file.csv", "w") as fOut:
    for r, line in enumerate(fIn):
        if not line:
            break

Thanks

Recommended answer

Sure, you can do it with pandas. You just need to read the first N lines (36 in your case) to use them as the header, then read the rest of the file like a normal CSV with a semicolon delimiter (pandas is good at that). After that, you can save the resulting pandas.DataFrame object to CSV.

Since each data point is split across two adjacent lines, we should split the DataFrame we've read into two (even rows and odd rows) and stack the two halves next to each other (horizontally).
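To illustrate the split-and-stack idea on a small scale, here is a minimal sketch on a made-up 4-row, 2-column DataFrame. The values and the names `demo`, `first_halves`, `second_halves` and `joined` are illustrative assumptions, not taken from the question:

```python
import pandas as pd

# Hypothetical toy data: each record spans two consecutive rows.
demo = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]])

# Rows 0, 2, ... hold the first half of each record; rows 1, 3, ... the second half.
first_halves = demo.iloc[::2].reset_index(drop=True)
second_halves = demo.iloc[1::2].reset_index(drop=True)

# Place the halves side by side so that each record occupies a single row.
joined = pd.concat([first_halves, second_halves], axis=1)
joined.columns = ["a", "b", "c", "d"]
print(joined.values.tolist())
```

Resetting the index matters here: `pd.concat` with `axis=1` aligns rows by index label, so without `reset_index(drop=True)` the even rows (labels 0, 2) and odd rows (labels 1, 3) would not line up.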

Consider the following code:

import pandas as pd

COLUMNS_COUNT = 36
# read first `COLUMNS_COUNT` lines to serve as a header
with open('data.data', 'r') as f:
    columns = [next(f).strip() for _ in range(COLUMNS_COUNT)]
# read rest of the file to temporary DataFrame
temp_df = pd.read_csv('data.data', skiprows=COLUMNS_COUNT, header=None, delimiter=';', skip_blank_lines=True)
# split temp DataFrame on even and odd rows
even_df = temp_df.iloc[::2].reset_index(drop=True)
odd_df = temp_df.iloc[1::2].reset_index(drop=True)
# stack even and odd DataFrames horizontally
df = pd.concat([even_df, odd_df], axis=1)
# assign column names
df.columns = columns
# save result DataFrame to csv
df.to_csv('out.csv', index=False)

UPD: code updated to correctly process data split across two lines.
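For comparison, the same conversion can be sketched without pandas, closer in spirit to the starting code in the question. This is only an illustration under stated assumptions: the helper name `data_to_csv_rows` is hypothetical, the sample uses 4 columns instead of 36 to stay short, and it assumes that no column name is blank and that no value contains a semicolon.

```python
import csv
import io


def data_to_csv_rows(lines, n_columns):
    """Return the header row plus one row per record from the .data layout."""
    # Drop blank lines and surrounding whitespace (including trailing spaces).
    stripped = [line.strip() for line in lines if line.strip()]
    header = stripped[:n_columns]
    rows = [header]
    current = []
    for line in stripped[n_columns:]:
        # A record may span several lines; collect values until it is complete.
        current.extend(line.split(";"))
        while len(current) >= n_columns:
            rows.append(current[:n_columns])
            current = current[n_columns:]
    return rows


# Hypothetical miniature input: 4 column names, then two 2-line records.
sample = "c1\nc2\nc3\nc4\n\n1;2\n3;4\n\n5;6\n7;8\n"
rows = data_to_csv_rows(sample.splitlines(), n_columns=4)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```

For a real file, the open file object could be passed in place of `sample.splitlines()`, and the rows written with `csv.writer` to a file opened with `newline=""`.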
