Pandas read_fwf not Loading Entire Content of File


Question

I have a rather large fixed-width file (~30M rows, 4 GB). When I tried to create a DataFrame from it using pandas read_fwf(), only a portion of the file was loaded. Has anyone had a similar issue with this parser not reading the entire contents of a file?

import pandas as pd

file_name = r"C:\....\file.txt"
fwidths = [3, 7, 9, 11, 51, 51]

df = pd.read_fwf(file_name, widths=fwidths,
                 names=["col0", "col1", "col2", "col3", "col4", "col5"])
print(df.shape)  # fewer than 30M rows

If I naively read the file into a single column using read_csv(), the whole file is read into memory and no data is lost.

import pandas as pd

file_name = r"C:\....\file.txt"

# Arbitrary delimiter: the file doesn't include pipes, so each line lands in one column
df = pd.read_csv(file_name, delimiter="|", names=["col0"])
print(df.shape)  # ~30M rows
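Since the single-column read keeps every row, one possible workaround is to slice that column into fixed-width fields by hand. The following is only a minimal sketch, not a drop-in replacement for read_fwf: the helper name `split_fixed_width`, the three-column widths, and the in-memory sample are made up for illustration, and the `quoting` and `keep_default_na` settings are my own assumptions to keep read_csv from interpreting quote characters or strings like "NA".

```python
import csv
import io

import pandas as pd


def split_fixed_width(lines, widths, names):
    """Slice a Series of raw text lines into stripped fixed-width string columns."""
    columns = {}
    start = 0
    for name, width in zip(names, widths):
        columns[name] = lines.str.slice(start, start + width).str.strip()
        start += width
    return pd.DataFrame(columns)


# Hypothetical sample standing in for the real file (widths 3, 7, 9).
raw = io.StringIO("abc1234567123456789\nxy 12     hello    \n")

# Read each line into one string column; "|" is assumed not to occur in the data.
lines = pd.read_csv(
    raw,
    sep="|",
    header=None,
    names=["raw"],
    dtype=str,
    quoting=csv.QUOTE_NONE,   # assumption: treat quote characters as plain text
    keep_default_na=False,    # assumption: keep strings such as "NA" as-is
)["raw"]

df = split_fixed_width(lines, widths=[3, 7, 9], names=["col0", "col1", "col2"])
print(df)
```

All columns come back as strings here, so numeric fields would still need an explicit conversion (for example with pd.to_numeric) afterwards.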

Of course, without seeing the contents or format of the file, this could be related to something on my end, but I wanted to see if anyone else has run into this in the past. As a sanity check, I tested a couple of rows deep in the file, and they all seem to be formatted correctly (further verified when I was able to load the file into an Oracle DB with Talend using the same specs).

Let me know if anyone has any ideas. It would be great to run everything via Python and not go back and forth when I begin to develop analytics.

Recommended answer

A few lines of the input file would be useful to see what the data looks like. Nevertheless, I generated a random file in a format similar (I think) to yours and applied pd.read_fwf to it. This is the code for generating and reading it:

from random import random

import pandas as pd


file_name = r"/tmp/file.txt"

lines_no = int(30e6)

with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i%int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                    .format(float(i) / float(lines_no)*100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random()*10) for v in range(6)])+"\n")


print("File created. Now read it using pd.read_fwf ...")

fwidths = [11,11,11,11,11,11]

df = pd.read_fwf(file_name, widths = fwidths,
               names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])


#print(df)

print(df.shape)  # (30000000, 6) in my test

So in this case, it seems to be working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf, but it works, at least for me and my setup.

The result is: (30000000, 6)

Sample of the created file:

7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
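To check whether rows are being dropped on a particular file, one option is to compare the raw line count against the number of rows read_fwf returns, reading in chunks so memory stays bounded. This is a rough sketch under my own assumptions: the helper names `count_lines` and `count_fwf_rows` are hypothetical, and the demo uses a small temporary file in the same six-column format as above rather than a 30M-row one.

```python
import os
import tempfile

import pandas as pd


def count_lines(path, block_size=1 << 20):
    """Count newline characters by scanning the file in binary blocks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += block.count(b"\n")
    return total


def count_fwf_rows(path, widths, chunksize=100_000):
    """Count the rows pd.read_fwf yields, one chunk at a time."""
    total = 0
    for chunk in pd.read_fwf(path, widths=widths, header=None, chunksize=chunksize):
        total += len(chunk)
    return total


# Small deterministic demo file: six 10-character floats separated by spaces.
n_lines = 1000
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for i in range(n_lines):
        values = ["{:<10.8f}".format((i + k) % 10 + 0.5) for k in range(6)]
        f.write(" ".join(values) + "\n")
    path = f.name

raw_lines = count_lines(path)
parsed_rows = count_fwf_rows(path, widths=[11] * 6, chunksize=300)
os.remove(path)

print(raw_lines, parsed_rows)
```

If the two counts differ on the real file, tracking the running row total per chunk should narrow down roughly where in the file the parser starts to diverge.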
