在pandas read_csv之前预处理数据文件 [英] Pre-process data file before pandas read_csv
问题描述
我使用SAP的数据输出,但它既不是CSV,因为它不引用包含其分隔符的字符串,也不固定宽度,因为它有多字节字符。
要获取它到pandas我目前读的文件,获取分隔符的位置,分隔每一行的分隔符然后将其保存为适当的CSV,我可以毫无困难地阅读。
我看到pandas read_csv可以获得文件缓冲区。我如何直接传递我的流,而不保存csv文件?我应该做一个发电机吗?我可以得到csv.writer.writerow输出而不给它一个文件句柄?
这是我的代码:
import pandas as pd
caminho = r'C:\Users\user\Documents\SAP\Tests\\'
arquivo =ExpComp_01.txt
tipo_dado = {KEY_GUID:object,DEL_IND:object,HDR_GUID:object,PRICE:object LEADTIME:int16,MANUFACTURER:object,LOAD_TIME:object,APPR_TIME:object,SEND_TIME:object,DESCRIPTION:object}
def desmembra(linha,limites):
#这个函数接收每个分隔符的索引并在其周围裁剪
posicao = limites [0]
限制limites [1:] :
yield linha [posicao + 1:limite]
posicao = limite
def pre_processa(arquivo):
import csv
import os
#转换标准CSV中的SAP输出
打开(arquivo,r,encoding =mbcs)as entrada,open(arquivo [: - 3] +
csv,w ,newline =,encoding =mbcs)as said:
escreve = csv.writer(saida,csv.QUOTE_MINIMAL,delimiter =;)。writerow
for entrada:
#查找标题
如果行[0] ==|:
delimitadores = [x for x,v in enumerate(line)if v =='|']
if line [-2]!=|:
delimitadores.append(None)
cabecalho_teste = line [:50]
escreve([campo.strip()for campo in desmembra ,delimitadores)])
break
对于entrada中的行:
如果line [0] ==|和行[:50]!= cabecalho_teste:
escreve([campo.strip()for campo in desmembra(line,delimitadores)])
pre_processa(caminho + arquivo)
dados = pd.read_csv(caminho + arquivo [:3] +csv,sep =;,
header = 0,encoding =mbcs,dtype = tipo_dado)
此外,如果您可以分享最佳做法:
datetime string as this 20.120.813.132.432
其中我可以成功转换使用
dados [SEND_TIME] = pd.to_datetime(dados [SEND_TIME],format =%Y%m%d%H%M%S)
dados [SEND_TIME]。replace regex = False,inplace = True,to_replace = r'。',value = r'')
我不能写一个解析器,因为我有日期存储在不同的字符串格式。它会更快地指定一个转换器在导入期间做它或有熊猫做它在最后列吗?
我有一个类似的问题与代码 99999999
,我必须添加点到 99.999.999
。我不知道如果我应该写一个转换器或等待,直到导入后做一个 df.replace
EDIT - 样本数据:
| KEY_GUID | DEL_IND | HDR_GUID | Prod_CD | DESCRIPTION |价格| LEADTIME |制造商| LOAD_TIME | APPR_TIME | SEND_TIME |
---------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ---------------------------------
| 000427507E64FB29E2006281548EB186 |电子邮件|注册|忘记密码? 29,55 | 30 | | 20.120.813.132.432 | 2012081313292929 | 20.120.505.010.157 |
| 000527507E64FB29E2006281548EB186 | |下一页| 122,91 | 30 | | 20.120.813.132.432 | 20120813132929 | 20.120.505.010.141 |
| 0005DB50112F9E69E10000000A1D2028 | | 384BB350BF56315DE20062700D627978 | 75123676 |Dnerasodáeot.sadot.m| 252.446,99 | 3 |波兰| 20.121.226.175.640 | 2012.1226183608 | 20.121.222.000.015 |
| 000627507E64FB29E2006281548EB186 | |新手上路| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 657,49 | 30 | | 20.120.813.132.432 | 20120813132929 | | 20.120.505.010.128 |
| 000727507E64FB29E2006281548EB186 | | 4C1AD7E25DC50D61E10000000A19FF83 | | Rnerasodaeot.sadot.m | 523,63 | 30 | | 20.120.813.132.432 | 20.120.707.010.119 |
---------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ---------------------------------
------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------------
| KEY_GUID | DEL_IND | HDR_GUID | Prod_CD | DESCRIPTION |价格| LEADTIME |制造商| LOAD_TIME | APPR_TIME | SEND_TIME |
---------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------- | 000827507E64FB29E2006281548EB186 |电子邮件| 2.073,63 | 30 | | 20.120.813.132.432 | 2012081313292929 | 20.120.505.010.127 |
| 000927507E64FB29E2006281548EB186 | |新浪微博0,22 | 30 | | 20.120.813.132.432 | 2012081313292929 | 20.120.505.010.135 |
| 000A27507E64FB29E2006281548EB186 |电子邮件|注册|忘记密码? 300,75 | 30 | | 20.120.813.132.432 | 20.120.505.010.140 |
| 000B27507E64FB29E2006281548EB186 | | 4C1AD7E25DC50D61E10000000A19FF83 | |Aneraéodaeot.sadot.m| 1,19 | 30 | | 20.120.813.132.432 | 2012081313292929 | 20.120.505.010.131 |
| 000C27507E64FB29E2006281548EB186 |电子邮件| Cnerasodaeot.sadot.m | 30,90 | 30 | | 20.120.813.132.432 | 20120813132929 | 20.120.505.010.144 |
---------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ---------------------------------
我会用其他字段处理其他表。所有在这一般形式。我只能相信标题中的分隔符。我也可能在数据中重复标题。它看起来像一个基本的打印输出。
如果你想建立一个DataFrame没有先写CSV, t
需要 pd.read_csv
。虽然可以使用 io.BytesIO
或
cString.StringIO
来写入内存文件类对象,它不会使
有意义的转换一个可迭代的值(如 desmembra(line,delimitadores)
)
到单而是用 pd.read_csv
来重新解析它。
c $ c> pd.DataFrame ,因为 pd.DataFrame
可接受行数据的迭代器。
使用纯Python操作一个一个的值通常不是最快的方法。一般来说,在整个列上使用Pandas函数更快。因此,我将首先将 arquivo
解析为一个DataFrame字符串,然后使用Pandas函数将列后处理成正确的dtype和值。
import pandas as pd
pre>
import os
import csv
import io
caminho = r'C:\Users\u5en\Documents\SAP\Testes\\'
arquivo = os.path.join(caminho, ExpComp_01.txt)
arquivo_csv = os.path.splitext(arquivo)[0] +'.csv'
def desmembra(linha,limites):
#接收每个分隔符的索引并在其周围裁剪
return [linha [limites [i] +1:limites [i + 1]]。strip()
for i in range(len(limites [: - 1 ])
def pre_processa(arquivo,enc):
#将SAP输出转换成字符串列表的迭代器
with io.open(arquivo,r encoding = enc)as entrada:
对于entrada中的行:
#查找标题
如果line [0] ==|:
delimitadores = [x for x,v in enumerate(line)if v =='|']
if line [-2]!=|:
delimitadores.append(None)
cabecalho_teste = line [:50]
yield desmembra(line,delimitadores)
break
在entrada中的行:
如果line [0] ==|和行[:50]!= cabecalho_teste:
yield desmembra(line,delimitadores)
def post_process(dados):
dados ['LEADTIME'] = dados ['LEADTIME '] .astype('int16')
for('SEND_TIME','LOAD_TIME','PRICE'):
dados [col] = dados [col] .str.replace 。','')
for col('SEND_TIME','LOAD_TIME','APPR_TIME'):
dados [col] = pd.to_datetime(dados [col],format =%Y %m%d%H%M%S)
return dados
enc ='mbcs'
saida = pre_processa(arquivo,enc)
header = next (saida)
dados = pd.DataFrame(saida,columns = header)
dados = post_process(dados)
print(dados)
产生
KEY_GUID DEL_IND HDR_GUID \
0 000427507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
1 000527507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
2 0005DB50112F9E69E10000000A1D2028 384BB350BF56315DE20062700D627978
3 000627507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
4 000727507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
5 000927507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
6 000A27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
7 000B27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
8 000C27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83
PROD_CD描述价格LEADTIME制造商\
0 75123636Vneráéíoaeot.sadot.m29,55 30
1 75123643 Tnerasodaeot | sadot.m 122,91 30
2 75123676Dnerasodáeot.sadot.m252446,99 3波兰
3 75123652 Pner | sodaeot.sadot.m 657,49 30
4 Rnerasodaeot.sadot.m 523,63 30
5 75123662 Ane | asodaeot.sadot.m 0,22 30
6 75123626Pneraíodaeot.sadot.m300,75 30
7Aneraéodaeot.sadot.m1,19 30
8 75123613 Cnerasodaeot.sadot.m 30,90 30
LOAD_TIME APPR_TIME SEND_TIME
0 2012-08-13 13:24:32 2012-08-13 13:29:29 2012- 05-05 01:01:57
1 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:41
2 2012- 12-26 17:56:40 2012-12-26 18:36:08 2012-12-22 00:00:15
3 2012-08-13 13:24:32 2012-08-13 13: 29:29 2012-05-05 01:01:28
4 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-07-07 01:01:19
5 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:35
6 2012-08-13 13:24:32 2012- 08-13 13:29:29 2012-05-05 01:01:40
7 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01: 01:31
8 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:44
I work with data output from SAP, but it is neither CSV, as it does not quote strings containing its delimiter, nor fixed width as it has multi-byte chars. It is kind of a "fixed width" character-wise.
To get it into pandas I currently read the file, get the delimiters position, slice every line around the delimiters and then save it to a proper CSV which I can read without trouble.
I see that pandas read_csv can get a file buffer. How would I pass my stream straight to it, without saving a csv file? Should I make a generator? Can I get csv.writer.writerow output without giving it a file handle?
Here is my code:
import pandas as pd caminho= r'C:\Users\user\Documents\SAP\Tests\\' arquivo = "ExpComp_01.txt" tipo_dado = {"KEY_GUID":"object", "DEL_IND":"object", "HDR_GUID":"object", , "PRICE":"object", "LEADTIME":"int16", "MANUFACTURER":"object", "LOAD_TIME":"object", "APPR_TIME":"object", "SEND_TIME":"object", "DESCRIPTION":"object"} def desmembra(linha, limites): # This functions receives each delimiter's index and cuts around it posicao=limites[0] for limite in limites[1:]: yield linha[posicao+1:limite] posicao=limite def pre_processa(arquivo): import csv import os # Translates SAP output in standard CSV with open(arquivo,"r", encoding="mbcs") as entrada, open(arquivo[:-3] + "csv", "w", newline="", encoding="mbcs") as saida: escreve=csv.writer(saida,csv.QUOTE_MINIMAL, delimiter=";").writerow for line in entrada: # Find heading if line[0]=="|": delimitadores = [x for x, v in enumerate(line) if v == '|'] if line[-2] != "|": delimitadores.append(None) cabecalho_teste=line[:50] escreve([campo.strip() for campo in desmembra(line,delimitadores)]) break for line in entrada: if line[0]=="|" and line[:50]!=cabecalho_teste: escreve([campo.strip() for campo in desmembra(line, delimitadores)]) pre_processa(caminho+arquivo) dados = pd.read_csv(caminho + arquivo[:-3] + "csv", sep=";", header=0, encoding="mbcs", dtype=tipo_dado)
Also, if you could share best practices: I have odd datetime strings as this
20.120.813.132.432
which I can successfully convert usingdados["SEND_TIME"]=pd.to_datetime(dados["SEND_TIME"], format="%Y%m%d%H%M%S") dados["SEND_TIME"].replace(regex=False,inplace=True,to_replace=r'.',value=r'')
I can't write a parser for it because I have dates stored in different string formats. Would it be faster to specify a converter to do it during import or have pandas do it column-wise in the end? I have a similar issue with a code
99999999
that I have to add dots to99.999.999
. I do not know if I should write a converter or wait until after the import to do adf.replace
EDIT -- Sample data:
| KEY_GUID|DEL_IND| HDR_GUID|Prod_CD |DESCRIPTION | PRICE|LEADTIME|MANUFACTURER| LOAD_TIME|APPR_TIME | SEND_TIME| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |000427507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123636|Vneráéíoaeot.sadot.m | 29,55 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.157 | |000527507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123643|Tnerasodaeot|sadot.m | 122,91 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.141 | |0005DB50112F9E69E10000000A1D2028| |384BB350BF56315DE20062700D627978|75123676|Dnerasodáeot.sadot.m |252.446,99 |3 |POLAND |20.121.226.175.640 |20121226183608|20.121.222.000.015 | |000627507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123652|Pner|sodaeot.sadot.m | 657,49 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.128 | |000727507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83| |Rnerasodaeot.sadot.m | 523,63 |30 | |20.120.813.132.432 |20120813132929|20.120.707.010.119 | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | KEY_GUID|DEL_IND| HDR_GUID|Prod_CD |DESCRIPTION | PRICE|LEADTIME|MANUFACTURER| LOAD_TIME|APPR_TIME | SEND_TIME| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |000827507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123603|Inerasodéeot.sadot.m | 2.073,63 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.127 | |000927507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123662|Ane|asodaeot.sadot.m | 0,22 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.135 | |000A27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123626|Pneraíodaeot.sadot.m | 300,75 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.140 | |000B27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83| |Aneraéodaeot.sadot.m | 1,19 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.131 | |000C27507E64FB29E2006281548EB186| |4C1AD7E25DC50D61E10000000A19FF83|75123613|Cnerasodaeot.sadot.m | 30,90 |30 | |20.120.813.132.432 |20120813132929|20.120.505.010.144 | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
I would be dealing with other tables with other fields. All in this general form. I can only trust the separators in the heading. Also I may have repeated headings in the data. It looks like a matricial printout.
解决方案If you want to build a DataFrame without first writing to a CSV, then you don't need
pd.read_csv
. While it is possible to useio.BytesIO
orcString.StringIO
to write to an in-memory file-like object, it doesn't make sense to convert an iterable of values (likedesmembra(line, delimitadores)
) to a single string just to re-parse it withpd.read_csv
.Instead, it is more direct to use
pd.DataFrame
, sincepd.DataFrame
can accept an iterator of row data.Operating on values one-by-one using plain Python is usually not the fastest way to go. Generally, using Pandas functions on whole columns is faster. Therefore, I would parse
arquivo
into a DataFrame of strings first, and then use Pandas functions to post-process the columns into the correct dtype and values.
import pandas as pd import os import csv import io caminho = r'C:\Users\u5en\Documents\SAP\Testes\\' arquivo = os.path.join(caminho, "ExpComp_01.txt") arquivo_csv = os.path.splitext(arquivo)[0] + '.csv' def desmembra(linha, limites): # This functions receives each delimiter's index and cuts around it return [linha[limites[i]+1:limites[i+1]].strip() for i in range(len(limites[:-1]))] def pre_processa(arquivo, enc): # Translates SAP output into an iterator of lists of strings with io.open(arquivo, "r", encoding=enc) as entrada: for line in entrada: # Find heading if line[0] == "|": delimitadores = [x for x, v in enumerate(line) if v == '|'] if line[-2] != "|": delimitadores.append(None) cabecalho_teste = line[:50] yield desmembra(line, delimitadores) break for line in entrada: if line[0] == "|" and line[:50] != cabecalho_teste: yield desmembra(line, delimitadores) def post_process(dados): dados['LEADTIME'] = dados['LEADTIME'].astype('int16') for col in ('SEND_TIME', 'LOAD_TIME', 'PRICE'): dados[col] = dados[col].str.replace(r'.', '') for col in ('SEND_TIME', 'LOAD_TIME', 'APPR_TIME'): dados[col] = pd.to_datetime(dados[col], format="%Y%m%d%H%M%S") return dados enc = 'mbcs' saida = pre_processa(arquivo, enc) header = next(saida) dados = pd.DataFrame(saida, columns=header) dados = post_process(dados) print(dados)
yields
KEY_GUID DEL_IND HDR_GUID \ 0 000427507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 1 000527507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 2 0005DB50112F9E69E10000000A1D2028 384BB350BF56315DE20062700D627978 3 000627507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 4 000727507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 5 000927507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 6 000A27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 7 000B27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 8 000C27507E64FB29E2006281548EB186 4C1AD7E25DC50D61E10000000A19FF83 Prod_CD DESCRIPTION PRICE LEADTIME MANUFACTURER \ 0 75123636 Vneráéíoaeot.sadot.m 29,55 30 1 75123643 Tnerasodaeot|sadot.m 122,91 30 2 75123676 Dnerasodáeot.sadot.m 252446,99 3 POLAND 3 75123652 Pner|sodaeot.sadot.m 657,49 30 4 Rnerasodaeot.sadot.m 523,63 30 5 75123662 Ane|asodaeot.sadot.m 0,22 30 6 75123626 Pneraíodaeot.sadot.m 300,75 30 7 Aneraéodaeot.sadot.m 1,19 30 8 75123613 Cnerasodaeot.sadot.m 30,90 30 LOAD_TIME APPR_TIME SEND_TIME 0 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:57 1 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:41 2 2012-12-26 17:56:40 2012-12-26 18:36:08 2012-12-22 00:00:15 3 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:28 4 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-07-07 01:01:19 5 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:35 6 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:40 7 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:31 8 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:44
这篇关于在pandas read_csv之前预处理数据文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!