Decrypting a file to a stream and reading the stream into pandas (hdf or stata)

Problem description

Overview of what I'm trying to do: I have encrypted versions of files that I need to read into pandas. For a couple of reasons it is much better to decrypt into a stream rather than a file, so that's my main interest below, although I also attempt to decrypt to a file as an intermediate step (which isn't working either).

I'm able to get this working for a csv, but not for either hdf or stata (I'd accept an answer that works for either hdf or stata, though the answer might be the same for both, which is why I'm combining in one question).

The code for encrypting/decrypting files is taken from another Stack Overflow question (which I can't find at the moment).

import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES

def pad(s):
    return s + b"\0" * (AES.block_size - len(s) % AES.block_size)

def encrypt(message, key, key_size=256):
    message = pad(message)
    iv = Random.new().read(AES.block_size)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(message)

def decrypt(ciphertext, key):
    iv = ciphertext[:AES.block_size]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = cipher.decrypt(ciphertext[AES.block_size:])
    return plaintext.rstrip(b"\0")

def encrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        plaintext = fo.read()
    enc = encrypt(plaintext, key)
    with open(file_name + ".enc", 'wb') as fo:
        fo.write(enc)

def decrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    with open(file_name[:-4], 'wb') as fo:
        fo.write(dec)

And here's my attempt to extend the code to decrypt to a stream rather than a file.

def decrypt_stream(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    cipherbyte = io.BytesIO()
    cipherbyte.write(dec)
    cipherbyte.seek(0)
    return cipherbyte 

Finally, here's the sample program with sample data attempting to make this work:

key = 'this is an example key'[:16]
df = pd.DataFrame({ 'x':[1,2], 'y':[3,4] })

df.to_csv('test.csv',index=False)
df.to_hdf('test.h5','test',mode='w')
df.to_stata('test.dta')

encrypt_file('test.csv',key)
encrypt_file('test.h5',key)
encrypt_file('test.dta',key)

decrypt_file('test.csv.enc',key)
decrypt_file('test.h5.enc',key)
decrypt_file('test.dta.enc',key)

# csv works here but hdf and stata don't
# I'm less interested in this part but include it for completeness
df_from_file = pd.read_csv('test.csv')
df_from_file = pd.read_hdf('test.h5','test')
df_from_file = pd.read_stata('test.dta')

# csv works here but hdf and stata don't
# the hdf and stata lines below are what I really need to get working
df_from_stream = pd.read_csv( decrypt_stream('test.csv.enc',key) )
df_from_stream = pd.read_hdf( decrypt_stream('test.h5.enc',key), 'test' )
df_from_stream = pd.read_stata( decrypt_stream('test.dta.enc',key) )

Unfortunately, I don't think I can shrink this code any further and still have a complete example.

Again, my hope is to get all 4 non-working lines above working (file and stream, for both hdf and stata), but I'm happy to accept an answer that works for either the hdf stream alone or the stata stream alone.

Also, I'm open to other encryption alternatives; I just used some existing pycrypto-based code that I found here on SO. My work explicitly requires 256-bit AES, but beyond that I'm open, so the solution needn't be based specifically on the pycrypto library or the specific code example above.

Info on my setup:

python: 3.4.3
pandas: 0.17.0 (anaconda 2.3.0 distribution)
mac os: 10.11.3

Solution

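The biggest issue is the padding/unpadding method. It assumes that a null byte can never be part of the actual content, so rstrip(b"\0") also strips any null bytes that legitimately end the file. Since stata/hdf files are binary, it's safer to pad with n bytes that each hold the value n, where n is the number of padding bytes added (this is the PKCS#7 scheme). That number is then read back from the last byte during unpadding.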
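To make the failure concrete, here is a minimal sketch that needs no encryption library. It assumes a 16-byte block size (the value of AES.block_size), and the helper names null_pad, null_unpad, count_pad, and count_unpad are hypothetical, introduced only for this illustration. It compares the question's null padding with the count-based padding on a payload that happens to end in null bytes:

```python
BLOCK_SIZE = 16  # assumption: same value as AES.block_size


def null_pad(data):
    """Question's scheme: pad with null bytes up to a block boundary."""
    return data + b"\0" * (BLOCK_SIZE - len(data) % BLOCK_SIZE)


def null_unpad(data):
    """Question's scheme: strip every trailing null byte."""
    return data.rstrip(b"\0")


def count_pad(data):
    """Count-based scheme: append n bytes, each with the value n."""
    n = BLOCK_SIZE - len(data) % BLOCK_SIZE
    return data + bytes([n] * n)


def count_unpad(data):
    """Count-based scheme: the last byte says how many bytes to drop."""
    return data[:-data[-1]]


# A binary payload that legitimately ends in null bytes
payload = b"\x01\x02\x03\x00\x00"

print(null_unpad(null_pad(payload)) == payload)    # False: real nulls were stripped too
print(count_unpad(count_pad(payload)) == payload)  # True: round trip is exact
```

A csv file is text and almost never ends in a null byte, which is consistent with the csv case working in the question while the binary formats did not.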

Also, for the time being, read_hdf doesn't support reading from a file-like object, even though the API documentation claims it does. If we restrict ourselves to the stata format, the following code will do what you need:

import pandas as pd
import io
from Crypto import Random
from Crypto.Cipher import AES

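def pad(s):
    # Python 2 str version: append n characters, each with code n,
    # so the padding records its own length
    n = AES.block_size - len(s) % AES.block_size
    return s + n * chr(n)

def unpad(s):
    # The last character holds the padding length; drop that many characters
    return s[:-ord(s[-1])]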

def encrypt(message, key, key_size=256):
    message = pad(message)
    iv = Random.new().read(AES.block_size)
    cipher = AES.new(key, AES.MODE_CBC, iv)
    return iv + cipher.encrypt(message)

def decrypt(ciphertext, key):
    iv = ciphertext[:AES.block_size]
    cipher = AES.new(key, AES.MODE_CBC, iv)
    plaintext = cipher.decrypt(ciphertext[AES.block_size:])
    return unpad(plaintext)

def encrypt_file(file_name, key):
    with open(file_name, 'rb') as fo:
        plaintext = fo.read()
    enc = encrypt(plaintext, key)
    with open(file_name + ".enc", 'wb') as fo:
        fo.write(enc)

def decrypt_stream(file_name, key):
    with open(file_name, 'rb') as fo:
        ciphertext = fo.read()
    dec = decrypt(ciphertext, key)
    cipherbyte = io.BytesIO()
    cipherbyte.write(dec)
    cipherbyte.seek(0)
    return cipherbyte

key = 'this is an example key'[:16]

df = pd.DataFrame({
    'x': [1,2],
    'y': [3,4]
})

df.to_stata('test.dta')

encrypt_file('test.dta', key)

print(pd.read_stata(decrypt_stream('test.dta.enc', key)))

Output:

   index  x  y
0      0  1  3
1      1  2  4
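One thing to watch: the example key above is 16 characters long, which selects AES-128. AES picks its variant from the key length (16, 24, or 32 bytes), and the key_size=256 argument of encrypt is never used. Since the question asks for 256-bit AES, a 32-byte key is needed. A possible way to get one from a passphrase is sketched below with the standard library's hashlib.pbkdf2_hmac. The derive_key helper, the salt handling, and the iteration count are illustrative assumptions, not part of the original answer:

```python
import hashlib
import os


def derive_key(passphrase, salt, iterations=200000):
    """Derive a 32-byte (256-bit) key from a text passphrase.

    Uses PBKDF2-HMAC-SHA256; the iteration count is an illustrative
    assumption, not a vetted recommendation.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is not secret, but the same salt is needed again to decrypt,
# so it has to be stored alongside the ciphertext.
salt = os.urandom(16)
key = derive_key("this is an example key", salt)

print(len(key))  # 32
```

The resulting bytes object can be passed as the key argument of the encrypt and decrypt functions above in place of the 16-character string.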

In Python 3, where indexing a bytes object returns an integer, you can use the following pad and unpad versions:

def pad(s):
    n = AES.block_size - len(s) % AES.block_size
    return s + bytearray([n] * n)

def unpad(s):
    return s[:-s[-1]]
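The short unpad above trusts the last byte. If the file is decrypted with the wrong key, that byte is effectively arbitrary and the slice quietly returns garbage (or an empty result when the byte is 0). A slightly more defensive variant is sketched below; the unpad_checked name is hypothetical and the 16-byte block size is assumed to match AES.block_size:

```python
BLOCK_SIZE = 16  # assumption: same value as AES.block_size


def pad(s):
    """Append n bytes, each with the value n (1 <= n <= BLOCK_SIZE)."""
    n = BLOCK_SIZE - len(s) % BLOCK_SIZE
    return s + bytes([n] * n)


def unpad_checked(s):
    """Remove count-based padding, raising ValueError if it is malformed."""
    if not s or len(s) % BLOCK_SIZE != 0:
        raise ValueError("length is not a positive multiple of the block size")
    n = s[-1]
    # Every one of the last n bytes must equal n, otherwise the data
    # was not padded by pad() (wrong key or corrupted file).
    if n < 1 or n > BLOCK_SIZE or s[-n:] != bytes([n] * n):
        raise ValueError("invalid padding (wrong key or corrupted data?)")
    return s[:-n]


print(unpad_checked(pad(b"\x01\x00\x00")))  # b'\x01\x00\x00'
```

Raising early makes a bad key show up as a clear error at decryption time instead of as a confusing parse failure inside read_stata.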
