Python:无需剪贴板即可从Office/Excel文档访问嵌入式OLE [英] Python: Access embedded OLE from Office/Excel document without clipboard

查看:236
本文介绍了Python:无需剪贴板即可从Office/Excel文档访问嵌入式OLE的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用Python从Office/Excel文档中添加和提取文件.到目前为止,添加东西很容易,但是对于提取,我还没有找到一个干净的解决方案.

I want to add and extract files from an Office/Excel document using Python. So far adding things is easy but for extracting I haven't found a clean solution.

为弄清楚我所拥有的和没有什么,我在下面写了一个小示例 test.py 并作了进一步说明.

To make clear what I've got and what not I've written the small example test.py below and explain further.

test.py

import win32com.client as win32
import os 
from tkinter import messagebox
import win32clipboard

# (0) Setup
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(dir_path + "\\" + "test_excel.xlsx")
ws = wb.Worksheets.Item(1)
objs = ws.OLEObjects()

# (1) Embed file
f = dir_path + "\\" + "test_txt.txt"
name = "test_txt_ole.txt"
objs.Add( Filename=f, IconLabel=name )

# (2) Access embedded file
obj = objs.Item(1) # Get single OLE from OLE list
obj.Copy()
win32clipboard.OpenClipboard()
data = win32clipboard.GetClipboardData(0xC004) # Binary access
win32clipboard.EmptyClipboard()
win32clipboard.CloseClipboard()
messagebox.showinfo(title="test_txt_ole.txt", message=str(data))

# (3) Press don't save here to keep 
# wb.Close() # Will close excel document and leave excel opened.
excel.Application.Quit() # Will close excel with all opened documents

要进行准备(第0步),它会打开一个给定的excel文档,其中包含一个以前使用excel中的新文档按钮创建的工作表.

For preparation (step 0) it opens a given excel document with one worksheet that was create before by using new document button in excel.

步骤(1)中,它使用API​​将给定的文本文件嵌入到excel文档中.之前使用文本编辑器创建了文本文件,内容为"TEST123".

In step (1) it uses API to embed a given text file to the excel document. The text file was created before with content "TEST123" using a text editor.

然后在步骤(2)中,它尝试使用剪贴板从嵌入式OLE读取内容,并打开一个消息框,该消息框显示剪贴板中OLE的内容.

Afterwards in step (2) it tries to read back content from embedded OLE using clipboard and opens a message box that shows the content from OLE in clipboard.

最后(3)程序将关闭打开的文档.要保持设置不变,请在此处按否".

Finally (3) the program closes the opened document. To keep an unchanged setup press no here.

此解决方案的最大缺点是使用剪贴板,该剪贴板会破坏剪贴板中的任何用户内容,而这在生产环境中是不好的样式.此外,它对剪贴板使用了未记录的选项.

The big disadvantage of this solution is the use of clipboard which smashes any user content in clipboard which is bad style in a productive environment. Further it uses an undocumented option for clipboard.

更好的解决方案是将OLE或OLE嵌入式文件安全地存储到python数据容器或我选择的文件中.在我的示例中,我使用了TXT文件来轻松识别文件数据.最后,我将使用ZIP来实现多合一解决方案,但是TXT文件解决方案对于base64数据就足够了.

A better solution would be to safe OLE or OLE embedded file to a python data container or to a file of my choice. In my example I've used a TXT file to easily identify file data. Finally I'll use ZIP for an all-in-one solution but a TXT file solution would be sufficient for base64 data.

0xC004的来源= 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

Source of 0xC004 = 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

这个VBA示例看起来很有趣,但是我对VBA毫无头绪:

This VBA example look interesting but I have no clue about VBA: Saving embedded OLE Object (Excel workbook) to file in Excel 2010

推荐答案

好吧,我发现Parfait的解决方案有点黑(在不好的意义上),因为

Well, I find Parfait's solution a bit hackish (in the bad sense) because

  • 它假定Excel将把嵌入另存为一个临时文件,
  • 假定此临时文件的路径始终是用户的默认临时路径,
  • 它假定您将有权在此处打开文件,
  • 它假定您使用命名约定来标识您的对象(例如,总是在名称中找到"test_txt",但不能 插入一个对象'account_data')
  • 假定该约定不受操作系统干扰(例如,它将不会更改为'〜test_tx(1)'以保存字符) 长度),
  • 它假定此约定已为计算机上的所有其他程序所了解并接受(其他人都不会使用包含"test_txt"的名称).
  • it assumes that Excel will save the embedding as a temporary file,
  • it assumes that the path of this temporary file is always the user's default temp path,
  • it assumes that you will have privileges to open files there,
  • it assumes that you use a naming convention to identify your objects (e.g. 'test_txt' is always found in the name, you can't insert an object 'account_data'),
  • it assumes that this convention is not disturbed by the operating system (e.g. it will not change it to '~test_tx(1)' to save character length),
  • it assumes that this convention is known and accepted by all other programs on the computer (no one else will uses names that contain 'test_txt').

因此,我写了一个替代解决方案.其实质是以下几点:

So, I wrote an alternative solution. The essence of this is thef following:

  1. 将.xlsx文件(或新的基于XML的任何其他Office文件)解压缩 格式,不受密码保护)到临时路径.

  1. unzip the .xlsx file (or any other Office file in the new XML-based format, which is not password protected) to a temporary path.

遍历'/xxx/embeddings'('xxx'= 'xl'或'word'或'ppt'),然后创建一个包含.bin的字典 文件的临时路径作为键,并且从中返回字典 步骤3作为值.

iterate through all .bin files inside the '/xxx/embeddings' ('xxx' = 'xl' or 'word' or 'ppt'), and create a dictionary that contains the .bin files' temporary paths as keys and the dictionaries returned from step 3 as values.

根据.bin文件从.bin文件中提取信息(不是很 有据可查的)Ole Packager格式,并以 一本字典. (不仅将原始二进制数据检索为内容", 来自.txt,但任何文件类型,例如.png)

extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (Retrieves the raw binary data as 'contents', not only from .txt but any file type, e.g. .png)

我仍在学习Python,所以这不是完美的(没有错误检查,没有性能优化),但是您可以从中获得灵感.我通过几个示例对其进行了测试. 这是我的代码:

I'm still learning Python, so this is not perfect (no error checking, no performance optimization) but you can get the idea from it. I tested it on a few examples. Here is my code:

import tempfile
import os
import shutil
import zipfile
import glob
import pythoncom
import win32com.storagecon


def read_zipped_xml_bin_embeddings( path_zipped_xml ):
    temp_dir = tempfile.mkdtemp()

    zip_file = zipfile.ZipFile( path_zipped_xml )
    zip_file.extractall( temp_dir )
    zip_file.close()

    subdir = {
            '.xlsx': 'xl',
            '.xlsm': 'xl',
            '.xltx': 'xl',
            '.xltm': 'xl',
            '.docx': 'word',
            '.dotx': 'word',
            '.docm': 'word',
            '.dotm': 'word',
            '.pptx': 'ppt',
            '.pptm': 'ppt',
            '.potx': 'ppt',
            '.potm': 'ppt',
        }[ os.path.splitext( path_zipped_xml )[ 1 ] ]
    embeddings_dir = temp_dir + '\\' + subdir + '\\embeddings\\*.bin'

    result = {}
    for bin_file in list( glob.glob( embeddings_dir ) ):
        result[ bin_file ] = bin_embedding_to_dictionary( bin_file )

    shutil.rmtree( temp_dir )

    return result


def bin_embedding_to_dictionary( bin_file ):
    storage = pythoncom.StgOpenStorage( bin_file, None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )
    for stastg in storage.EnumElements():
        if stastg[ 0 ] == '\1Ole10Native':
            stream = storage.OpenStream( stastg[ 0 ], None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )

            result = {}
            result[ 'original_filename' ] = '' # original filename in ANSI starts at byte 7 and is null terminated
            stream.Seek( 6, 0 )
            while True:
                ch = stream.Read( 1 )
                if ch == '\0':
                    break
                result[ 'original_filename' ] += ch

            result[ 'original_filepath' ] = '' # original filepath in ANSI is next and is null terminated
            while True:
                ch = stream.Read( 1 )
                if ch == '\0':
                    break
                result[ 'original_filepath' ] += ch

            stream.Seek( 4, 1 ) # next 4 bytes is unused

            temporary_filepath_size = 0 # size of the temporary file path in ANSI in little endian
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 0
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 8
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 16
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 24

            result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ) # temporary file path in ANSI

            result[ 'size' ] = 0 # size of the contents in little endian
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 0
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 8
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 16
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 24

            result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents

            return result

您可以像这样使用它:

objects = read_zipped_xml_bin_embeddings( dir_path + '\\test_excel.xlsx' )
obj = objects.values()[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
print( 'Original filename: ' + obj[ 'original_filename' ] )
print( 'Original filepath: ' + obj[ 'original_filepath' ] )
print( 'Original filepath: ' + obj[ 'temporary_filepath' ] )
print( 'Contents: ' + obj[ 'contents' ] )

这篇关于Python:无需剪贴板即可从Office/Excel文档访问嵌入式OLE的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆