How to extract an OLE file from a Word document (.docx) with Open XML in C#


Question

I want to use the Open XML SDK to extract an "OLE package" from a ".docx" file. I don't know how to do it, and I couldn't find an example in the official samples. Please help me.

This is what I tried: 1. I created a document named "Test.docx" in MS Office 2016 and inserted a ".zip" file into it. I opened "Test.docx" in the Open XML SDK 2.5 Productivity Tool and found this (Figure 1), but the reflected code gave me no information about how to extract the ZIP file.

2. Then I tried to extract the ".zip" file using C# and SharpCompress.dll. Here is the code:

class Program
{
    static void Main(string[] args)
    {
        string filepath = @"C:\Users\宇宙无敌帅小伙\Desktop\test.docx";

        OleFileTest(filepath);
    }

    public static void OleFileTest(string filepath)
    {
        try
        {
            using (WordprocessingDocument Docx = WordprocessingDocument.Open(filepath, true))
            {
                Body body = Docx.MainDocumentPart.Document.Body;

                IEnumerable<EmbeddedObjectPart> embd1 = Docx.MainDocumentPart.EmbeddedObjectParts;

                int cnt = 0;
                foreach (EmbeddedObjectPart item in embd1)
                {
                    System.IO.Stream dt = item.GetStream(FileMode.OpenOrCreate);
                    BinaryWriter writer = new BinaryWriter(dt);
                    byte[] bt = new byte[dt.Length];

                    using (FileStream fs = File.Open($"C:\\Users\\宇宙无敌帅小伙\\Desktop\\{cnt}.zip", FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite))
                    {

                        fs.Write(bt, 0, bt.Length);
                    }
                    cnt++;
                }
            }
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }
}

But I can't open the ".zip" file that I extracted. Can somebody help me? Thanks a lot!

Recommended answer

The challenge is that the binary file you extract from the EmbeddedObjectPart is not your ZIP file. It is a structured storage file that contains your ZIP file.
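One quick way to see this for yourself is to look at the first bytes of what you extracted. The sketch below is only an illustration: the class and method names (ContainerSignature, IsCompoundFile, IsZip) are made up for this example and are not part of the Open XML SDK. It relies on the well-known file signatures: a structured storage (compound) file starts with D0 CF 11 E0 A1 B1 1A E1, while an ordinary non-empty ZIP archive starts with "PK\x03\x04".

```csharp
public static class ContainerSignature
{
    // First eight bytes of an OLE structured storage (compound file) header.
    private static readonly byte[] CompoundFileMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Local file header signature at the start of a non-empty ZIP archive: "PK\x03\x04".
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    public static bool IsCompoundFile(byte[] data) => StartsWith(data, CompoundFileMagic);

    public static bool IsZip(byte[] data) => StartsWith(data, ZipMagic);

    // Returns true when data begins with every byte of prefix, in order.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

If you copy the part stream into a byte array and IsCompoundFile returns true, renaming the file to ".zip" cannot work; the ZIP has to be pulled out of the storage first.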

The following unit test shows how you can extract a ZIP file (e.g., ZipContents.zip) that was embedded as an OLE object, using Microsoft Word, into a Word document ("Resources\\ZipContainer.docx"). Note the usage of the Ole10Native.ExtractFile() method, which extracts the ZIP file from the structured storage file (e.g., oleObject1.bin) embedded in your Word document.

using System.IO;
using CodeSnippets.Windows;
using DocumentFormat.OpenXml.Packaging;
using Xunit;

namespace CodeSnippets.Tests.OpenXml.Wordprocessing
{
    public class EmbeddedObjectPartTests
    {
        private static void ExtractFile(EmbeddedObjectPart part, string destinationFolderPath)
        {
            // Determine the file name and destination path of the binary,
            // structured storage file.
            string binaryFileName = Path.GetFileName(part.Uri.ToString());
            string binaryFilePath = Path.Combine(destinationFolderPath, binaryFileName);

            // Ensure the destination directory exists.
            Directory.CreateDirectory(destinationFolderPath);

            // Copy part contents to structured storage file.
            using (Stream partStream = part.GetStream())
            using (FileStream fileStream = File.Create(binaryFilePath))
            {
                partStream.CopyTo(fileStream);
            }

            // Extract the embedded file from the structured storage file.
            Ole10Native.ExtractFile(binaryFilePath, destinationFolderPath);

            // Remove the structured storage file.
            File.Delete(binaryFilePath);
        }

        [Fact]
        public void CanExtractEmbeddedZipFile()
        {
            const string documentPath = "Resources\\ZipContainer.docx";
            const string destinationFolderPath = "Output";
            string destinationFilePath = Path.Combine(destinationFolderPath, "ZipContents.zip");

            using WordprocessingDocument wordDocument =
                WordprocessingDocument.Open(documentPath, false);

            // Extract all embedded objects.
            foreach (EmbeddedObjectPart part in wordDocument.MainDocumentPart.EmbeddedObjectParts)
            {
                ExtractFile(part, destinationFolderPath);
            }

            Assert.True(File.Exists(destinationFilePath));
        }
    }
}

Here's the gist of the Ole10Native class, which was once published by Microsoft but is a bit hard to find nowadays. The excerpt omits the COM interop declarations it relies on (StgOpenStorage, IStorage, IEnumSTATSTG, and STGM):

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using System.Text.RegularExpressions;

namespace CodeSnippets.Windows
{
    public class Ole10Native
    {
        public static void ExtractFile(string sourceFilePath, string destinationFolder)
        {
            StgOpenStorage(sourceFilePath, null, STGM.READWRITE | STGM.SHARE_EXCLUSIVE, IntPtr.Zero, 0, out IStorage iStorage);
            ProcessPackage(iStorage, destinationFolder);
            Marshal.ReleaseComObject(iStorage);
        }

        private static void ProcessPackage(IStorage pStg, string destinationFolder)
        {
            uint numReturned;
            pStg.EnumElements(0, IntPtr.Zero, 0, out IEnumSTATSTG pEnumStatStg);
            var ss = new STATSTG[1];

            // Loop through the STATSTG structures in the storage.
            do
            {
                // Retrieve the STATSTG structure
                pEnumStatStg.Next(1, ss, out numReturned);
                if (numReturned != 0)
                {
                    //System.Runtime.InteropServices.ComTypes.STATSTG statstm;
                    var bytT = new byte[4];

                    // Check whether pwcsName names the "Ole10Native" stream, which contains the actual embedded object.
                    if (ss[0].pwcsName.Contains("Ole10Native"))
                    {
                        // Open the stream.
                        pStg.OpenStream(ss[0].pwcsName, IntPtr.Zero, (uint) STGM.READ | (uint) STGM.SHARE_EXCLUSIVE, 0,
                            out IStream pStream);

                        //pStream.Stat(out statstm, (int) STATFLAG.STATFLAG_DEFAULT);

                        IntPtr position = IntPtr.Zero;

                        // File name starts from 7th Byte.
                        // Position the cursor to the 7th Byte.
                        pStream.Seek(6, 0, position);

                        var ulRead = new IntPtr();
                        var filename = new char[260];
                        int i;

                        // Read the File name of the embedded object
                        for (i = 0; i < 260; i++)
                        {
                            pStream.Read(bytT, 1, ulRead);
                            pStream.Seek(0, 1, position);
                            filename[i] = (char) bytT[0];
                            if (bytT[0] == 0) break;
                        }

                        var path = new string(filename, 0, i);

                        // Next part is the source path of the embedded object.
                        // Length is unknown. Hence, loop through each byte to read the 0 terminated string
                        // Read the source path.
                        for (i = 0; i < 260; i++)
                        {
                            pStream.Read(bytT, 1, ulRead);
                            pStream.Seek(0, 1, position);
                            filename[i] = (char) bytT[0];
                            if (bytT[0] == 0) break;
                        }

                        // Unknown 4 bytes
                        pStream.Seek(4, 1, position);

                        // The next 4 bytes give the length of the temporary file path
                        // (Office uses a temporary location to copy the files before inserting to the document).
                        // The length is in little-endian format, so it must be assembled byte by byte.
                        // Each byte is widened to ulong before shifting so a high bit cannot sign-extend.
                        pStream.Read(bytT, 4, ulRead);
                        ulong dwSize = 0;
                        dwSize += (ulong) bytT[3] << 24;
                        dwSize += (ulong) bytT[2] << 16;
                        dwSize += (ulong) bytT[1] << 8;
                        dwSize += bytT[0];

                        // Skip the temporary file path
                        pStream.Seek((long) dwSize, 1, position);

                        // The next four bytes give the size of the actual data in little-endian format.
                        // Each byte is widened to ulong before shifting so a high bit cannot sign-extend.
                        pStream.Read(bytT, 4, ulRead);
                        dwSize = 0;
                        dwSize += (ulong) bytT[3] << 24;
                        dwSize += (ulong) bytT[2] << 16;
                        dwSize += (ulong) bytT[1] << 8;
                        dwSize += bytT[0];

                        // Read the actual file content
                        var byData = new byte[dwSize];
                        pStream.Read(byData, (int) dwSize, ulRead);

                        // Create the file
                        var bWriter = new BinaryWriter(File.Open(Path.Combine(destinationFolder, GetFileName(path)),
                            FileMode.Create));
                        bWriter.Write(byData);
                        bWriter.Close();
                    }
                }
            } while (numReturned > 0);

            Marshal.ReleaseComObject(pEnumStatStg);
        }

        private static string GetFileName(string filePath)
        {
            return Regex.Replace(filePath, @"^.*[\\]", "");
        }
    }
}
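
To make the stream layout that ProcessPackage walks through easier to follow, here is a rough managed sketch that parses the same fields from a byte array instead of a COM IStream. It is an illustration under assumptions, not the original Microsoft code: the Ole10NativePayload class and its Parse method are hypothetical names, and the sketch assumes you have already obtained the raw bytes of the "Ole10Native" stream by some other means (that step is not shown). Like the code above, it treats the four bytes after the source path as unknown and maps each name byte directly to a char.

```csharp
using System.IO;
using System.Text;

public static class Ole10NativePayload
{
    // Parses the raw bytes of an "Ole10Native" stream using the same field
    // order as ProcessPackage: 6 skipped bytes, file name, source path,
    // 4 unknown bytes, temp path (length-prefixed), data (length-prefixed).
    public static (string FileName, byte[] Data) Parse(byte[] stream)
    {
        using var reader = new BinaryReader(new MemoryStream(stream));

        reader.ReadBytes(6);                          // the file name starts at the 7th byte
        string name = ReadNullTerminated(reader);     // file name of the embedded object
        ReadNullTerminated(reader);                   // source path (not needed)
        reader.ReadBytes(4);                          // unknown 4 bytes

        uint tempPathLength = reader.ReadUInt32();    // BinaryReader reads little-endian
        reader.ReadBytes(checked((int) tempPathLength));

        uint dataLength = reader.ReadUInt32();
        byte[] data = reader.ReadBytes(checked((int) dataLength));
        if (data.Length != dataLength)
            throw new InvalidDataException("Ole10Native payload is truncated.");

        // Keep only the part after the last backslash, like GetFileName above.
        string fileName = name.Substring(name.LastIndexOf('\\') + 1);
        return (fileName, data);
    }

    // Reads single-byte characters up to (and consuming) the terminating zero byte.
    private static string ReadNullTerminated(BinaryReader reader)
    {
        var builder = new StringBuilder();
        byte b;
        while ((b = reader.ReadByte()) != 0)
        {
            builder.Append((char) b);
        }
        return builder.ToString();
    }
}
```

The returned Data array can then be written with File.WriteAllBytes under the returned file name. Reading the lengths with ReadUInt32 also avoids assembling the little-endian values by hand.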

You can find the full source code (including the Ole10Native class) in my CodeSnippets GitHub repository.
