How to download embedded PDF files in an Excel worksheet?

Question

I have a program that parses Excel data (using GemBox). Some of the worksheets contain embedded PDF files that I would like to download/extract, but I have not been able to find a way to detect and download these objects. Can anyone point me in the right direction on how this is achieved? I know Microsoft has an Office document extractor that reads Excel files, but it only detects Office files such as Word documents.

I'm not asking anyone to do my work for me and write the code. I'm just lost here; it seems like a pretty complex process.

Recommended answer

Update (2020-03-28)

Newer versions of GemBox.Spreadsheet support ExcelWorksheet.EmbeddedObjects.

So you can now use the following:

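using System;
using System.IO;
using System.Text;
using GemBox.Spreadsheet;

// Load the workbook and pick the first worksheet.
var workbook = ExcelFile.Load("input.xlsx");
var worksheet = workbook.Worksheets[0];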

var embeddedObjects = worksheet.EmbeddedObjects;

for (int index = 0; index < embeddedObjects.Count; index++)
{
    ExcelEmbeddedObject embeddedObject = embeddedObjects[index];
    if (embeddedObject.ContentType != "application/vnd.openxmlformats-officedocument.oleObject")
        continue;

    byte[] embeddedBytes;
    using (var memoryStream = new MemoryStream())
    {
        embeddedObject.Data.CopyTo(memoryStream);
        embeddedBytes = memoryStream.ToArray();
    }

    string embeddedContent = Encoding.ASCII.GetString(embeddedBytes);
    int pdfHeaderIndex = embeddedContent.IndexOf("%PDF");
    if (pdfHeaderIndex < 0)
        continue;

    byte[] pdfBytes = new byte[embeddedBytes.Length - pdfHeaderIndex];
    Array.Copy(embeddedBytes, pdfHeaderIndex, pdfBytes, 0, pdfBytes.Length);

    File.WriteAllBytes($"embedded-pdf-{index}.pdf", pdfBytes);
}
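
The loop above decodes the whole payload into an ASCII string just to find the "%PDF" marker, and it keeps every byte after the header, including any OLE padding that follows the PDF data. A minimal sketch of the same search done directly on the bytes, also cutting at the last "%%EOF" marker as the original answer's code does, might look like this. PdfSlicer, TrySlicePdf and IndexOf are hypothetical helper names used for illustration; they are not part of GemBox.Spreadsheet.

```csharp
using System;
using System.Text;

static class PdfSlicer
{
    private static readonly byte[] Header = Encoding.ASCII.GetBytes("%PDF");
    private static readonly byte[] Trailer = Encoding.ASCII.GetBytes("%%EOF");

    // Returns the bytes from the first "%PDF" through the last "%%EOF",
    // or null when either marker is missing or they are out of order.
    public static byte[] TrySlicePdf(byte[] data)
    {
        int start = IndexOf(data, Header, fromEnd: false);
        if (start < 0) return null;

        int end = IndexOf(data, Trailer, fromEnd: true);
        if (end < start) return null;

        var pdf = new byte[end + Trailer.Length - start];
        Array.Copy(data, start, pdf, 0, pdf.Length);
        return pdf;
    }

    // Naive byte-pattern search; scans forward or backward through data.
    private static int IndexOf(byte[] data, byte[] pattern, bool fromEnd)
    {
        int last = data.Length - pattern.Length;
        for (int n = 0; n <= last; n++)
        {
            int i = fromEnd ? last - n : n;
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

Inside the loop, the result of PdfSlicer.TrySlicePdf(embeddedBytes) could be checked for null and then passed to File.WriteAllBytes in place of pdfBytes.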

Original

GemBox.Spreadsheet currently does not support this, but you can achieve what you need with the System.IO.Packaging namespace in the WindowsBase.dll assembly.

Try the following code sample:

using System;
using System.IO;
using System.IO.Packaging;
using System.Text;

static class PdfExtractor
{
    public static void ExtractPdf(string packagePath, string destinationDirectory)
    {
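        // Open read-only; the default overload opens with FileMode.OpenOrCreate and read/write access.
        using (var package = Package.Open(packagePath, FileMode.Open, FileAccess.Read))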
        {
            int i = 1;
            foreach (var part in package.GetParts())
                if (part.ContentType == "application/vnd.openxmlformats-officedocument.oleObject")
                {
                    // PDF data is embedded into OLE Object package part.

                    var pdfContent = GetPdfContent(part.GetStream());
                    if (pdfContent != null)
                        File.WriteAllBytes(Path.Combine(destinationDirectory, "EmbeddedPdf" + (i++) + ".pdf"), pdfContent);
                }
        }
    }

    private static byte[] GetPdfContent(Stream stream)
    {
        // Every PDF file/data starts with '%PDF' and ends with '%%EOF'.
        const string pdfStart = "%PDF", pdfEnd = "%%EOF";

        byte[] bytes = ConvertStreamToArray(stream);

        string text = Encoding.ASCII.GetString(bytes);

        int startIndex = text.IndexOf(pdfStart, StringComparison.Ordinal);
        if (startIndex < 0)
            return null;

        int endIndex = text.LastIndexOf(pdfEnd, StringComparison.Ordinal);
        if (endIndex < 0)
            return null;

        var pdfBytes = new byte[endIndex + pdfEnd.Length - startIndex];
        Array.Copy(bytes, startIndex, pdfBytes, 0, pdfBytes.Length);

        return pdfBytes;
    }

    private static byte[] ConvertStreamToArray(Stream stream)
    {
        var buffer = new byte[16 * 1024];
        using (var ms = new MemoryStream())
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                ms.Write(buffer, 0, read);

            return ms.ToArray();
        }
    }
}
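
Where System.IO.Packaging is not readily available, a different technique can be swapped in: an .xlsx file is a ZIP archive, so its entries can be read with System.IO.Compression instead. The sketch below assumes the embedded objects sit under the "xl/embeddings/" folder, which is where Excel conventionally writes them but is not guaranteed for every producer. ZipPdfExtractor and ExtractPdfs are hypothetical names, and the marker search is the same heuristic as in GetPdfContent above.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class ZipPdfExtractor
{
    // Writes every PDF found in the workbook's embedded objects to
    // destinationDirectory and returns how many files were written.
    public static int ExtractPdfs(string xlsxPath, string destinationDirectory)
    {
        int count = 0;
        using (ZipArchive archive = ZipFile.OpenRead(xlsxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                // Assumption: embedded objects are stored under xl/embeddings/.
                if (!entry.FullName.StartsWith("xl/embeddings/", StringComparison.OrdinalIgnoreCase))
                    continue;

                byte[] bytes;
                using (Stream stream = entry.Open())
                using (var ms = new MemoryStream())
                {
                    stream.CopyTo(ms);
                    bytes = ms.ToArray();
                }

                // ASCII decoding maps one byte to one char, so string
                // indexes can be reused as byte offsets.
                string text = Encoding.ASCII.GetString(bytes);
                int start = text.IndexOf("%PDF", StringComparison.Ordinal);
                int end = text.LastIndexOf("%%EOF", StringComparison.Ordinal);
                if (start < 0 || end < start)
                    continue;

                var pdf = new byte[end + "%%EOF".Length - start];
                Array.Copy(bytes, start, pdf, 0, pdf.Length);

                count++;
                File.WriteAllBytes(Path.Combine(destinationDirectory, "EmbeddedPdf" + count + ".pdf"), pdf);
            }
        }
        return count;
    }
}
```

Like the Packaging-based version, this only applies to the ZIP-based .xlsx format, not to the older binary .xls format.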
