How to extract embedded PDFs from a Word document on Linux (Mac)


Problem description

I ran into this problem on a Mac too, and wanted to share my solution: a bash script that needs no additional applications.

Recommended answer

This script extracts all the PDF files embedded in a Word document.

Put the script file in the same folder as your word.docx file, make it executable (chmod +x extract_docx_objects.sh), and run it like this:

./extract_docx_objects.sh word.docx

The extracted files will be in the subfolder docx_zip/word/embeddings/.

The code:

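#!/bin/bash
# Usage: ./extract_docx_objects.sh word.docx
docx=$1
echo "$docx"
# Start from a clean working folder (this deletes any existing docx_zip).
rm -rf docx_zip
mkdir -p docx_zip
# A .docx file is a zip archive, so copy it and unpack it.
cp "$docx" docx_zip/temp.zip
cd docx_zip/ || exit 1
unzip temp.zip
cd word/embeddings/ || exit 1
ls -la ./*.bin
for f in *.bin
do
    [ -e "$f" ] || continue
    echo "processing $f..."
    fname=${f%.*}
    # Byte offset of the first "%PDF" marker and of the last "%%EOF" marker.
    # This relies on grep -a -b -o printing the byte offset of each match.
    start=$(LC_ALL=C grep -abo '%PDF' "$f" | head -n 1 | cut -d: -f1)
    end=$(LC_ALL=C grep -abo '%%EOF' "$f" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "no PDF found in $f, skipping"
        continue
    fi
    # Copy everything from "%PDF" through the end of "%%EOF" (5 bytes).
    count=$((end + 5 - start))
    dd if="$f" of="$fname.pdf" bs=1 skip="$start" count="$count" 2>/dev/null
done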
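
After the script finishes, it can be worth confirming that each extracted file really begins with the PDF header. The following is a minimal sketch of such a check; is_pdf is a hypothetical helper name that is not part of the script above, and the loop assumes it is run from the folder where the script was run.

```shell
# is_pdf is a hypothetical helper: it succeeds if the file starts with "%PDF".
is_pdf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

for f in docx_zip/word/embeddings/*.pdf
do
    [ -e "$f" ] || continue   # no matches: the glob stays literal, so skip it
    if is_pdf "$f"; then
        echo "$f: looks like a PDF"
    else
        echo "$f: missing %PDF header"
    fi
done
```

This only checks the first four bytes, so it catches wrong start offsets but not a truncated end of file.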

You can add a check for whether the docx_zip folder already exists before deleting it (I didn't do that here).
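
One way such a check could look is sketched below. prepare_workdir is a hypothetical helper name, not part of the original script; it assumes you would rather stop than silently delete an existing folder.

```shell
# prepare_workdir is a hypothetical helper, not part of the original script.
# It creates the working folder, but refuses to touch one that already exists.
prepare_workdir() {
    dir=$1
    if [ -e "$dir" ]; then
        echo "$dir already exists; move or delete it first" >&2
        return 1
    fi
    mkdir -p "$dir"
}
```

In the script, a call such as prepare_workdir docx_zip || exit 1 would then take the place of the rm -rf and mkdir lines.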

Enjoy!

[INFO]

If you need a VBA macro on Windows to do the same, here's my solution:

This is a partial solution in VBA, and it needs some preparation before you can run it:

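  1. Copy/rename your Word.docx file to Word.zip.
  2. Using your zip software, unzip Word.zip (into the same folder or another one).
  3. Run the VBA macro from any Word document; it will ask you where the unzipped folder is.
  4. When it finishes, the PDF files will be in the Word/word/embeddings subfolder.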

The VBA macro:

Sub export_PDFs()
    Dim Contents As String
    Dim PDF As String
    Dim hFile As Integer
    Dim i As Long, j As Long
    Dim ExtractedZippedDocxFolder As String, FileNameBin As String
    Dim FileNamePDF As String, BinFolderPath As String, FileNameIndex As String
    Dim fileIndex As Integer
    Dim objFSO As Object, objFolder As Object, objFile As Object

    ' Ask for the folder that the docx file was unzipped into.
    Dim dlgOpen As FileDialog
    Set dlgOpen = Application.FileDialog( _
    FileDialogType:=msoFileDialogFolderPicker)
    With dlgOpen
        .AllowMultiSelect = False
        .Title = "Select the unzipped docx folder to extract PDF file(s) from"
        .InitialFileName = "*.docx"
        ' .Show returns 0 when the dialog is cancelled.
        If .Show = 0 Then Exit Sub
    End With
    ExtractedZippedDocxFolder = dlgOpen.SelectedItems.Item(1)
    BinFolderPath = ExtractedZippedDocxFolder + "\word\embeddings"
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If Not objFSO.FolderExists(BinFolderPath) Then
        MsgBox "Unable to find the folder " + BinFolderPath
        Exit Sub
    End If
    Set objFolder = objFSO.GetFolder(BinFolderPath)
    fileIndex = 0

    For Each objFile In objFolder.Files
        If LCase$(Right$(objFile.Name, 4)) = ".bin" Then
            FileNameIndex = Left$(objFile.Name, Len(objFile.Name) - Len(".bin"))
            FileNameBin = BinFolderPath + "\" + FileNameIndex + ".bin"
            FileNamePDF = BinFolderPath + "\" + FileNameIndex + ".pdf"

            ' Read the whole .bin file into a string, one character per byte
            ' (this assumes a single-byte ANSI code page).
            hFile = FreeFile
            Open FileNameBin For Binary Access Read As #hFile
            Contents = String(LOF(hFile), vbNullChar)
            Get #hFile, , Contents
            Close #hFile

            ' Character positions of the first "%PDF" and the last "%%EOF".
            i = InStr(1, Contents, "%PDF")
            j = InStrRev(Contents, "%%EOF")

            If i > 0 And j > i Then
                ' Keep everything from "%PDF" through the end of "%%EOF".
                PDF = Mid$(Contents, i, j + Len("%%EOF") - i)

                hFile = FreeFile
                Open FileNamePDF For Binary Access Write As #hFile
                Put #hFile, , PDF
                Close #hFile
                fileIndex = fileIndex + 1
            End If
        End If
    Next
    If fileIndex = 0 Then
        MsgBox "Unable to find any embedded PDF in the given unzipped docx folder"
    Else
        MsgBox CStr(fileIndex) + " files were processed"
    End If

End Sub
   
