How to extract an embedded PDF from a Word document in Linux (Mac)
Problem description
I ran into this problem on a Mac too, and just wanted to share my solution: a bash script that needs no additional applications.
Recommended answer
This script extracts all the PDF files embedded in a Word document.
Put the script file in the same folder as your word.docx file, make it executable (chmod +x extract_docx_objects.sh), and run it like this:
./extract_docx_objects.sh word.docx
The extracted files will be in the subfolder docx_zip/word/embeddings/.
The code:
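#!/bin/bash
# Usage: ./extract_docx_objects.sh word.docx
docx=$1
echo "$docx"
# A .docx file is a ZIP archive: copy it and unpack it into docx_zip/
rm -rf docx_zip
mkdir -p docx_zip
cp "$docx" docx_zip/temp.zip
cd docx_zip/ || exit 1
unzip temp.zip
cd word/embeddings/ || exit 1
ls -la *.bin
for f in *.bin
do
    echo "processing $f..."
    fname=${f%.*}
    # 0-based byte offset of the first "%PDF" marker
    # (grep -a treats the binary file as text; -b -o print each match's offset)
    start=$(LC_ALL=C grep -abo '%PDF' "$f" | head -n 1 | cut -d: -f1)
    # 0-based byte offset of the last "%%EOF" marker
    end=$(LC_ALL=C grep -abo '%%EOF' "$f" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "no PDF found in $f, skipping"
        continue
    fi
    # Copy from "%PDF" through the end of "%%EOF" (5 bytes)
    dd if="$f" of="$fname.pdf" bs=1 skip="$start" count=$((end + 5 - start))
done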
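A quick way to sanity-check the results is to confirm that each extracted file really looks like a PDF. The sketch below is only an illustration: the helper name is_pdf is hypothetical, and it assumes that a well-formed file starts with "%PDF" and has an "%%EOF" marker within its last 1024 bytes.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file starts with "%PDF" and
# contains an "%%EOF" marker within its last 1024 bytes (an assumption).
is_pdf() {
    head -c 4 "$1" | LC_ALL=C grep -aq '^%PDF' &&
        tail -c 1024 "$1" | LC_ALL=C grep -aq '%%EOF'
}

# Report on every extracted file; does nothing if none exist yet.
for f in docx_zip/word/embeddings/*.pdf; do
    [ -e "$f" ] || continue
    if is_pdf "$f"; then
        echo "ok: $f"
    else
        echo "suspect: $f"
    fi
done
```

A file reported as "suspect" most likely has extra bytes before the header or after the end marker, so the offsets used for the cut are worth checking.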
You can add a check for whether the folder already exists before deleting it (I didn't do that here).
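Such a check could look like the sketch below. The function name prepare_outdir is hypothetical, and refusing to continue is just one possible policy; asking for confirmation before deleting would work as well.

```shell
#!/bin/bash
# Hypothetical guard: create the output folder, but refuse to touch
# one that already exists instead of deleting it silently.
prepare_outdir() {
    local out=$1
    if [ -e "$out" ]; then
        echo "error: $out already exists; remove it or pick another name" >&2
        return 1
    fi
    mkdir -p "$out"
}
```

In the script, "prepare_outdir docx_zip || exit 1" would then take the place of the rm -rf and mkdir -p lines.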
Enjoy!
If you need a VBA macro in Windows to do the same, here's my solution:
It is a partial solution in VBA, and it needs some preparation before you can run it:
- Copy/rename your Word.docx file to Word.zip
- Using your zip software, unzip Word.zip (into the same folder or a different one)
- Run the VBA macro from any Word document; it will ask you for the location of the unzipped folder.
- When it finishes, the PDF files will be in the Word/word/embeddings subfolder.
VBA macro:
Sub export_PDFs()
    ' Extracts the PDF embedded in each .bin file under <unzipped docx>\word\embeddings.
    ' Note: reading binary data into a String assumes a single-byte ANSI code page.
    Dim Contents As String
    Dim PDF As String
    Dim hFile As Integer
    Dim i As Long, j As Long
    Dim ExtractedZippedDocxFolder As String, BinFolderPath As String
    Dim FileNameIndex As String, FileNameBin As String, FileNamePDF As String
    Dim fileIndex As Integer
    Dim objFSO As Object, objFolder As Object, objFile As Object
    Dim dlgOpen As FileDialog

    Set dlgOpen = Application.FileDialog( _
        FileDialogType:=msoFileDialogFolderPicker)
    With dlgOpen
        .AllowMultiSelect = False
        .Title = "Select the unzipped docx folder to extract PDF file(s) from"
        .InitialFileName = "*.docx"
        ' Show returns 0 when the user cancels the dialog
        If .Show = 0 Then Exit Sub
    End With
    ExtractedZippedDocxFolder = dlgOpen.SelectedItems.Item(1)
    BinFolderPath = ExtractedZippedDocxFolder & "\word\embeddings"

    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If Not objFSO.FolderExists(BinFolderPath) Then
        MsgBox "Unable to find " & BinFolderPath
        Exit Sub
    End If
    Set objFolder = objFSO.GetFolder(BinFolderPath)
    fileIndex = 0
    For Each objFile In objFolder.Files
        If LCase$(Right$(objFile.Name, 4)) = ".bin" Then
            FileNameIndex = Left$(objFile.Name, Len(objFile.Name) - Len(".bin"))
            FileNameBin = BinFolderPath & "\" & FileNameIndex & ".bin"
            FileNamePDF = BinFolderPath & "\" & FileNameIndex & ".pdf"
            ' Read the whole .bin file into a string
            hFile = FreeFile
            Open FileNameBin For Binary Access Read As #hFile
            Contents = String(LOF(hFile), vbNullChar)
            Get #hFile, , Contents
            Close #hFile
            ' Character positions of the first "%PDF" and the last "%%EOF"
            i = InStr(1, Contents, "%PDF")
            j = InStrRev(Contents, "%%EOF")
            If i > 0 And j > i Then
                ' Keep everything from "%PDF" through the end of "%%EOF"
                PDF = Mid$(Contents, i, j + Len("%%EOF") - i)
                hFile = FreeFile
                Open FileNamePDF For Binary Access Write As #hFile
                Put #hFile, , PDF
                Close #hFile
                fileIndex = fileIndex + 1
            End If
        End If
    Next
    If fileIndex = 0 Then
        MsgBox "Unable to find any bin file with a PDF in the given unzipped docx folder"
    Else
        MsgBox CStr(fileIndex) & " files were processed"
    End If
End Sub