Extract pictures from Word and Excel with Python

Question

I was looking for a way to pull the pictures out of these file types, and this is the solution I came up with. It goes through the files in a given directory, copies every file with a matching extension (.docx or .xlsx), and renames the copy to filename.zip. It then walks the zip structure, extracts every file with a picture extension, and renames each one to the original file name plus a number for uniqueness. Finally, it deletes the extracted directory trees it created.

Extracting pictures from text documents is part of my job, so this will actually save my company thousands of hours in the long run.

All of the code is below. What I'm really asking is: Is there a better way? Is there something more efficient? Can it be extended to include other formats? Could the text also be extracted into a .txt file, given the loading times of Word versus Notepad?

This solution works on my Linux machine and extracts the pictures, but I have yet to test it on a Windows system.

#!/usr/bin/python3

import shutil
import os
import zipfile

def zipDoc(aFile,dirPath):
    shortFN = os.path.splitext(aFile)[0] # name of the file before the last . (safe for names with several dots)
    zipName = os.path.join(dirPath, shortFN + ".zip") # name and path of the file, only .zip
    shutil.copy2(os.path.join(dirPath, aFile), zipName) # copies all data from original into the .zip copy
    useZIP = zipfile.ZipFile(zipName) # the usable zip file
    return useZIP # returns the opened zip file

def hasPicExtension(aFile): # if a file ends in a typical picture file extension, returns true
    picEndings = (".jpeg", ".jpg", ".png", ".bmp") # tuple of photo extensions, because .endswith accepts a tuple
    return aFile.lower().endswith(picEndings) # lower() makes the check case-insensitive

def delDOCXEvidence(somePath): # removes the .docx file structures generated
    # os.path.join picks the right separator, so this works on Linux and Windows
    for sub in (os.path.join("word", "media"), "word"): # innermost directory first
        target = os.path.join(somePath, sub)
        if os.path.isdir(target): # a file without pictures never creates these
            os.rmdir(target) # removes the (now empty) directory

def delXLSXEvidence(somePath): # removes the .xlsx file structures generated
    # os.path.join picks the right separator, so this works on Linux and Windows
    for sub in (os.path.join("xl", "media"), "xl"): # innermost directory first
        target = os.path.join(somePath, sub)
        if os.path.isdir(target): # a file without pictures never creates these
            os.rmdir(target) # removes the (now empty) directory

def extractPicsFromDir(dirPath=""):
# when given a directory path, will extract all images from all .docx and .xlsx file types
    if os.path.isdir(dirPath): # if the given path is a directory
        dirPath = os.path.join(dirPath, "") # makes sure the path ends with a separator
        for dirFile in os.listdir(dirPath): # loops through all files in the directory
            dirFileName = os.fsdecode(dirFile) # decodes the file name to str
            if dirFileName.endswith((".docx", ".xlsx")):
                baseName = os.path.splitext(dirFileName)[0] # file name without the extension
                useZIP = zipDoc(dirFile,dirPath) # copies it to a .zip and opens it
                picNum = 1 # running number for the pictures in this file
                for zippedFile in useZIP.namelist(): # loops through all entries in the archive
                    if hasPicExtension(zippedFile): # if it ends with a picture extension
                        useZIP.extract(zippedFile, path=dirPath) # extracts the picture to the path + word/media/ or xl/media/
                        picExt = os.path.splitext(zippedFile)[1] # keeps the picture's own extension
                        shutil.move(os.path.join(dirPath, zippedFile), os.path.join(dirPath, baseName + " - " + str(picNum) + picExt)) # moves the picture out
                        picNum += 1
                useZIP.close() # closes the archive so it can be deleted
                os.remove(useZIP.filename) # removes zip file
                if dirFileName.endswith(".docx"):
                    delDOCXEvidence(dirPath) # removes the extracted word/ structure
                else:
                    delXLSXEvidence(dirPath) # removes the extracted xl/ structure
                # no evidence
    else:
        print("Not a directory path!")
        raise SystemExit(1)


uDir = input("Enter your directory: ")
extractPicsFromDir(uDir)

Recommended answer

Excel (.xlsx) files, like .docx files, are zip archives, so it is easy to extract images from either:

import io
import os
import zipfile
from PIL import Image

def redact_images(filename,FilePath):
    root, ext = os.path.splitext(filename)
    outfile = root + "_redacted" + ext # never the same path as the input, for .xlsx and .docx alike
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                content = inzip.read(info)
                if name.lower().endswith((".png", ".jpeg", ".jpg", ".gif")):
                    Name = name.split("/")[-1] # file name without the archive folders
                    img = Image.open(io.BytesIO(content))
                    fmt = img.format # format detected by Pillow, e.g. "PNG" or "JPEG"
                    img.save(os.path.join(FilePath, Name)) # saves the extracted image
                    outb = io.BytesIO()
                    img.save(outb, fmt) # re-encodes the image for the output archive
                    content = outb.getvalue()
                    info.file_size = len(content)
                    info.CRC = zipfile.crc32(content)
                outzip.writestr(info, content)

filename : Path of the input Excel (or Word) file

FilePath : Directory in which to save the extracted images
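
Since zipfile can open a .docx or .xlsx file as it is, the copy-and-rename step and the cleanup of extracted folders can be skipped when only the pictures are wanted. The following is a minimal sketch of that idea using only the standard library, not the code from the question or the answer above. The names extract_pictures, PIC_EXTENSIONS, doc_path and out_dir are made up for illustration, and the "name - number" output naming is assumed from the question.

```python
import os
import zipfile

# Hypothetical list of picture extensions; adjust as needed.
PIC_EXTENSIONS = (".jpeg", ".jpg", ".png", ".bmp", ".gif")

def extract_pictures(doc_path, out_dir):
    """Copy every picture inside a .docx/.xlsx archive into out_dir.

    Returns the list of paths that were written.
    """
    base = os.path.splitext(os.path.basename(doc_path))[0]
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    # The document is read in place; nothing is copied or renamed.
    with zipfile.ZipFile(doc_path) as archive:
        for name in archive.namelist():
            ext = os.path.splitext(name)[1].lower()
            if ext in PIC_EXTENSIONS:
                # "<document name> - <number><extension>", as in the question
                target = os.path.join(out_dir, f"{base} - {len(saved) + 1}{ext}")
                with open(target, "wb") as out:
                    out.write(archive.read(name))  # raw bytes, no re-encoding
                saved.append(target)
    return saved
```

Because the bytes are written straight from the archive, this needs no Pillow and leaves no word/ or xl/ folders behind. Calling extract_pictures("report.docx", "pictures") would write files such as "pictures/report - 1.png", assuming report.docx contains at least one PNG.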
