Extracting text from an MS Word document uploaded through FileUpload from ipywidgets in Jupyter Notebook
Problem description
I am trying to let a user upload an MS Word file and then run a function that takes a string as its input argument. I am uploading the Word file through FileUpload, but I get an encoded object. I am unable to decode it as UTF-8 bytes, and upload.value or upload.data just returns the encoded text.
Any ideas how I can extract the content from the uploaded Word file?
> upload = widgets.FileUpload()
> upload
#I select the file I want to upload
> upload.value #Returns coded text
> upload.data #Returns coded text
> #Previously upload['content'] worked, but I read this no longer works in IPYWidgets 8.0
Recommended answer
Modern MS Word files (.docx) are actually zip files.
The text (but not the page headers) is stored inside an XML document called word/document.xml in the zip file.
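To illustrate the zip structure described above, here is a minimal sketch that uses only the Python standard library to open the archive and pull paragraph text out of word/document.xml. The function name `paragraphs_from_docx_bytes` is hypothetical, and the sketch assumes the text sits in `w:t` elements inside `w:p` elements in the standard WordprocessingML namespace. It does not replace a proper library, since it ignores headers, footnotes and formatting.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx_bytes(data):
    """Return a list with the text of each <w:p> paragraph (hypothetical helper)."""
    # A .docx file is a zip archive, so it can be opened from memory.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # The visible text of a paragraph is split over <w:t> elements.
        runs = [node.text or "" for node in paragraph.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs
```

Because this walks every `w:p` element in the file, paragraphs nested inside table cells are included as well.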
The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can also read existing ones. For example:
>>> import docx
>>> doc = docx.Document('grokonez.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...
Note that this only extracts the text from paragraphs, not, for example, the text from tables.
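If the table text is needed too, a sketch along the following lines may help. It relies on the `paragraphs`, `tables`, `rows`, `cells` and `text` attributes that python-docx exposes on a document. The function name `iter_document_text` is hypothetical. Paragraphs are emitted before tables, so the original reading order is not preserved.

```python
def iter_document_text(doc):
    """Yield paragraph text, then the text of every table cell.

    `doc` is expected to be a python-docx Document, for example the
    result of docx.Document('grokonez.docx').
    """
    # Top-level paragraphs, as in the loop shown earlier.
    for paragraph in doc.paragraphs:
        yield paragraph.text
    # Tables are separate objects; each cell has its own text.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                yield cell.text
```

Joining the result, e.g. `"\n".join(iter_document_text(doc))`, gives a single string that can be passed to a function expecting text.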
Edit:
I want to be able to upload the MS file through the FileUpload widget.
There are a couple of ways you can do that.
First, isolate the actual file data. In ipywidgets 7.x, upload.data is a list holding the raw bytes of each uploaded file. So do something like:
rawdata = upload.data[0]
(Note that this format has changed between versions of ipywidgets. The example above matched the latest documentation when this answer was written. Read the documentation for the version you have installed, or inspect the data in IPython, and adjust accordingly.)
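As a rough illustration of coping with that version difference, the sketch below reads the first uploaded file from `upload.value` in either layout. It assumes that ipywidgets 7.x stores a dict mapping file names to entries with a `content` key, and that ipywidgets 8.x stores a tuple of entries whose `content` is a memoryview. The helper name `first_upload_bytes` is hypothetical, so check the layout against your installed version.

```python
def first_upload_bytes(upload):
    """Return the raw bytes of the first uploaded file (hypothetical helper)."""
    value = upload.value
    if not value:
        raise ValueError("no file has been uploaded yet")
    if isinstance(value, dict):
        # Assumed ipywidgets 7.x layout: {filename: {"metadata": ..., "content": bytes}}
        entry = next(iter(value.values()))
    else:
        # Assumed ipywidgets 8.x layout: a tuple of entries, one per file
        entry = value[0]
    # bytes() accepts both bytes and memoryview objects.
    return bytes(entry["content"])
```

The result can be used in place of `rawdata` in the steps that follow.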
- Write rawdata to a file, e.g. foo.docx, and open that. That would certainly work, but it does seem somewhat inelegant.
- docx.Document can work with file-like objects. So you could create an io.BytesIO object and use that.
Like this:
import io
import docx
foo = io.BytesIO(rawdata)  # wrap the uploaded bytes in a file-like object
doc = docx.Document(foo)
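For comparison, the first option (writing the bytes to a file) could look roughly like the sketch below, which uses a temporary file rather than a fixed name such as foo.docx. The helper name `save_upload_to_temp_docx` is hypothetical, and the caller is responsible for deleting the file afterwards.

```python
import os
import tempfile


def save_upload_to_temp_docx(rawdata):
    """Write the uploaded bytes to a temporary .docx file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so that it can be reopened by path.
    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as handle:
        handle.write(rawdata)
        return handle.name
```

The returned path can then be passed to docx.Document, and removed with `os.remove(path)` once the text has been extracted.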