Extracting text from an MS Word document uploaded through FileUpload from ipywidgets in Jupyter Notebook

Question

I am trying to let a user upload an MS Word file and then run a function that takes a string as its input argument. I upload the Word file through FileUpload, but what I get back is an encoded object. I cannot decode it as UTF-8 bytes, and upload.value or upload.data just return the encoded content.

Any ideas how I can extract the content from the uploaded Word file?

    import ipywidgets as widgets

    upload = widgets.FileUpload()
    upload
    # I select the file I want to upload
    upload.value  # Returns coded text
    upload.data   # Returns coded text

    # Previously upload['content'] worked, but I read this no longer works in ipywidgets 8.0

Recommended answer

Modern MS Word files (.docx) are actually zip files.

The text (but not the page headers) is stored in an XML document called word/document.xml inside the zip file.
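To make the zip-plus-XML layout concrete, here is a minimal sketch that reads word/document.xml with only the standard library. The function name docx_paragraphs is made up for this illustration, and the sketch assumes the usual WordprocessingML namespace; it ignores tabs, line breaks and other non-text elements, and it also picks up paragraphs that sit inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(data):
    """Return the text of each paragraph in a .docx file given as bytes.

    Hypothetical helper for illustration; not part of any library.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across <w:t> elements inside its runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

This is mainly useful for seeing what is inside the file; for real documents a dedicated library is usually the more robust choice.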

The python-docx module can be used to extract text from these documents. It is mainly used for creating documents, but it can also read existing ones. For example:

>>> import docx
>>> doc = docx.Document('grokonez.docx')

>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...

Note that this only extracts the text from paragraphs. It does not, for example, extract the text from tables.
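If the table text is needed as well, one possible approach is to walk doc.tables in addition to doc.paragraphs. The sketch below assumes a python-docx Document object; collect_text is a hypothetical helper name, not something the library provides. It returns a single string, which fits a function that expects a string as input.

```python
def collect_text(doc):
    """Join paragraph text and table-cell text from a python-docx Document.

    Hypothetical helper for illustration. Body paragraphs come first and
    table cells afterwards, so the original document order is not preserved.
    """
    parts = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                parts.append(cell.text)
    return "\n".join(parts)
```

It would be called as collect_text(docx.Document('grokonez.docx')). Merged cells may show up more than once, so treat the result as a rough dump rather than a faithful rendering.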

Edit:

"I want to be able to upload the MS file through the FileUpload widget."

There are a couple of ways you can do that.

First, isolate the actual file data. In the ipywidgets 7.x documentation, upload.data is a list holding the raw bytes of each uploaded file, so do something like:

rawdata = upload.data[0]

(Note that this format has changed between versions of ipywidgets. The example above follows the documentation that was current when this answer was written. Read the documentation for the version you have installed, or inspect the data in IPython, and adjust accordingly.)
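As a rough sketch of such an adjustment, the helper below tries to return the bytes of the first uploaded file regardless of version. It rests on two assumptions that should be checked against the installed version: that in ipywidgets 7.x upload.value is a dict keyed by filename whose entries have a "content" bytes field, and that in ipywidgets 8.x it is a tuple of per-file dict-like entries whose "content" is a memoryview. The name first_upload_bytes is hypothetical.

```python
def first_upload_bytes(upload):
    """Return the raw bytes of the first file held by a FileUpload widget.

    Hypothetical helper; the two layouts handled here are assumptions
    about ipywidgets 7.x and 8.x and should be verified locally.
    """
    value = upload.value
    if not value:
        raise ValueError("no file has been uploaded yet")
    if isinstance(value, dict):
        # Assumed 7.x layout: {filename: {"metadata": {...}, "content": b"..."}}
        entry = next(iter(value.values()))
    else:
        # Assumed 8.x layout: a tuple of per-file entries with a memoryview
        entry = value[0]
    # bytes() accepts both a bytes object and a memoryview.
    return bytes(entry["content"])
```

With this, rawdata = first_upload_bytes(upload) could stand in for the version-specific line above.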

Then you can either:

  1. Write rawdata to a file, e.g. foo.docx, and open that. That would certainly work, but it seems somewhat inelegant.
  2. Use a file-like object. docx.Document can work with those, so you could create an io.BytesIO object and pass it in.

Like this:

import io

foo = io.BytesIO(rawdata)
doc = docx.Document(foo)
