Extracting bold text from resumes (.docx, .doc, PDF) using Python

Views: 64 Published: 2021/9/6 19:32:21 Tags: python text-extraction

Question

I have thousands of resumes in various formats: Word (.doc and .docx) and PDF.

I want to extract the bold text from these documents using the textract library in Python. Is there a way to do this with textract?

Recommended answer

An easy solution would be to use the python-docx package. Install it with pip install python-docx (prefix the command with ! in a notebook).

You'll need to convert your PDF files to .docx first. You can do that with any online PDF-to-.docx converter, or with Python.
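
Legacy .doc files need the same convert-first step, because python-docx only opens .docx. Here is a minimal sketch of one way to do that, assuming LibreOffice is installed and its soffice executable is on the PATH. The function names build_convert_command and convert_doc_to_docx are made up for illustration and are not part of any library.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_doc_to_docx(source, out_dir, soffice="soffice"):
    """Convert `source` and return the expected path of the new .docx file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: `soffice` resolves to a working LibreOffice install.
    subprocess.run(build_convert_command(source, out_dir, soffice), check=True)
    return out_dir / (Path(source).stem + ".docx")
```

Calling convert_doc_to_docx('resume.doc', 'converted') should leave converted/resume.docx on disk. Conversion can alter formatting, so it is worth spot-checking a few converted resumes before running the whole batch.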

The following lines of code extract all bold and italic content from a resume and save it in a dictionary called boltalic_Dict. You can retrieve either list later on.

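from docx import Document  # provided by the python-docx package

# Document() opens a single .docx file; convert .doc and PDF files first.
document = Document('path_to_your_file.docx')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.bold and run.italic are True, False, or None (inherited from
        # the style), so only directly applied formatting is matched here.
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}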
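
To see what that bold check amounts to underneath, note that a .docx file is a zip archive whose main text lives in word/document.xml, and a bold run carries a w:b element in its run properties. The sketch below reads that XML with only the standard library. It is an illustration, not how python-docx is implemented; the helper names _is_on and extract_bold_runs are hypothetical, and it only sees bold applied directly to a run, not bold inherited from a style.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run, tag):
    """True if the run's direct properties switch `tag` (e.g. 'b') on."""
    prop = run.find(f"{W}rPr/{W}{tag}")
    if prop is None:
        return False
    # <w:b/> means on; <w:b w:val="0"/> (or "false"/"off") means off.
    return prop.get(f"{W}val", "true") not in ("0", "false", "off")


def extract_bold_runs(docx_path):
    """Return the text of every directly bold run in the document body."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    bolds = []
    for run in root.iter(f"{W}r"):
        if _is_on(run, "b"):
            text = "".join(t.text or "" for t in run.iter(f"{W}t"))
            if text:
                bolds.append(text)
    return bolds
```

Because this walks every run in the body, it also picks up bold text inside tables, which document.paragraphs does not visit. That can matter for resumes laid out in table columns.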

I hope this helps.
