Have to extract data from a Word file


Question

I have a peculiar problem: I have to extract information from a Word file. Say, for example, I have a resume and need to extract the name, email address, phone number, address, university, experience, etc.

Each person's resume may be in a different format. So is there any way I can programmatically extract the information I need?

I need this information to fill in a registration form.

Recommended answer

Convert the Word document to HTML with Aspose for .NET.
Then you can use regular expressions to search the converted Word (and/or PDF) documents.
Or you can use HTMLAgilityPack to parse the generated HTML documents and search for specific sections/paths.
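
The regular-expression route can be sketched roughly as follows: reduce the converted HTML to plain text first, so that patterns are not split by markup. This is a minimal sketch using only the .NET base class library; the class name `HtmlTextSketch` and the method `ToPlainText` are hypothetical, and stripping tags with a regex is a crude shortcut compared with a real HTML parser such as HTMLAgilityPack.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Aspose or HTMLAgilityPack.
public static class HtmlTextSketch
{
    // Reduces an HTML string to whitespace-normalized plain text.
    public static string ToPlainText(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string noScripts = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so that text from
        // adjacent paragraphs or table cells does not run together.
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", " ");

        // Decode entities such as &amp; or &#64; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

The resulting string can then be searched with field-specific patterns. Layout information (headings, table structure) is lost in this step, which is one reason to prefer parsing the HTML when the position of a field matters.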

PS:
If your regex for email is shorter than one page, then the regex is incorrect.
Phone numbers should be manageable, as long as you only have to support one country.
As for name and address, good luck with that.
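
For filling in a registration form, a pragmatic compromise is to search for candidates and let the user confirm them, rather than to validate strictly. The sketch below takes that approach: the email pattern is deliberately loose and does not implement the RFC grammar, and the phone pattern assumes North American style numbers (3-3-4 digits with an optional leading 1). The source names no country, so that format is purely an assumption. `ResumeFieldSketch` and its methods are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; finds candidates, does not validate.
public static class ResumeFieldSketch
{
    // Loose pattern: local part, '@', then at least two dot-separated labels.
    // Many valid addresses per the RFC grammar will NOT match this.
    private static readonly Regex EmailCandidate = new Regex(
        @"[A-Za-z0-9._%+\-]+@[A-Za-z0-9\-]+(\.[A-Za-z0-9\-]+)+");

    // Assumed single-country format: optional +1, then 3-3-4 digits with
    // optional parentheses and space, dot, or hyphen separators.
    private static readonly Regex PhoneCandidate = new Regex(
        @"(?<!\d)(\+?1[\s.\-]?)?\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]?\d{4}(?!\d)");

    public static List<string> FindEmailCandidates(string text)
    {
        return FindAll(EmailCandidate, text);
    }

    public static List<string> FindPhoneCandidates(string text)
    {
        return FindAll(PhoneCandidate, text);
    }

    // Collects every match of the pattern in document order.
    private static List<string> FindAll(Regex pattern, string text)
    {
        var results = new List<string>();
        foreach (Match m in pattern.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Names and postal addresses have no comparable pattern; for those, heuristics based on position in the document or manual confirmation are more realistic than a regex.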


Like this:

VB.NET:

Dim doc As New Aspose.Words.Document("filename.docORdocx")
doc.Save("filename.html", Aspose.Words.SaveFormat.Html)

C#:

Aspose.Words.Document doc = new Aspose.Words.Document("filename.docORdocx");
doc.Save("filename.html", Aspose.Words.SaveFormat.Html);

The component is here:
http://www.aspose.com/.net/word-component.aspx

To find out what a valid email address is, read RFC 822 (since superseded by RFC 2822 and RFC 5322):
http://www.faqs.org/rfcs/rfc822.html
