Text Processing Help - Regex/NLP/Content Search?


Problem description



Hi,
I have a large number of files, and I want to find the particular sentences or paragraphs in them that contain specific information.
I know that this is not a small problem, but I would like some references to material that can help me proceed in the right direction.

For example: I have a large number of text files containing press releases from many companies announcing their new products. I would like to extract the sentences that give the release date and the name of the product.

I have tried using regular expressions, but the patterns become very complex and do not work well at all. I know that this is a complex problem, but any references to websites or video lectures would be very helpful. Even sample projects would be much appreciated.

Recommended answer

As I said, Natural Language Processing is too serious a topic for this forum. Despite apparent progress (automatic translation), the overall state of computer science in this field means the available results can only be called experimental. Even though some technologies have been commercialized, more advanced applications, or even attempts to apply existing technology to more complex languages, produce ridiculous results.

First of all, you need a more sober estimate of what you can achieve. In my opinion, this could easily take your whole lifetime, and it may require working in a team with world-class scientists (in computer science and linguistics) and engineers.

To get some ideas, start here, then follow the links:
http://en.wikipedia.org/wiki/Natural_language_processing

-SA


You should go with Regex. You just need to find a pattern.

For example: C# - find a line (regex) in a file and get the complete block of text according to another regex

Here he is searching for the title block.
So, the pattern should be like title[^!]* as suggested in the answer there (http://stackoverflow.com/a/10225243/1099247).
Quote:

Your Regex should be changed to contain the unknown characters as well, like

  • first title
  • then [^!]* ([^ ] means something not in this set, so [^!]* matches any number of characters other than !)

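// Requires: using System.Text.RegularExpressions;
// Singleline only changes how "." matches; [^!] already matches newlines.
Regex regex = new Regex("title[^!]*", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(text);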
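
To see what the negated character class does in practice, here is a minimal sketch in Python (the snippet above is C#; for a pattern this simple, Python's re module behaves the same way). The sample text is made up for illustration and is not taken from the linked question.

```python
import re

# Hypothetical sample text; "!" marks the end of each block.
text = "title: Widget X\nReleased 2012-05-01!\nfooter\ntitle: Gadget Y!"

# "title" followed by any run of characters that are not "!".
# A negated class such as [^!] also matches newlines, so no DOTALL flag is needed.
pattern = re.compile(r"title[^!]*")

blocks = pattern.findall(text)
for block in blocks:
    print(repr(block))
```

Each match starts at "title" and stops just before the next "!", even when the block spans several lines.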



So, if you can create a pattern that fits your requirement, you can easily extract the portion of the text you need. You just need to identify the start and end characters/strings.
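
The start/end idea can be sketched as a small helper, shown here in Python rather than C#. The function name extract_between, the markers "Product:" and "Release date:", and the sample text are all assumptions made for illustration. Real press releases are rarely this regular, so treat it as a starting point only.

```python
import re

def extract_between(text, start, end):
    """Return every span of text found between the literal markers start and end."""
    # re.escape makes both markers literal; .*? is non-greedy, so each
    # match stops at the nearest end marker. re.DOTALL lets "." cross line breaks.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    return [m.strip() for m in pattern.findall(text)]

# Hypothetical sample in a very regular format.
sample = (
    "Product: Widget X. Release date: 1 May 2012.\n"
    "Product: Gadget Y.\nRelease date: 3 June 2012."
)

print(extract_between(sample, "Product:", "."))
print(extract_between(sample, "Release date:", "."))
```

Using "." as the end marker breaks as soon as a product name or date contains a period (for example "v2.0"), which is the kind of case where plain regular expressions start to struggle.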

