算法找到一个字符串的关键词和关键短语 [英] Algorithm to find keywords and keyphrases in a string

查看:145
本文介绍了算法找到一个字符串的关键词和关键短语的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要建议或指示如何写一个算法,会发现在一个字符串的关键字或关键短语

I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.

该字符串包含:

  • 用英文写的(GB)技术资料
  • 词用空格隔开大多
  • A 关键字不包含空格,但它可能包含一个连字符,撇号,结肠癌等。
  • A 的关键词可以包含空格,逗号或其他标点符号
  • 如果两个或更多关键字一起出现,然后它可能是一个关键词的如变频驱动器
  • 在文本还包含HTML,但是这可以事先在必要
  • 删除
  • 非关键字将会像和,的,我们,看,看等字样。
  • 关键词是不区分大小写例如逆变器和变频是相同的关键字
  • Technical information written in English (GB)
  • Words are mostly separated by spaces
  • A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
  • A keyphrase may contain a space, a comma or other punctuation
  • If two or more keywords appear together then it is likely a keyphrase e.g. "inverter drive"
  • The text also contains HTML but this can be removed beforehand if necessary
  • Non-keywords would be words like "and", "the", "we", "see", "look" etc.
  • Keywords are case-insensitive e.g. "Inverter" and "inverter" are the same keyword

该算法具有以下要求:

  1. 工作在批处理场景如运行一次或一天两次
  2. 在处理字符串的长度在不同大致 200至7000字
  3. 过程1000串,不到1个小时
  4. 将一台服务器上有适度良好的动力执行
  5. 写在下列之一: C#,VB.NET或T-SQL 甚至是F#,Python或Lua中等等
  6. 靠predefined关键字或关键短语的列表上
  7. 但是能靠关键字排除如名单上和,中,走出去等。
  8. 理想的情况下转让以其他语言如:不依赖于特定语言的功能如元编程
  9. 在输出的关键字句(降序频率)的列表之后的关键字列表(降序频率)
  1. Operate in a batch-processing scenario e.g. run once or twice a day
  2. Process strings varying in length from roughly 200 to 7000 characters
  3. Process 1000 strings in less than 1 hour
  4. Will execute on a server with moderately good power
  5. Written in one of the following: C#, VB.NET, or T-SQL maybe even F#, Python or Lua etc.
  6. Does not rely on a list of predefined keywords or keyphrases
  7. But can rely on a list of keyword exclusions e.g. "and", "the", "go" etc.
  8. Ideally transferable to other languages e.g. doesn't rely on language-specific features e.g. metaprogramming
  9. Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)

这将是额外的冷静,如果它可以处理多达8000个字符在几秒钟之内,这样就可以实时运行,但是我已经问够了!

It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!

只是在寻找建议和方向:

  • 如果这被看作是两个独立的算法?
  • 是否有任何既定的算法,我可以遵循?
  • 是我的要求是否可行?

非常感谢。

P.S。该字符串将被从SQL Server 2008 R2数据库检索,因此,最好的语言将不得不支持这一点,如果没有,那么它必须能够读取/写入stdout,管道,流或文件等。

P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.

推荐答案

涉及的逻辑,使得复杂的T-SQL编程。选择像C#语言。第一次尝试做一个简单的桌面应用程序。以后,如果你发现所有记录加载到这个应用程序是太慢了,你可以写的是SQL服务器上执行一个C#的存储过程。根据不同的SQL服务器的安全策略,它需要有一个较强的键

The logic involved makes it complicated to be programmed in T-SQL. Choose a language like C#. First try to make a simple desktop application. Later, if you find that loading all the records to this application is too slow, you could write a C# stored procedure that is executed on the SQL-Server. Depending on the security policy of the SQL-Server, it will need to have a strong key.

现在的算法。排除的关键字列表通常被称为停用词列表。如果你做一些谷歌上搜索这个搜索词,你可能会发现车站的单词列表,你可以开始。这些停用词添加到的HashSet< T> (我将使用C#这里)

To the algorithm now. A list of excluded words is commonly called a stop word list. If you do some googling for this search term, you might find stop word lists you can start with. Add these stop words to a HashSet<T> (I'll be using C# here)

HashSet<string> stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
string[] lines = File.ReadAllLines("C:\stopwords.txt");
foreach (string s in lines) {
    stopWords.Add(s); // Assuming that each line contains one stop word.
}

之后,你可以看看,如果一个关键词候选人处于停止单词列表

Later you can look if a keyword candidate is in the stop word list with

If (!stopWords.Contains(candidate)) {
    // We have a keyword
}

HashSets快。它们具有O(1)的访问时间,这意味着做一个查找所需的时间不依赖于在数项包含

HashSets are fast. They have an access time of O(1), meaning that the time required to do a lookup does not depend on the number items it contains.

寻找的关键字可以很容易地用正则表达式完成。

Looking for the keywords can easily be done with Regex.

string text = ...; // Load text from DB
MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
                                        RegexOptions.IgnoreCase);
foreach (Match match in matches) {
    if (!stopWords.Contains(match.Value)) {
        ProcessKeyword(match.Value); // Do whatever you need to do here
    }
}

如果你发现亚利桑那州的限制太大,信件,需要重音字母,您可以更改正则表达式EX pression到 @\ p {L〕(?[:] \ p { L〕)*。字符类 \ p {L〕包含了所有字母和字母的改性剂。

If you find that a-z is too restrictive for letters and need accented letters you can change the regex expression to @"\p{L}([:']?\p{L})*". The character class \p{L} contains all letters and letter modifiers.

短语比较复杂。你可以尝试将文本首先被分为短语,然后应用关键字搜索这些短语,而不是搜索关键字的全部文本。这将给你的关键字数量在一个短语的同时

The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.

分割文成短语包括搜索句子结尾的。要么 ?要么 !要么 :。你应该排除出现一个字内点和冒号。

Splitting the text into phrases involves searching for sentences ending with "." or "?" or "!" or ":". You should exclude dots and colons that appear within a word.

string[] phrases = Regex.Split(text, @"[\.\?!:](\s|$)");

这将搜索标点符号其次无论是一个空白或​​行结束。但我必须承认,这是不完美的。它可能会错误地检测为缩写句末。你将不得不做实验,以完善的分裂机制。

This searches punctuations followed either by a whitespace or an end of line. But I must agree that this is not perfect. It might erroneously detect abbreviations as sentence end. You will have to make experiments in order to refine the splitting mechanism.

这篇关于算法找到一个字符串的关键词和关键短语的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆