How do I tokenize this string in Ruby?

Views: 75 · Published: 2020/5/25 0:31:03 · Tags: ruby, parsing, tokenize, text-parsing


Question

I have this string:

%{Children^10 Health "sanitation management"^5}

I want to tokenize it into an array of hashes:

[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]

I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.

Any pointers?

Recommended answer

For a real language, a lexer's the way to go, like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:

irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
       { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
     end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]

If you're trying to parse a regular language, this method will suffice, though it wouldn't take many more complications to make the language non-regular.
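Since the question mentions StringScanner, here is a minimal sketch of the same tokenization written as an explicit scanner loop, which tends to be easier to extend once the input outgrows a single regex. The method name `tokenize_with_scanner` is hypothetical, and unlike the `String#scan` hack, this version assumes that unexpected input should raise an error rather than be skipped silently.

```ruby
require 'strscan'

# Hypothetical helper: tokenizes a keyword query with a StringScanner loop.
def tokenize_with_scanner(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)              # ignore whitespace between tokens
    break if scanner.eos?

    if scanner.scan(/"((?:\\.|[^\\"])*)"/)
      keywords = scanner[1]          # contents of the quoted phrase
    elsif scanner.scan(/\w+/)
      keywords = scanner.matched     # a bare single-word keyword
    else
      # Assumption: reject anything that is neither a word nor a phrase.
      raise ArgumentError, "unexpected input at position #{scanner.pos}"
    end

    # An optional ^N directly after the keyword sets the boost.
    boost = scanner.scan(/\^(\d+)/) ? scanner[1].to_i : nil
    tokens << { :keywords => keywords.downcase, :boost => boost }
  end
  tokens
end

result = tokenize_with_scanner(%{Children^10 Health "sanitation management"^5})
```

Here `result` should hold the three hashes from the question. The loop structure leaves an obvious place to add new token types (operators, parentheses) as further `elsif` branches.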

A quick breakdown of the regex:

\w+ matches any single-term keyword.
(?:\\.|[^\\"])* uses non-capturing parentheses ((?:...)) to match the contents of an escaped double-quoted string: either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.
"((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.
(?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword, single term or phrase, capturing single terms into $1 and phrase contents into $2.
\d+ matches a number.
\^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.
(?:\^(\d+))? captures a number following a caret if it's there, and matches the empty string otherwise.
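The pieces above can also be laid out with Ruby's /x (extended) flag, which ignores whitespace inside the pattern and allows comments. This is only a sketch of the same regex in a more readable form; the constant name `TOKEN_PATTERN` is hypothetical.

```ruby
# Same pattern as in the answer, spread out with the x flag.
TOKEN_PATTERN = /
  (?:
    (\w+)                    # group 1: a single-word keyword
    |
    "((?:\\.|[^\\"])*)"      # group 2: contents of a quoted phrase
  )
  (?:\^(\d+))?               # group 3: optional boost after a caret
/x

phrase = TOKEN_PATTERN.match('"sanitation management"^5')
word   = TOKEN_PATTERN.match('Health')

phrase[1]  # nil, because the single-word branch did not match
phrase[2]  # "sanitation management"
phrase[3]  # "5"
word[1]    # "Health"
word[3]    # nil, because there is no caret suffix
```

Note that the boost comes back as a string ("5"), which is why the answer converts it with to_i afterwards.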

String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured, so $1 becomes match[0], $2 becomes match[1], etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
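A small illustration of that behaviour, using a made-up input string and pattern that are unrelated to the keyword syntax:

```ruby
# Without capture groups, scan returns the matched substrings themselves.
plain = "a=1 b".scan(/\w/)

# With capture groups, each match becomes an array of the captures,
# and a group that did not participate in the match shows up as nil.
pairs = "a=1 b c=3".scan(/(\w)(?:=(\d))?/)

plain  # ["a", "1", "b"]
pairs  # [["a", "1"], ["b", nil], ["c", "3"]]
```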

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could have done do |match| ; word,phrase,boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer, while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
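To make the "block magic" concrete, the sketch below feeds a hand-written array of matches (standing in for the output of scan) through both forms of the block, and shows why the nil check on boost matters. The variable names are illustrative only.

```ruby
# Stand-in for what scan returns: [word, phrase, boost] per match.
matches = [["Children", nil, "10"], [nil, "sanitation management", nil]]

# Block parameters destructure each inner array automatically...
implicit = matches.map { |word, phrase, boost| [word || phrase, boost] }

# ...which is equivalent to taking the whole match and splatting it by hand.
explicit = matches.map do |match|
  word, phrase, boost = *match
  [word || phrase, boost]
end

# nil.to_i is 0, so calling to_i unconditionally would turn a missing
# boost into 0 instead of leaving it nil.
guarded   = matches.map { |_, _, boost| boost.nil? ? nil : boost.to_i }
unguarded = matches.map { |_, _, boost| boost.to_i }

guarded    # [10, nil]
unguarded  # [10, 0]
```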
