Parsing / Extracting Text from a String in Rails?
Question
I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, such as "War & Peace by Leo Tolstoy".
Is this a matter of using a regex and lifting the text between "#books" and "."?
What if the message has no structure, for example "This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!" or "This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"? How can I reliably pull out the phrase "War & Peace by Leo Tolstoy" without knowing the phrase in advance?
Are there any gems, methods, etc. that can help me do this?
At the very least, what would you call what I'm trying to do? Knowing the term would help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.
--- edit --- Based on @rogeliog's suggestion, I will add the following:
I can live with garbage text that comes after #books, but not with any before it. I tried match(/#books.*/); results here: www.rubular.com/r/gM7oSZxF5M.
But how can I capture Result #6 (i.e., when someone puts #books at the end of the sentence)?
Is there a way for me to do an if-then with regex? Something like:
if [#books is at the end of the message],
then [take the last 10 words preceding #books],
else [match(/#books.*/)]
If you offer a regex, please post your solution as a rubular.com permalink.
Recommended answer
I think you are trying to parse some pretty complex variations. Do you have a DB with all the book titles? That would help a lot.
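If such a list of titles is available, one possible approach is a plain lookup rather than a regex. The following is only a sketch: KNOWN_TITLES and find_known_title are hypothetical names, and the array stands in for whatever database table would actually hold the titles.

```ruby
# Hypothetical list standing in for a database of known titles.
KNOWN_TITLES = [
  "War & Peace by Leo Tolstoy",
  "Anna Karenina by Leo Tolstoy"
].freeze

# Returns the first known title that appears in the text
# (case-insensitive), or nil when none of them appears.
def find_known_title(text, titles = KNOWN_TITLES)
  haystack = text.downcase
  titles.find { |title| haystack.include?(title.downcase) }
end

puts find_known_title("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
```

This works wherever #books sits in the message, but only for titles that are already in the list and written the same way.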
To get the title out of the first example ("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!"), you can simply do:
"This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book".match(/#books.*?\./).to_s.gsub("#books", '')
That will return: " War & Peace by Leo Tolstoy."
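As a variation on the same idea, a capture group can return the phrase without the hashtag, the leading space, or the trailing period. This is a sketch only; phrase_after_books is a hypothetical helper name, and it assumes the phrase ends at the first period after #books.

```ruby
# Returns the text between "#books" and the next period,
# or nil when the message does not contain that pattern.
def phrase_after_books(text)
  match = text.match(/#books\s+(.*?)\./)
  match && match[1]
end

puts phrase_after_books("This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!")
```

Because the match stops at the first period, a title that itself contains a period would be cut short.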
If you want an if/else statement that depends on whether #books is at the end or not, you can do:
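if text.match(/#books\s*\z/)
  # #books ends the message: print up to 10 words that precede it
  puts text[/((?:\S+\s+){1,10})#books\s*\z/, 1].to_s.strip
else
  # otherwise print everything that follows #books
  puts text[/#books(.*)/, 1].to_s.strip
end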
That gives you the last 10 words preceding #books if #books is at the end of the message, and whatever comes after #books if it is not.
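The same branching could be wrapped in a method that returns a string instead of printing, which makes it easier to check against sample messages. This is a minimal sketch; text_near_books and its max_words parameter are hypothetical, with the default of 10 taken from the description above.

```ruby
# Hypothetical helper: returns the text next to "#books".
# max_words is an assumed parameter (default 10, as in the answer).
def text_near_books(text, max_words = 10)
  if text.match?(/#books\s*\z/)
    # Hashtag ends the message: keep up to max_words words before it.
    text[/((?:\S+\s+){1,#{max_words}})#books\s*\z/, 1].to_s.strip
  else
    # Otherwise keep whatever follows the hashtag.
    text[/#books(.*)/m, 1].to_s.strip
  end
end

puts text_near_books("This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books")
puts text_near_books("This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!")
```

A message without #books yields an empty string, so callers can treat an empty result as "nothing found".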
I don't really have a better idea. Hope that works for you, let me know :)