MongoDB - Extract Data from Regex
Problem description
I have an assignment where I need to retrieve data from some Twitter posts using MongoDB, and I have been stuck on a problem for a few hours now. I need to extract the mentioned user (on Twitter you write @TheirUsername to mention someone) and am having a hard time doing so. I've tried using substrCP and finding the index where the "@" begins, but I can't figure out how to find where the mention ends, since names have different lengths and any character can follow the name, such as "?" or ".".
So I have been using the regex pattern /@\w+/ to find out whether the tweet contains an @ symbol followed by some word. This works really well for detecting whether the tweet has an @Someone in it, but I still cannot figure out how to "extract" it.
(By the way, I've been using aggregate to do this, so that I can pipe it through $match, then $project, and finally $sort.)
It looks like this:
https://hastebin.com/adohogedil.bash
An example of a string from which the username needs to be extracted is:
"damnnn! @white_cat22 i missed 11:11"
Here I only want the "@white_cat22" part.
After googling a bit, I think a better way to describe it is this: I need to retrieve the substring that the regex pattern matched in the string being tested.
What can I do to extract the mentioned username? Any help would be greatly appreciated! (edited)
Recommended answer
It's a little bit tricky: you have to use the $split and $unwind operators and then $match on @, as below:
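db.tweets.aggregate([
  // Keep only tweets that contain a mention
  {
    $match: { tweet: /@\w+/ }
  },
  // Split each tweet into words on single spaces
  {
    $project: { tweet: { $split: ["$tweet", " "] } }
  },
  // Emit one document per word
  {
    $unwind: "$tweet"
  },
  // Keep only the words that contain a mention
  {
    $match: { tweet: /@\w+/ }
  }
])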
It produces the following result, which is close to what you asked for:
{ "_id" : ObjectId("5c61aee91765cd7b27eb473e"), "tweet" : "@white_cat22" }
{ "_id" : ObjectId("5c61aeee1765cd7b27eb473f"), "tweet" : "@white_cat23" }
{ "_id" : ObjectId("5c61aef61765cd7b27eb4740"), "tweet" : "@cat23" }
{ "_id" : ObjectId("5c61aefd1765cd7b27eb4741"), "tweet" : "@KP" }
{ "_id" : ObjectId("5c61af051765cd7b27eb4742"), "tweet" : "@kpTesting" }
{ "_id" : ObjectId("5c61af091765cd7b27eb4743"), "tweet" : "@kpTesting12" }
{ "_id" : ObjectId("5c61b4791765cd7b27eb4744"), "tweet" : "@kpTesting12" }
For reference, a simple find query on the collection used above returns:
> db.tweets.find()
{ "_id" : ObjectId("5c61aee91765cd7b27eb473e"), "tweet" : "damnnn! @white_cat22 i missed 11:11" }
{ "_id" : ObjectId("5c61aeee1765cd7b27eb473f"), "tweet" : "damnnn! @white_cat23 i missed 11:11" }
{ "_id" : ObjectId("5c61aef61765cd7b27eb4740"), "tweet" : "damnnn! @cat23 i missed 11:11" }
{ "_id" : ObjectId("5c61aefd1765cd7b27eb4741"), "tweet" : "damnnn! @KP i missed 11:11" }
{ "_id" : ObjectId("5c61af051765cd7b27eb4742"), "tweet" : "damnnn! @kpTesting i missed 11:11" }
{ "_id" : ObjectId("5c61af091765cd7b27eb4743"), "tweet" : "damnnn! @kpTesting12 i missed 11:11" }
{ "_id" : ObjectId("5c61b4791765cd7b27eb4744"), "tweet" : "@kpTesting12 i missed 11:11" }
>
The collection also contains a tweet where the username (the @ word) comes first, and the query works as well if the username appears at the end of the tweet.
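One thing worth checking is what happens when punctuation follows the mention, which the question explicitly mentions. The plain-JavaScript sketch below (no MongoDB involved) mimics the split-then-match logic on a token level and compares it with returning only the matched substring. The second sample tweet and the function names are made up for illustration.

```javascript
// Plain-JavaScript emulation of the pipeline's token logic (no MongoDB needed).
// The second tweet is a hypothetical sample chosen to show punctuation handling.
const tweets = [
  "damnnn! @white_cat22 i missed 11:11",
  "are you there @white_cat22? ping @KP.",
];

// Mirrors $split on " " followed by $match on /@\w+/:
// keeps every whole token that contains a mention.
function mentionsBySplit(text) {
  return text.split(" ").filter((token) => /@\w+/.test(token));
}

// Returns only the substrings matched by the pattern.
function mentionsByRegex(text) {
  return text.match(/@\w+/g) || [];
}

for (const t of tweets) {
  console.log(mentionsBySplit(t), mentionsByRegex(t));
}
```

With the split approach the second tweet yields the tokens "@white_cat22?" and "@KP.", punctuation included, because the final $match only filters words and does not trim them. If that matters for your data, the matched substring itself has to be extracted.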
This might be helpful, but you can always optimize this query. I am posting it here just for your understanding; I am not providing the optimized solution for what you need.
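As one possible direction for such an optimization, and assuming a server running MongoDB 4.2 or later, the $regexFindAll aggregation operator can return the matched substrings directly, without splitting on spaces. The sketch below is not part of the original answer; the output field name "mentions" is an arbitrary choice.

```javascript
// Sketch assuming MongoDB 4.2 or later, where $regexFindAll is available.
// The output field name "mentions" is an illustrative choice.
const pipeline = [
  // Keep only tweets that contain at least one mention.
  { $match: { tweet: /@\w+/ } },
  {
    $project: {
      mentions: {
        // $regexFindAll returns an array of { match, idx, captures } documents;
        // $map keeps only the matched text of each one.
        $map: {
          input: { $regexFindAll: { input: "$tweet", regex: /@\w+/ } },
          as: "m",
          in: "$$m.match",
        },
      },
    },
  },
  // Emit one document per mention.
  { $unwind: "$mentions" },
];

// In the mongo shell: db.tweets.aggregate(pipeline)
```

Because the pattern itself delimits the match, trailing characters such as "?" or "." should not end up in the result, and a tweet with several mentions produces one document per mention.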
For more details, please check the references below: