In regex, how to write the match of the least number of repetitions?


Problem description


I have a program that speaks text:

Text To Speech For Windows

I want a Regex to detect sources like

(Lee & others, 2012)



or

(Lee & others, 2012a)





in the text, so that the program can skip speaking them.

What I have tried:

I have tried this Regex but it doesn't work well...

\(.*?\,\s\d\d\d\d([a-d]?)\)

I want the .*? part to be replaced with something that matches the fewest possible repetitions, so that normal text is not left unspoken. My Regex sometimes skips from the opening parenthesis of one source all the way to the closing parenthesis of the next source.

The .*? part is not greedy, but I still get many repetitions; I want the fewest repetitions possible between the two parentheses.

Solution

I think you need to check that: when I try your Regex in Expresso, or a simple app:

string input = @"I want a Regex to detect sources like (Lee & others, 2012) I want a Regex to detect sources like (Lee & others, 2012) I want a Regex to detect sources like (Lee & others, 2012)";
string output = Regex.Replace(input, @"\(.*?\,\s\d\d\d\d([a-d]?)\)", "%%");
Console.WriteLine(output);

I get exactly what I expect, and what you asked for:

I want a Regex to detect sources like %% I want a Regex to detect sources like %% I want a Regex to detect sources like %%

I.e. it detects each bracket pair correctly (the "%%" are only there so you can see what was removed - it works fine with "" as well).

So have a look at your sample data, and try cutting it down until you have a minimum subset which still displays the problem - it may not be what you think it is!

"Would you please try it on this:

and her colleagues
(2013) used carbon dating to measure the
growth of new neurons in the brains of
people aged 19 to 92. More than 50 years
ago, aboveground testing of nuclear bombs
released carbon-14, or C-14, into the atmosphere. Since these nuclear tests were
banned in 1963, levels of C-14 in the atmosphere have declined at a regular and
well-known rate. Measuring the amount of C-14 concentration in neurons therefore
provided a "time-stamp" for the neurons, allowing researchers to determine when
the neurons had been generated.
The carbon-14 signature was unmistakable: Spalding and her colleagues found
clear evidence of neurogenesis after birth and, in fact, throughout the lifespan. They
were also able to calculate the rate of neurogenesis in a specific region of the brain,
called the hippocampus, a brain region involved in learning and memory. It turned
out that an average of 1,400 new neurons were being generated each day. The rate
declined only slightly with age. Other regions of the brain, however, did not show
evidence of neurogenesis.
Research on neurogenesis in humans and animals has uncovered a number of
intriguing findings. It is now generally accepted that newborn neurons develop into
mature functioning neurons in at least two regions of the human brain—the hippocampus
and the olfactory bulb, responsible for odor perception (Lee & others, 2012).
These newly generated"


Well, that works as well - it eliminates everything between the first open bracket and the first close bracket that is preceded by a comma, a space and a four-digit year - which is exactly what you told it to do!
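A minimal sketch of why this happens, together with a simpler alternative that is not the fix proposed in this thread: a lazy .*? still starts at the leftmost "(", so it only limits how far the match runs to the right. Swapping "." for the negated class [^()] keeps the match inside a single bracket pair, assuming citations never contain nested brackets. The class name, method names and shortened sample text are illustrative only, and the first pattern is the question's pattern written with \d{4}.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVsBounded
{
    // The question's pattern: lazy, but still anchored at the leftmost "(".
    public static string StripLazy(string text) =>
        Regex.Replace(text, @"\(.*?,\s\d{4}[a-d]?\)", "%%");

    // Assumed alternative: [^()] cannot cross a bracket,
    // so the match stays inside one bracket pair.
    public static string StripBounded(string text) =>
        Regex.Replace(text, @"\([^()]*,\s\d{4}[a-d]?\)", "%%");

    static void Main()
    {
        string input = "colleagues (2013) used carbon dating; see odor perception (Lee & others, 2012). These";

        Console.WriteLine(StripLazy(input));
        // colleagues %%. These

        Console.WriteLine(StripBounded(input));
        // colleagues (2013) used carbon dating; see odor perception %%. These
    }
}
```

Note that the bounded version leaves "(2013)" in place, because it only removes brackets containing a comma, a space and a year.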
One way to solve this is to use Balancing Groups - which I'm *not* going to explain :laugh: - and try and remove anything inside the brackets that ends with four digits:

\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\d{4}\)


But ... a regex is probably the wrong approach - you may need a more complicated natural language processor or you will possibly miss some edge cases.

I'd suggest that you get a copy of Expresso - it's free, and it examines and generates Regular expressions.

"I use Expresso, sorry, the regex you gave doesn't work at all..."

:doh: I missed that it captured the first one (2013) but not the second ... I had the digits in the wrong place :O
Try this:

\((?>\((?<c>)|[^()]*\d{4}|\)(?<-c>))*(?(c)(?!))\)

Which should catch both cases. Sorry about that ...

"I'm sorry, I didn't quite understand. Would it be hard for you to do this without explaining: adding the possibility of 1 letter after the year number? I once worked with Regex Matches and groups, but they're not needed now..."

Try:

\((?>\((?<c>)|[^()]*\d{4}[a-zA-Z]?|\)(?<-c>))*(?(c)(?!))\)
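As a rough sketch of how that last pattern could be plugged into the text before it is spoken, assuming the whole text is available as one string: the CitationStripper class, the Strip method and the sample sentence are hypothetical, and only the pattern itself comes from the answer above. The (?<c>) and (?<-c>) balancing-group syntax is specific to .NET, so the pattern will not carry over unchanged to other regex engines.

```csharp
using System;
using System.Text.RegularExpressions;

static class CitationStripper
{
    // Pattern from the answer above: a bracketed span whose content
    // ends in a four-digit year plus an optional single letter.
    private static readonly Regex Citation = new Regex(
        @"\((?>\((?<c>)|[^()]*\d{4}[a-zA-Z]?|\)(?<-c>))*(?(c)(?!))\)");

    // Removes every matching span so it is never passed on to be spoken.
    public static string Strip(string text) => Citation.Replace(text, "");

    static void Main()
    {
        string sample = "Spalding (2013) found new neurons (Lee & others, 2012a) in the hippocampus (a brain region).";
        Console.WriteLine(Strip(sample));
    }
}
```

The spaces on either side of a removed span are left in place, so a follow-up pass to collapse double spaces may be wanted. Brackets without a trailing year, such as "(a brain region)", are kept.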

