Regex match first and last word or any word


Problem description


    I have a huge file with a list of data such as this:

     #fabulous       7.526   2301    2
     #excellent      7.247   2612    3
     #superb 7.199   1660    2
     #perfection     7.099   3004    4
     #terrific       6.922   629     1
    

    I have a file containing a list of sentences like this:

    Terrific Theo Walcott is still shit, watch Rafa and Johnny deal with him on Saturday.
    its not that I'm a GSP fan, fabulous
    Iranian general says Israel's Iron Dome can't deal with their missiles 
    with J Davlar 11th. Main rivals are team Poland. 
    

    I want to check with regex the following:

    1. Whether the first word of each sentence matches any word in the file. For example, whether Terrific, its, Iranian, and with occur in the file.

    2. Whether the last word of each sentence matches any word in the file. For example, whether Saturday, fabulous, missiles, and Poland occur in the file.

    3. Whether the 2- or 3-character prefix or suffix of an individual word in a sentence matches the 2- or 3-character prefix or suffix of any word in the file. For example, whether Ter, its, Ira, or wi matches a prefix of any word in the file. The same applies to suffixes.

    I am new to regex. This is what I came up with, but it does not give the result I want (term2.lower() is the first column in the file):

        wordanalysis["trail"] = found if re.match(sentence[-1],term2.lower()) else not(found)
        wordanalysis["lead"] = found  if re.match(sentence[0],term2.lower()) else not(found)
    

    Solution

    Update: Per the suggestion by @justhalf, there is no need to use regex to split the words. Remove the .lower() calls and the re.I flag if you want a case-sensitive match.

    This pattern matches the first word and the last word of each sentence. Capture group 2 holds the last word without trailing punctuation or whitespace:

    (^\s?\w+\b|(\b\w+)[\.?!\s]*$)

    Matches:

    MATCH 1-1. Terrific
    MATCH 2-1. Saturday.
            2. Saturday
    MATCH 3-1. its
    MATCH 4-1. fabulous
            2. fabulous
    MATCH 5-1. Iranian
    MATCH 6-1. missiles 
            2. missiles
    MATCH 7-1. with
    MATCH 8-1. Poland. 
            2. Poland
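
To see how the pattern behaves on the sample sentences, here is a minimal Python sketch. The helper name first_and_last is hypothetical and not part of the original answer. It applies the pattern to one line at a time and reads the last word from capture group 2.

```python
import re

PATTERN = re.compile(r"(^\s?\w+\b|(\b\w+)[\.?!\s]*$)")

def first_and_last(line):
    """Return (first_word, last_word) found by PATTERN; either may be None."""
    first = last = None
    for m in PATTERN.finditer(line):
        if m.group(2) is not None:
            # Second alternative matched: group 2 is the word before
            # any trailing punctuation or whitespace.
            last = m.group(2)
        else:
            # First alternative matched: the word at the start of the line
            # (strip the optional leading whitespace character).
            first = m.group(1).strip()
    return first, last
```

For example, first_and_last("its not that I'm a GSP fan, fabulous") should return ("its", "fabulous"). A line that consists of a single word is reported as the first word only, because the first alternative consumes it.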
    

    Implementation:

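    import re, string

    sentences = open("sentences.txt").read().splitlines()
    data = open("data.txt").read()
    # Pattern from above, kept for reference; the loop splits on whitespace instead.
    pattern = re.compile(r"(^\s?\w+\b|(\b\w+)[\.?!\s]*$)")
    # Python 3: str.translate needs a table built with str.maketrans.
    strip_punct = str.maketrans("", "", string.punctuation)
    for line in sentences:
        words = line.strip().split()
        if not words:  # skip blank lines
            continue
        first = words[0].lower()
        last = words[-1].translate(strip_punct).lower()
        # re.escape keeps characters such as "." or "?" from acting as regex syntax.
        if re.search(re.escape(first), data, re.I):
            print("Found " + first + " in data.txt")
        if re.search(re.escape(last), data, re.I):
            print("Found " + last + " in data.txt")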
    

    This probably isn't the most elegant way of doing it, but you get the idea.

    Code is tested and works, output is:

    Found Terrific in data.txt
    Found fabulous in data.txt
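
Because re.search looks for the word anywhere in the file text, a short word can also match inside a longer one. If whole-word matches are wanted, one possible variant is to collect the first column into a set and test membership. This is a sketch under assumptions not stated in the original: the helper names load_terms and check_sentence are hypothetical, and the leading # is assumed not to be part of the word.

```python
import string

def load_terms(data_text):
    """Collect first-column words from rows such as '#fabulous 7.526 2301 2'."""
    terms = set()
    for row in data_text.splitlines():
        fields = row.split()
        if fields:
            # Assumption: the leading '#' is a marker, not part of the word.
            terms.add(fields[0].lstrip("#").lower())
    return terms

def check_sentence(line, terms):
    """Return (first_found, last_found) for one sentence; both False if blank."""
    words = line.split()
    if not words:
        return False, False
    # Strip surrounding punctuation so 'Saturday.' compares as 'saturday'.
    first = words[0].strip(string.punctuation).lower()
    last = words[-1].strip(string.punctuation).lower()
    return first in terms, last in terms
```

Set membership is an exact, case-insensitive comparison here, so "its" would no longer match a file word that merely contains those letters.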
    

    The implementation above does not address your third criterion. Test it and see whether it works for you so far.
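
For the third criterion, one possible approach (not part of the tested answer) is to compare sets of 2- and 3-character prefixes and suffixes without regex. The names affixes and affix_overlap are hypothetical, and terms is assumed to be a collection of file words with any leading # already removed.

```python
def affixes(word, lengths=(2, 3)):
    """Return (prefixes, suffixes) of the given lengths for a lowercased word."""
    w = word.lower()
    prefixes = {w[:n] for n in lengths if len(w) >= n}
    suffixes = {w[-n:] for n in lengths if len(w) >= n}
    return prefixes, suffixes

def affix_overlap(word, terms):
    """Return the prefixes and suffixes that word shares with any term."""
    term_prefixes, term_suffixes = set(), set()
    for term in terms:
        p, s = affixes(term)
        term_prefixes |= p
        term_suffixes |= s
    p, s = affixes(word)
    return p & term_prefixes, s & term_suffixes
```

With terms such as {"terrific", "fabulous"}, a sentence word like "Terrific" shares the prefixes "te" and "ter", while "Iranian" shares none. For a large file it would be cheaper to build the term prefix and suffix sets once rather than on every call.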
