Matching TV and Movie File Names with Regex


Problem description

I've been working on a regular expression that grabs the TV show or movie name, the year it aired (if it exists), the season number, and the episode number from the file name of a video. I have a regular expression (below) that seems to work well, for both movies and TV shows, on names that contain two years (one year is part of the show/movie name, the other is the year it aired). For TV shows, it can grab the season and episode numbers if the format is SXXEXX or XXX. I've been testing it in the regex101.com test engine. Where I'm struggling is that the expression won't return anything if a year does not exist in the file name. Also, if the file name has a 4-digit number that is actually part of the show name, it treats that number as the aired year (e.g. "The 4400"). How can I modify this expression to handle the extra conditions I described?

The end goal is to put this into a Python script that queries a site like TheTVDB.com for whether the file is a movie or a TV show, so that I can sort my vast video library into TV Shows and Movies folders.

(?P<ShowName>.*)[ (_.]#Show Name
       (?=19[0-9]\d|20[0-4]\d|2050) #If after the show name is a year
          (?P<ShowYear>\d{4,4}) # Get the show year
          | # Else
          (?=S\d{1,2}E\d{1,2}) 
             S(?P<Season>\d{1,2})E(?P<Episode>\d{1,2}) #Get the season and Episode information
             |
             (\d{1})E(\d{1,2})

Here is the test data I'm using:

  • archer.2009.S04E13
  • space 1999 1975
  • Space: 1999 (1975)
  • Space.1999.1975.S01E01
  • space 1999.(1975)
  • The.4400.204.mkv
  • space 1999 (1975) v.2009.S01E13.the.title.avi
  • Teen.wolf.S04E12.HDTV.x264
  • Se7en.(1995).avi
  • How to train your dragon 2

The regular expression does not work properly with the following test data:

  • The.4400.204.mkv
  • Teen.wolf.S04E12.HDTV.x264
  • How to train your dragon 2

Update: Here is the new expression based on the comments. It works much better but is struggling with the 3 file names listed below the expressions.

(?P<ShowName>.*)#Show Name
(
   [ (_.]
   (
       (?=\d{4,4}) #If after the show name is a year
          (?P<ShowYear>\d{4})  # Get the show year
          | # Else no year in the file name then just grab the name
          (?P<otherShowName>.*) # Grab Show Name
          (?=S\d{1,2}E\d{1,2}) # If the Season Episode patterns matches SX{1,2}EX{1,2}, Then
             S(?P<Season>\d{1,2})E(?P<Episode>\d{1,2}) #Get the season and Episode information
             | # Else
             (?P<Alt_S_E>\d{3,4}) # Get the season and Episode that looks like 211
   )
|$)

  • Se7en
  • 10,000BC (2010)
  • v.2009.S01E13.the.title.avi
  • archer.2009.S04E13

Recommended answer

      I made some modifications to your regex, and it seems to work, if I understood you correctly.

      ^(
        (?P<ShowNameA>.*[^ (_.]) # Show name
          [ (_.]+
          ( # Year with possible Season and Episode
            (?P<ShowYearA>\d{4})
            ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
          | # Season and Episode only
            (?<!\d{4}[ (_.])
            S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
          | # Alternate format for episode
            (?P<EpisodeC>\d{3})
          )
      |
        # Show name with no other information
        (?P<ShowNameB>.+)
      )
      

      See it on regex101.
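To try the pattern outside regex101, a minimal Python sketch follows. It assumes the pattern is compiled with `re.VERBOSE` (needed because of the inline comments and layout) and anchored with `match`. The helper name `parse` is hypothetical and not part of the original answer.

```python
import re

# The answer's pattern, copied as-is. re.VERBOSE ignores the layout whitespace
# and the "#" comments, but keeps the literal space inside each [ (_.] class.
PATTERN = re.compile(r"""
^(
  (?P<ShowNameA>.*[^ (_.]) # Show name
    [ (_.]+
    ( # Year with possible Season and Episode
      (?P<ShowYearA>\d{4})
      ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
    | # Season and Episode only
      (?<!\d{4}[ (_.])
      S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
    | # Alternate format for episode
      (?P<EpisodeC>\d{3})
    )
|
  # Show name with no other information
  (?P<ShowNameB>.+)
)
""", re.VERBOSE)


def parse(filename):
    """Return only the named groups that took part in the match."""
    m = PATTERN.match(filename)
    if m is None:
        return {}
    return {key: value for key, value in m.groupdict().items() if value is not None}


if __name__ == "__main__":
    for name in [
        "archer.2009.S04E13",
        "Teen.wolf.S04E12.HDTV.x264",
        "The.4400.204.mkv",
        "How to train your dragon 2",
    ]:
        print(name, "->", parse(name))
```

Because each layout has its own group names (A, B, C), the returned keys also show which branch of the alternation matched.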

      Edit: I've updated the regex to handle those last 3 situations you mentioned in the comments.

      One main problem was that you had no parens around the main alternation, so it included the whole regex. I also had to add an alternation to allow for none of the year/episode formats following the name.
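The effect of the missing parentheses can be seen with a much smaller pair of patterns. The two patterns below are simplified illustrations written for this example, not the expressions from the question: in the first, `|` splits the entire expression in two, so the season/episode branch has no show name in front of it; in the second, the alternation is grouped so that both branches share the name and delimiter.

```python
import re

# Without a group, "|" has the lowest precedence: the pattern means
# "name + delimiter + year" OR "a bare SxxExx at the start of the string".
ungrouped = re.compile(
    r"(?P<name>.*)[ .](?P<year>\d{4})|S(?P<season>\d{2})E(?P<episode>\d{2})"
)

# With a group, both alternatives follow the same name and delimiter.
grouped = re.compile(
    r"(?P<name>.*?)[ .](?:(?P<year>\d{4})|S(?P<season>\d{2})E(?P<episode>\d{2}))"
)

sample = "Teen.wolf.S04E12"

# Neither top-level branch can match from the start of the string.
print(ungrouped.match(sample))

# The grouped version finds the name and the season/episode pair.
m = grouped.match(sample)
print(m.group("name"), m.group("season"), m.group("episode"))
```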

      Because you have so many different possible layouts that possibly conflict with each other, the regex ended up being lots of alternation of different scenarios. For example, to match a title that has no year or episode information at all, I had to add an alternation around the whole regex that if it can't find any known pattern, just match the whole thing.

      Note: now that you seem to have expanded show years to match any four digits, there's no need for the lookahead. In other words, (?=\d{4,4})(?P<ShowYear>\d{4}) is the same as (?P<ShowYear>\d{4}). This also means that your alternate format for episode must match 3 digits only, not 4. Otherwise, there's no way to distinguish a stand-alone 4-digit sequence as a year or episode.
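A quick way to check the claim about the redundant lookahead is to run both forms over a few of the test names and compare where they match. This is a small sketch; the helper `year_span` is a hypothetical name introduced for the comparison.

```python
import re

with_lookahead = re.compile(r"(?=\d{4,4})(?P<ShowYear>\d{4})")
without_lookahead = re.compile(r"(?P<ShowYear>\d{4})")


def year_span(pattern, text):
    """Return the (start, end) span of the first ShowYear match, or None."""
    m = pattern.search(text)
    return m.span("ShowYear") if m else None


SAMPLES = [
    "archer.2009.S04E13",
    "The.4400.204.mkv",
    "Teen.wolf.S04E12.HDTV.x264",
    "Se7en.(1995).avi",
]

for text in SAMPLES:
    # Both patterns report the same span (or both report None).
    print(text, year_span(with_lookahead, text), year_span(without_lookahead, text))
```

The same run also shows the ambiguity mentioned above: for "The.4400.204.mkv" the first four-digit run is "4400", which a bare `\d{4}` cannot tell apart from a year.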

      General patterns:

      [ (_.]+                   the delimiter used throughout
      (?P<ShowNameA>.*[^ (_.])  the show name, greedy but not including a delimiter
      (?P<ShowNameB>.+)         the show name when it's the whole line
      

      Format A (Year with possible Season and Episode):

      (?P<ShowYearA>\d{4})
      ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
      

      Format B (Season and Episode only):

      (?<!\d{4}[ (_.])
      S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
      

      Format C (Alternate format for episode):

      (?P<EpisodeC>\d{3})
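Since each format fills its own set of groups, a script that consumes the match will probably want to fold the A/B/C groups into one record. The sketch below is one possible way to do that. The function name `normalize` and the output keys are hypothetical, and splitting `EpisodeC` into a one-digit season plus a two-digit episode is an assumption based on the question's description of the XXX format (e.g. 211), not something the answer states.

```python
import re

# Same pattern as in the answer, compiled in verbose mode.
PATTERN = re.compile(r"""
^(
  (?P<ShowNameA>.*[^ (_.]) # Show name
    [ (_.]+
    ( # Year with possible Season and Episode
      (?P<ShowYearA>\d{4})
      ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
    | # Season and Episode only
      (?<!\d{4}[ (_.])
      S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
    | # Alternate format for episode
      (?P<EpisodeC>\d{3})
    )
|
  # Show name with no other information
  (?P<ShowNameB>.+)
)
""", re.VERBOSE)


def normalize(filename):
    """Fold the per-format groups into one dict, or return None if nothing matched."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    g = m.groupdict()
    season = g["SeasonA"] or g["SeasonB"]
    episode = g["EpisodeA"] or g["EpisodeB"]
    if g["EpisodeC"]:
        # Assumption: "204" means season 2, episode 04.
        season, episode = g["EpisodeC"][0], g["EpisodeC"][1:]
    return {
        "name": g["ShowNameA"] or g["ShowNameB"],
        "year": int(g["ShowYearA"]) if g["ShowYearA"] else None,
        "season": int(season) if season else None,
        "episode": int(episode) if episode else None,
    }


if __name__ == "__main__":
    for name in ["The.4400.204.mkv", "archer.2009.S04E13", "Se7en.(1995).avi"]:
        print(name, "->", normalize(name))
```

The name is returned exactly as captured, so delimiters such as "." inside it (for example "The.4400") are left for a later cleanup step.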
      
