Regex to parse an IMDb page and get the name


Question

I'm not very good at regex and have looked everywhere I could. I could use some help parsing this page (http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3) to get the movie names. P.S.: A dummy regex would do too.

Recommended answer

Short Answer

This is almost the same problem as your previous question, and the answer is the same, albeit with a modified regex.

#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

https://stackoverflow.com/a/19600974/2573622

For more information, you might want to check out the following link:

http://www.regular-expressions.info/

Click Tutorial in the top menu bar; it explains just about everything regex-related.

First, you have to get the relevant HTML (for one movie) from the page...

<td class="number">RANK.</td>
  <td class="image">
    <a href="/title/tt000000/" title="FILM TITLE (YEAR)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="FILM TITLE (YEAR)" title="FILM TITLE (YEAR)"></a>
  </td>
  <td class="title">
    

<span class="wlb_wrapper" data-tconst="tt000000" data-size="small" data-caller-name="search"></span>

    <a href="/title/tt000000/">FILM TITLE</a>

You then strip out the noise and the changeable info...

<td class="number">RANK.</td>.*?<a href="/title/tt\d+/">FILM TITLE</a>

Then add your capture groups...

<td class="number">(RANK).</td>.*?<a href="/title/tt\d+/">(FILM TITLE)</a>

And that's it:

#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s

The s modifier after the closing pattern delimiter makes . match newlines as well.
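To see what the s modifier changes, the pattern can be tried on a small hand-written fragment. This is only a sketch: the sample markup, the rank 1 and the title Argo are placeholders modeled on the structure shown above, not data fetched from IMDb.

```php
<?php
// Hypothetical fragment modeled on the markup above; not fetched from IMDb.
$html = <<<HTML
<td class="number">1.</td>
  <td class="title">
    <a href="/title/tt000000/">Argo</a>
HTML;

$pattern = '<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>';

// Without s, "." stops at line breaks, so ".*?" cannot span the rows.
$withoutS = preg_match('#' . $pattern . '#', $html, $m1);

// With s, "." also matches newlines and the pattern reaches the link.
$withS = preg_match('#' . $pattern . '#s', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
print_r($m2);        // [1] holds the rank, [2] holds the title
```

Because the rank cell and the title link sit on different lines in the page source, the pattern only matches when the s modifier is present.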

Same as the previous answer (with the modified regex):

// Download the search results page.
$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');

// Capture the rank (group 1) and the title (group 2) of every film.
preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);

// Build a rank => title map.
$filmList = array_combine($matches[1], $matches[2]);

Then you can do:

echo $filmList[1];

/**
Output:

Argo

*/

echo array_search("The Artist", $filmList);

/**
Output:

2

*/
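
The same lookups can be tried offline. The sketch below runs the pattern against a hand-written two-row fragment (the markup, title ids and ranks are placeholders, not real IMDb output). It also shows that array_search returns false when a title is absent, so the result is best compared with ===.

```php
<?php
// Placeholder markup imitating two result rows; ids and ranks are invented.
$page = <<<HTML
<td class="number">1.</td>
  <td class="title">
    <a href="/title/tt000001/">Argo</a>
<td class="number">2.</td>
  <td class="title">
    <a href="/title/tt000002/">The Artist</a>
HTML;

preg_match_all(
    '#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s',
    $page,
    $matches
);

// $matches[1] holds the ranks, $matches[2] the titles, in page order.
$filmList = array_combine($matches[1], $matches[2]);

echo $filmList[1], "\n";                          // Argo
echo array_search('The Artist', $filmList), "\n"; // 2

// array_search returns false for a missing title, so test with ===.
$rank = array_search('No Such Film', $filmList);
echo $rank === false ? "not found" : $rank, "\n"; // not found
```

The numeric strings captured as ranks become integer keys after array_combine, which is why $filmList[1] works with an integer index.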

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search
