Regex to parse an IMDb page and get the name
Question
I'm not very good at regex and have looked everywhere I could. I could use some help parsing this page (http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3) to get the movie names. P.S.: A dummy regex would be fine too.
Recommended answer
Short Answer
This is almost the same problem as your previous question, and the answer is the same, albeit with a modified regex.
#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s
https://stackoverflow.com/a/19600974/2573622
For more information, you might want to check out the following link:
http://www.regular-expressions.info/
Click Tutorial in the top menu bar; it explains just about everything regex-related.
First, you have to get the relevant HTML (for one movie) from the page:
<td class="number">RANK.</td>
<td class="image">
<a href="/title/tt000000/" title="FILM TITLE (YEAR)"><img src="http://imdb.com/path-to-image.jpg" height="74" width="54" alt="FILM TITLE (YEAR)" title="FILM TITLE (YEAR)"></a>
</td>
<td class="title">
<span class="wlb_wrapper" data-tconst="tt000000" data-size="small" data-caller-name="search"></span>
<a href="/title/tt000000/">FILM TITLE</a>
You then strip out the noise and the parts that change from row to row:
<td class="number">RANK.</td>.*?<a href="/title/tt\d+/">FILM TITLE</a>
Then add your capture groups:
<td class="number">(RANK).</td>.*?<a href="/title/tt\d+/">(FILM TITLE)</a>
And that's it:
#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s
The s modifier after the ending pattern delimiter makes . match newlines as well.
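To see what the s modifier changes, here is a minimal sketch that runs the same pattern with and without it. The two-line $html fragment is a hypothetical stand-in modelled on the markup above, not real IMDb output, and the sketch escapes the dot after the rank (\.), a small tightening that is not in the original pattern.

```php
<?php
// Hypothetical fragment: the rank cell and the title link sit on separate lines.
$html = "<td class=\"number\">1.</td>\n<a href=\"/title/tt000000/\">FILM TITLE</a>";

// Pattern body without delimiters, so both variants can share it.
$pattern = '<td class="number">(\d+)\.</td>.*?<a href="/title/tt\d+/">(.*?)</a>';

// Without s, "." cannot cross the newline between the two tags, so nothing matches.
$withoutS = preg_match('#' . $pattern . '#', $html, $m1);

// With s, "." also matches the newline, so the lazy ".*?" can bridge the gap.
$withS = preg_match('#' . $pattern . '#s', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo $m2[2], "\n";   // FILM TITLE
```

Because the rank and the title are separated by several lines of markup on the real page, the pattern would find nothing there without s.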
Same as the previous answer (with the modified regex):
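// Fetch the search results page (needs allow_url_fopen enabled in php.ini).
$page = file_get_contents('http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_3');
// Capture the rank (group 1) and the film title (group 2) for every row.
preg_match_all('#<td class="number">(\d+).</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s', $page, $matches);
// Build a rank => title map.
$filmList = array_combine($matches[1], $matches[2]);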
You can then do this:
echo $filmList[1];
/**
Output:
Argo
*/
echo array_search("The Artist", $filmList);
/**
Output:
2
*/
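The same steps can be tried offline. The following is only a sketch: the two sample rows in $page are made up to follow the structure shown earlier (the tt000001 and tt000002 IDs and image names are placeholders), and the live IMDb markup may differ or have changed since. It also checks the return value of preg_match_all, which the code above does not, and escapes the dot after the rank.

```php
<?php
// Made-up sample rows following the structure shown earlier; not real IMDb output.
$page = <<<'HTML'
<td class="number">1.</td>
<td class="image">
<a href="/title/tt000001/" title="Argo (2012)"><img src="a.jpg" alt="Argo (2012)"></a>
</td>
<td class="title">
<a href="/title/tt000001/">Argo</a>
<td class="number">2.</td>
<td class="image">
<a href="/title/tt000002/" title="The Artist (2011)"><img src="b.jpg" alt="The Artist (2011)"></a>
</td>
<td class="title">
<a href="/title/tt000002/">The Artist</a>
HTML;

// The image link is skipped because its href is followed by a title attribute,
// so only the plain <a href="/title/tt.../"> link satisfies the pattern.
$pattern = '#<td class="number">(\d+)\.</td>.*?<a href="/title/tt\d+/">(.*?)</a>#s';
$count = preg_match_all($pattern, $page, $matches);

if ($count === false || $count === 0) {
    // preg_match_all returns false on error and 0 when nothing matched.
    exit("No films found; the markup may have changed.\n");
}

// Ranks become the keys, titles the values.
$filmList = array_combine($matches[1], $matches[2]);

echo $filmList[1], "\n";                          // Argo
echo array_search('The Artist', $filmList), "\n"; // 2
```

Checking the match count first matters when scraping: if the site changes its markup, the pattern silently matches nothing and $filmList ends up empty.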
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://php.net/file_get_contents
http://php.net/preg_match_all
http://php.net/array_combine
http://php.net/array_search