Searching & Extracting Specific text in external webpage via PHP?


Question

I've been trying to simply extract "the next episode number" from a TV episodes tracking website. Here's an example page:

Example page

Scroll down and you'll see "Countdown", "Date", "Season" and "number". I'd like to extract that number.

I've been looking at the source code as well as Simple HTML DOM to try and work something out but I failed multiple times. The "number" has the class "nextEpInfo" but the "Countdown", "season"...etc have the same class as well.

How can I extract it?

Also if possible I'd really appreciate some good references that explain the method that you recommend as I'd ideally like to learn how to deal with these situations in the future when content I need extracted is wrapped inside different classes, divs...etc.

Recommended answer

If you have the raw HTML of the page you want to parse you can use a preg_match to find it.

If you don't have the HTML this should help you: How do I get the HTML code of a web page in PHP?
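
As a minimal sketch of that first step, one common way to get the raw HTML is file_get_contents. The helper name fetchHtml and the 10-second timeout are assumptions made for illustration, and fetching a remote URL this way requires allow_url_fopen to be enabled in the PHP configuration.

```php
<?php
// Hypothetical helper (not from the original answer): returns the page
// HTML as a string, or null if the request fails.
function fetchHtml(string $url): ?string
{
    // A timeout keeps a slow site from hanging the script.
    $context = stream_context_create([
        'http' => ['timeout' => 10],
    ]);

    // @ suppresses the warning; failure is reported via the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

The returned string can then be used as the $subject passed to preg_match.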

preg_match()

This function lets you parse a string with a regular expression pattern. It would be recommended to get only a fraction of the HTML to parse, not all the page. For example, in this case I would try to get the HTML of the first table (the one that doesn't have info of the previous episode).

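$subject = "the HTML of the url you want to parse";
// Capture the digits that follow the "Number:" cell.
$pattern = '/Number:<\/td><td.+?>(\d+)<\//';
if (preg_match($pattern, $subject, $hits)) {
    // $hits[0] is the whole match; $hits[1] is the first captured group.
    echo "Number: $hits[1]";
}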
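
The advice about parsing only a fraction of the page could be sketched as follows. The markup in $html is invented for illustration and is not the real page source; the sketch also assumes the tables are not nested, since a regular expression cannot reliably pair nested tags.

```php
<?php
// Hypothetical markup: two tables, the first holding the next episode.
$html = '<table><tr><td>Number:</td><td class="nextEpInfo">7</td></tr></table>'
      . '<table><tr><td>Number:</td><td class="nextEpInfo">6</td></tr></table>';

// Cut out the first table only. The non-greedy .*? stops at the first
// closing tag, and the s flag lets . match newlines as well.
$firstTable = null;
if (preg_match('/<table\b.*?<\/table>/s', $html, $m)) {
    $firstTable = $m[0];
}

// Run the episode pattern against that fragment instead of the whole page.
$number = null;
if ($firstTable !== null
    && preg_match('/Number:<\/td><td.+?>(\d+)<\//', $firstTable, $hits)) {
    $number = $hits[1];
}

echo "Number: $number\n";
```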

In case you don't know how a regular expression works:

'.' is a reserved character that means 'any character', the '+' right after it means 'one or more than one' and the '?' makes the regular expression non-greedy. So if we sum it up '.+?' means 'one or more of any character, but make it as short as possible'.
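
To see the difference the '?' makes, here is a small sketch that runs a greedy and a non-greedy pattern against the same string. The sample string is made up for the demonstration.

```php
<?php
$subject = '<td class="a">12</td><td class="b">34</td>';

// Greedy: .+ runs as far as it can, so the match ends at the last '>'.
preg_match('/<td.+>/', $subject, $greedy);

// Non-greedy: .+? stops at the first '>' that lets the pattern succeed.
preg_match('/<td.+?>/', $subject, $lazy);

echo $greedy[0], "\n"; // the whole string
echo $lazy[0], "\n";   // <td class="a">
```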

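'(' and ')' indicate that we want to retrieve what is between them, and '\d' means a digit. So '(\d+)' means 'capture that run of digits and put it in the $hits array'. With preg_match, the captured text ends up in $hits[1], while $hits[0] holds the entire matched string.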
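
A quick way to check what lands where in $hits is to run the pattern on a short string. The fragment below is a made-up stand-in for the real page markup.

```php
<?php
// Hypothetical fragment of the page around the episode number.
$subject = 'Number:</td><td class="nextEpInfo">12</td>';

preg_match('/Number:<\/td><td.+?>(\d+)<\//', $subject, $hits);

// Index 0 is everything the pattern matched, markup included.
$wholeMatch = $hits[0];

// Index 1 is only what the first pair of parentheses captured.
$episode = (int) $hits[1];

echo $wholeMatch, "\n";
echo $episode, "\n";
```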

If you use the same regular expression with preg_match_all, you retrieve every number on the page that follows that pattern; they will all be inside the $hits array.
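
A short sketch of the preg_match_all variant, again on invented markup. With the default flags, $hits[0] is an array of the full matches and $hits[1] is an array of the captured numbers, in the order they appear.

```php
<?php
// Hypothetical markup with two "Number:" cells.
$html = '<table><tr><td>Number:</td><td class="nextEpInfo">7</td></tr></table>'
      . '<table><tr><td>Number:</td><td class="nextEpInfo">6</td></tr></table>';

// preg_match_all returns how many times the pattern matched.
$count = preg_match_all('/Number:<\/td><td.+?>(\d+)<\//', $html, $hits);

// $hits[1] lists what the first capture group matched each time.
foreach ($hits[1] as $number) {
    echo "Number: $number\n";
}
```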
