Check link works and if not visually identify it as broken


Problem description

I am working on a project that lists file-sharing URLs from hosts such as Oron, filespost and depositfiles, and reports the sharing of copyrighted material to identified content owners and rights holders in my network.

To improve the service, which is currently a table populated from a MySQL database with some filters built into the PHP, I want to be able to identify the links that have ceased to function.

My thinking is that when the data is retrieved from the MySQL database, each entry in the download URL column (the URL of the file or file host page) is checked to see whether it leads to an actual file-sharing page that lets users start the download. If the link works and the file can be downloaded, it should be left as is, or the link text or cell colour turned green. If the file site displays "file not found" or similar, the link text or cell background colour should turn red.

At present there is no quick and easy visual representation of active or inactive links.

I have a simple validation of the URL based on whether a 404 error is received, but I quickly realised that won't work: these sites don't return a 404 or even redirect. Instead, they change the dynamically generated page to say the file is not available, the file has been removed, and so on.

I have also incorporated a link checker script that uses a third-party file-share link checking service, but this would require manual checks and manual updating of the database.

I have also tried looking for specific fields or words on the page, but given the range of sites and the even broader range of terms they use, this too has proven inaccurate and difficult to implement across all links.

It would also be helpful if URLs could then be filtered by their active status. I'm guessing that if the colour change were managed by a link class or cell class style, I could filter the column based on the class, e.g. link-dead or link-active. I think I can do this myself, so help with this last bit on class-based filtering is not necessarily required.

Any help would be greatly appreciated.

Recommended answer

As the sites you want to check are created by different people, there is unlikely to be a one-liner to detect if a link is broken or not over a vast number of sites.

I suggest that you create a simple function for each site that detects if the link is broken for that particular site. When you want to check a link, you would decide which function to run on the external site's HTML based on the domain name.

You can use parse_url() to extract the domain/host from the file links:

// Get your url from the database. Here I'll just set it:
$file_url_from_database = 'http://example.com/link/to/file?var=1&hello=world#file';

$parsed_link = parse_url($file_url_from_database);
$domain = $parsed_link['host']; // $domain now equals 'example.com'
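Stored URLs often differ in letter case or carry a leading "www.", so it may help to normalise the host before using it as a lookup key. A minimal sketch, assuming a hypothetical helper named link_host() that is not part of the original answer:

```php
<?php
// Hypothetical helper: returns a lower-cased host without a leading "www.",
// or null when the URL has no host part or cannot be parsed.
function link_host(string $url): ?string
{
    // With PHP_URL_HOST, parse_url() returns a string, null (no host) or false.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    // Strip one leading "www." so both spellings map to the same checker.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}
```

Returning null for unparseable input lets the caller treat such rows as "cannot check" instead of failing on a missing array index.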

You could store the function names in an associative array and call them that way:

function check_domain_com(){ ... }
function check_example_com(){ ... }

$link_checkers = array();
$link_checkers['domain.com'] = 'check_domain_com';
$link_checkers['example.com'] = 'check_example_com';

Or store the functions themselves in the array as closures (PHP >= 5.3):

$link_checkers = array();
$link_checkers['domain.com'] = function(){ ... };
$link_checkers['example.com'] = function(){ ... };

Then call the matching checker with:

if(isset($link_checkers[$domain]))
    // call the function stored under the index 'example.com'
    call_user_func($link_checkers[$domain]); 
else
    throw( new Exception("I don't know how to check the domain $domain") );
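To make the dispatch idea concrete, here is a rough sketch in which each checker receives the already-fetched HTML and returns a boolean. The host names, the "not found" phrases and the function name check_link_html() are all invented for illustration; real sites will need their own rules, and fetching the page (for example with cURL) is assumed to happen elsewhere.

```php
<?php
// Each checker gets the page HTML and returns true when the file looks available.
// The phrases below are made-up examples, not taken from any real site.
$link_checkers = [
    'domain.com'  => function (string $html): bool {
        return stripos($html, 'File not found') === false;
    },
    'example.com' => function (string $html): bool {
        return stripos($html, 'has been removed') === false;
    },
];

// Returns true/false from the matching checker, or null when no checker
// is registered for the URL's host.
function check_link_html(array $checkers, string $url, string $html): ?bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || !isset($checkers[$host])) {
        return null;
    }
    return (bool) call_user_func($checkers[$host], $html);
}
```

Returning null for unknown hosts, rather than throwing, is one possible design choice: it lets a table row be shown as "unchecked" while the list of checkers is still being built up.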

Alternatively, you could just use a series of if statements:

if($domain == 'domain.com')
    check_domain_com();
else if($domain == 'example.com')
    check_example_com(); // this function is called

The functions could return a boolean (true or false; 0 or 1) to use, or call another function themselves if needed (for example to add an extra CSS class to broken links).
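As an illustration of turning that boolean into the visual cue the question asks for, the sketch below maps a status to the link-active and link-dead class names mentioned in the question. The function names and the extra link-unknown class are assumptions made for this example.

```php
<?php
// Maps a stored status to a CSS class. null means "not checked yet";
// "link-unknown" is an invented class name for that case.
function link_status_class(?bool $active): string
{
    if ($active === null) {
        return 'link-unknown';
    }
    return $active ? 'link-active' : 'link-dead';
}

// Builds one table cell; htmlspecialchars() escapes the URL for safe output.
function render_link_cell(string $url, ?bool $active): string
{
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return sprintf(
        '<td class="%s"><a href="%s">%s</a></td>',
        link_status_class($active),
        $safe,
        $safe
    );
}
```

With the class on the cell, the green and red colours can live in the stylesheet, and the same class names can drive the filtering described in the question.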

I did something similar recently, though I was fetching metadata for stock photography from multiple sites. I used an abstract class because I had a few functions to run for each site.
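The abstract-class approach might look roughly like the sketch below. This is not the class from the stock photography project, which is not shown here; the class names, method names and the "file not found" phrase are all invented to illustrate the pattern.

```php
<?php
// Shared steps live in the base class; each host only supplies its own rule.
abstract class LinkChecker
{
    // Host-specific: does this page's HTML indicate the file is gone?
    abstract protected function looksDead(string $html): bool;

    // Shared: an empty response counts as dead, otherwise ask the subclass.
    public function isActive(string $html): bool
    {
        if (trim($html) === '') {
            return false;
        }
        return !$this->looksDead($html);
    }
}

// Invented rule for an example host; the phrase is an assumption.
class ExampleComChecker extends LinkChecker
{
    protected function looksDead(string $html): bool
    {
        return stripos($html, 'file not found') !== false;
    }
}
```

Compared with plain functions, this keeps any logic common to all hosts in one place, at the cost of one small class per site.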

As a side note, it would be wise to store the last checked date in your database and limit the checking rate to something like 24 or 48 hours (or further apart depending on your needs).
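The rate limit can be expressed as a small predicate over the stored timestamp. A minimal sketch, assuming the last checked date is kept as a Unix timestamp (or null for links that were never checked); the function name is_due_for_check() is hypothetical.

```php
<?php
// Returns true when a link has never been checked, or when its last check
// happened at least $minHours ago. Times are Unix timestamps in seconds.
function is_due_for_check(?int $lastChecked, int $now, int $minHours = 24): bool
{
    if ($lastChecked === null) {
        return true;
    }
    return ($now - $lastChecked) >= $minHours * 3600;
}
```

Passing the current time in as a parameter, instead of calling time() inside the function, keeps the rule easy to test and lets one batch run use a single consistent "now".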

Edit to clarify implementation:

As making HTTP requests to other websites is potentially very slow, you will want to check and update link statuses independently of page loads. One way to achieve this:

  • A script could run every 12 hours and check all links from the database that were last checked more than 24 hours ago. For each 'old' link, it would update the active and last_checked columns in your database appropriately.
  • When someone requests a page, your script would read from the active column in your database instead of downloading the remote page to check every time.
  • (extra thought) When a new link is submitted, it is checked immediately in the script, or added to a queue to be checked by the server as soon as possible.
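The scheduled pass described in the list above could be sketched as follows. To stay self-contained, the rows are plain arrays standing in for database records, and the checker is any callable that takes a URL; in a real script the rows would be read from and written back to MySQL. The function name refresh_stale_links() and the array keys other than active and last_checked are assumptions.

```php
<?php
// $links: rows with 'url', 'active' and 'last_checked' (Unix time or null).
// $check: callable that takes a URL and returns true when the link still works.
// Rows checked less than $maxAgeHours ago are returned unchanged.
function refresh_stale_links(array $links, callable $check, int $now, int $maxAgeHours = 24): array
{
    foreach ($links as $i => $row) {
        $last = $row['last_checked'];
        if ($last !== null && ($now - $last) < $maxAgeHours * 3600) {
            continue; // checked recently, leave the row untouched
        }
        $links[$i]['active'] = (bool) $check($row['url']);
        $links[$i]['last_checked'] = $now;
    }
    return $links;
}
```

The page that renders the table then only reads the stored active value, so no remote request happens during a page load.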

As people can easily click a link to check its current state, it would be redundant to let them click a button on your page to run the check (nothing against the idea, though).

Note that the potentially resource-heavy update-all script should not be executable (accessible) via web.
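One common way to enforce this is to keep the script outside the web root and, as a second line of defence, have it refuse to run unless it was started from the command line. A minimal sketch; the helper name is_cli() is an assumption, while PHP_SAPI is the built-in constant that names the current server API.

```php
<?php
// True only when the given SAPI name is the command-line interface.
function is_cli(string $sapi): bool
{
    return $sapi === 'cli';
}

// At the top of the update-all script: stop if it was requested over HTTP.
if (!is_cli(PHP_SAPI)) {
    http_response_code(403);
    exit('This script can only be run from the command line.');
}
```

The script can then be scheduled with cron or a similar tool, which invokes PHP through the command line.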
