Get language of a website using simple html dom
Question
I am building a search engine and web crawler using PHP, and I would like to detect the language of a website. How would I detect the language of a page by:
- checking the URL for a language parameter, e.g. https://twitter.com/?lang=jap
If that is not set, then I would like to:
- check the domain in the URL, e.g. https://www.google.co.jp/
If I still can't find anything, I would like to default to English.
The code I have so far for scraping pages is:
function crawl($url){
    // Declare the global before its first use, not inside the loop
    global $weblinks;
    $html = file_get_html($url);
    if($html && is_object($html) && isset($html->nodes)){
        $weblinks[]=$url;
        $base_url = parse_url($url, PHP_URL_HOST);
        foreach($html->find('a') as $element) {
            $link = $element->href;
            if(substr($link,0,7)=="http://" || substr($link,0,8)=="https://"){
                // already absolute, keep as-is
            }else if(substr($link,0,2)=="//"){
                // protocol-relative: strip the slashes, the scheme is added below
                $link = substr($link, 2);
            }else if(substr($link,0,1)=="#"){
                // fragment link: points back to the current page
                $link = $url;
            }else if(substr($link,0,7)=="mailto:" || substr($link,0,11)=="javascript:"){
                // not a crawlable page
                $link = "";
            }else{
                // relative link: prefix it with the host
                if(substr($link, 0, 1) != "/"){
                    $link = $base_url."/".$link;
                }else{
                    $link = $base_url . $link;
                }
            }
            // add the scheme of the current page if the link still has none
            if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && $link != ""){
                if(substr($url, 0, 8) == "https://"){
                    $link = "https://".$link;
                }else{
                    $link = "http://".$link;
                }
            }
            if($link != "" && !in_array($link, $weblinks)){
                $weblinks[]=$link;
            }
        }
        $html->clear();
    }
}
function info($weblinks){
    foreach($weblinks as $link) {
        $linkhtml = file_get_html($link);
        if($linkhtml && is_object($linkhtml) && isset($linkhtml->nodes)){
            // find() returns null when the element is missing, so guard before reading properties
            $titleraw = $linkhtml->find('title',0);
            $title = $titleraw ? $titleraw->innertext : "";
            $desraw = $linkhtml->find("meta[name='description']",0);
            $des = $desraw ? $desraw->content : "";
            //detect language here
            echo "<tr><td>".$title."</td><td>".$link."</td><td>".$des."</td></tr>";
            $sql = mysql_query("INSERT into web once");
            $linkhtml->clear();
        }
    }
}
Recommended answer
To get the language from ?lang=:
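$url = 'www.domain.org?lang=IT';
$url_parts = parse_url($url);
// parse_str() returns nothing; it fills $params from the 'query' component
parse_str(isset($url_parts['query']) ? $url_parts['query'] : '', $params);
$lang = isset($params['lang']) ? strtoupper($params['lang']) : FALSE;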
You should then validate this with a switch/case statement and a list of the languages that you support, like this:
switch ($lang) {
    case 'EN':
        //language is English
        break;
    case 'IT':
        //language is Italian
        break;
    case 'FR':
        //language is French
        break;
    default:
        //?lang query was empty, or contained an unsupported language
        $lang = FALSE;
} //end switch
After that, you can use this logic to determine whether you need to check the URL for the language:
if ($lang == FALSE) {
//code to determine language from TLD
}
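The TLD check left as a comment above could be sketched roughly as follows. The function name langFromTld and the mapping table are hypothetical and deliberately incomplete. A country-code TLD is only a hint (a .ch or .be site may be in any of several languages), so the result should be treated as a guess rather than a fact.

```php
<?php
// Hypothetical helper: guess a language code from the country-code TLD of a URL.
// The mapping below is an illustrative assumption, not an exhaustive list.
function langFromTld($url) {
    $tldToLang = array(
        'jp' => 'JA',
        'it' => 'IT',
        'fr' => 'FR',
        'de' => 'DE',
        'uk' => 'EN',
    );
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return FALSE; // no host found, e.g. the URL has no scheme
    }
    $labels = explode('.', strtolower($host));
    $tld = end($labels); // last label of the host name
    return isset($tldToLang[$tld]) ? $tldToLang[$tld] : FALSE;
}

$lang = langFromTld('https://www.google.co.jp/');
if ($lang === FALSE) {
    $lang = 'EN'; // default to English, as the question asks
}
```

Generic TLDs such as .com or .org carry no language information, so they fall through to the default here.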
Hopefully this will help you get started, although this is a big can of worms you're opening up. In addition to what you've mentioned, there are other things you need to check to be certain of a website's language. One of them is the language meta tag, which looks like this: <meta name="language" content="english">
It goes in the head of the webpage, though not all websites use it.
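Reading that meta tag might look something like the sketch below. To keep it runnable without extra libraries it uses PHP's bundled DOMDocument rather than simple html dom, and the function name metaLanguage is hypothetical. It also falls back to the lang attribute on the html element, which is an addition not mentioned above.

```php
<?php
// Sketch: read a declared language from an HTML string using PHP's bundled DOM extension.
// metaLanguage() is a hypothetical name; it returns the raw declared value (lowercased) or FALSE.
function metaLanguage($htmlSource) {
    if (trim($htmlSource) === '') {
        return FALSE; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($htmlSource);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return FALSE;
    }
    // 1. <meta name="language" content="...">
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'language') {
            $content = trim($meta->getAttribute('content'));
            if ($content !== '') {
                return strtolower($content);
            }
        }
    }
    // 2. <html lang="..."> as a fallback (an assumption added for this sketch)
    $root = $doc->documentElement;
    if ($root !== null && trim($root->getAttribute('lang')) !== '') {
        return strtolower(trim($root->getAttribute('lang')));
    }
    return FALSE;
}

$declared = metaLanguage('<html><head><meta name="language" content="english"></head><body></body></html>');
```

Declared values vary from site to site ("english", "en", "en-US"), so they would still need normalising before being compared against a list of supported languages. With simple html dom, the equivalent lookup would presumably follow the same pattern the question already uses for the description tag, i.e. find("meta[name='language']", 0) followed by a null check.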
Some multilingual websites, like mine, use a subdomain such as http://it.website.com or http://fr.website.com
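A subdomain check along these lines could be sketched as below. The function name langFromSubdomain and the supported-language list are assumptions made for illustration. The first host label is only accepted if it matches a supported code, so that ordinary subdomains such as www are ignored.

```php
<?php
// Hypothetical helper: treat the first host label as a language code if it is supported.
function langFromSubdomain($url, array $supported) {
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return FALSE; // no host found, e.g. the URL has no scheme
    }
    $labels = explode('.', strtolower($host));
    // Need at least sub.domain.tld for the first label to be a subdomain.
    if (count($labels) < 3) {
        return FALSE;
    }
    $candidate = strtoupper($labels[0]);
    return in_array($candidate, $supported, true) ? $candidate : FALSE;
}

$supported = array('EN', 'IT', 'FR');
$lang = langFromSubdomain('http://it.website.com', $supported);
```

This simple label count misreads hosts under two-part suffixes (for example, a registered domain named "it" under .co.uk), so a more careful version would need a list of public suffixes.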
Others use query strings other than ?lang=, so you'll need to do a significant amount of research to cover all your bases.