Search engines continue to ignore LANG markup


Problem description

I have three test pages that are marked as Italian, Spanish, and
Portuguese, respectively, by

Content-Language: it
<html lang="it">
<body lang="it">

and the same for "es" and "pt".

Yahoo regards all three pages as Italian:
http://search.yahoo.com/search?p=%22...l=1&vl=lang_it

Google regards one as English (What??) and two as Spanish:
http://www.google.com/search?q=%22id...%22&lr=lang_en
http://www.google.com/search?q=%22id...%22&lr=lang_es

:-(

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
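
The markup-level part of the declaration described above can be read back out of a page mechanically. The following is a minimal sketch, assuming only Python's standard `html.parser` module; the class name `LangCollector`, the function `declared_languages`, and the sample page are hypothetical and are not taken from the test pages in the post. It says nothing about how Yahoo or Google actually process `lang`.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect the lang attributes declared on <html> and <body>."""

    def __init__(self):
        super().__init__()
        self.declared = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag in ("html", "body"):
            for name, value in attrs:
                if name == "lang" and value:
                    self.declared[tag] = value.lower()


def declared_languages(markup):
    """Return a dict such as {'html': 'it', 'body': 'it'}."""
    collector = LangCollector()
    collector.feed(markup)
    return collector.declared


# Hypothetical sample page, not one of the test pages from the post.
page = (
    '<html lang="it"><head><title>t</title></head>'
    '<body lang="it"><p>casa</p></body></html>'
)
print(declared_languages(page))
```

The `Content-Language` value is an HTTP response header, so it would have to be read from the server response rather than from the markup.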

Solution

Andreas Prilop <An***************@trashmail.net> wrote:

> I have three test pages that are marked as Italian, Spanish,
> Portuguese, resp. by
>
> Content-Language: it
> <html lang="it">
> <body lang="it">
>
> and the same for "es" and "pt".
>
> Yahoo regards all three pages as Italian:
> http://search.yahoo.com/search?p=%22...l=1&vl=lang_it
>
> Google regards one as English (What??) and two as Spanish:
> http://www.google.com/search?q=%22id...%22&lr=lang_en
> http://www.google.com/search?q=%22id...%22&lr=lang_es

I'd be surprised if author-provided metadata like language info on the
web were broadly reliable. I'd expect better results from using
heuristics to determine a document's language, so I'd expect SEs to use
heuristics; it serves their users better.

I don't speak any of the test languages, but comparing two of the test
pages, it seems to me that they do not contain words that are
characteristic of each language. In fact, the content appears to be
chosen to confuse heuristic guessing.

The choice of using a list of words instead of natural language probably
also hinders heuristic guessing, since it makes it impossible to use
context to tell apart similar words in the various languages.

--
Spartanicus
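
The point about heuristics needing characteristic words can be illustrated with a toy scorer. This is a minimal sketch, assuming a simple function-word count; the word sets, the function `score_languages`, and both sample inputs are hypothetical, and nothing here reflects what any search engine actually does.

```python
import re

# Hypothetical, deliberately tiny sets of common function words.
FUNCTION_WORDS = {
    "it": {"il", "che", "non", "di", "per", "sono", "della"},
    "es": {"el", "que", "los", "las", "por", "y", "del"},
    "pt": {"o", "que", "não", "os", "uma", "para", "com"},
}


def score_languages(text):
    """Count how many tokens of text are function words of each language."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {
        lang: sum(token in words for token in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }


# A natural sentence carries function words that separate the languages.
print(score_languages("El perro y los gatos duermen por la tarde"))

# A bare list of nouns spelled alike in all three languages gives the
# scorer nothing to work with: every language scores zero.
print(score_languages("casa problema sistema tema panorama banana"))
```

With running text the Spanish score stands out; with the bare word list all three scores tie at zero, which is the situation a list of shared words creates for this kind of guessing.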


On Wed, 28 Feb 2007, Spartanicus wrote:

> I'd be surprised if author provided meta data like language info on the
> web was broadly reliable.

Mostly, LANG markup is *missing* from documents. However, if the author
supplies LANG markup, it should be taken as ... well ... authoritative.
The author knows best in which language he writes.

> I'd expect better results from using
> heuristics to determine a document's language. So I'd expect SEs to use
> heuristics, it serves their users better.

That's the same argument used by Internet Explorer 6:

| The server sends "text/plain" but I take "text/html"
| because it seems to make more sense to me.

They can still guess when LANG markup is *missing*.

> in fact the content appears to be chosen to confuse heuristic guessing.

Exactly.

> The choice of using a list of words instead of natural language probably
> also hinders heuristic guessing since it makes it impossible to use
> context for similar words in the various languages.

But only with such a list of words can you use different LANG
parameters. All the words exist in Italian, Spanish, and Portuguese.
Each page could be IT or ES or PT.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/sear...Alan.J.Flavell
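
The policy argued for in this reply (treat a declared language as authoritative, and guess only when it is missing) can be written down in a few lines. This is a minimal sketch of that policy only; the function `effective_language` and its parameters are hypothetical, and the thread does not establish that any search engine works this way.

```python
from typing import Callable, Optional


def effective_language(
    declared: Optional[str],
    text: str,
    guess: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Prefer the author's declaration; fall back to a guess if absent."""
    if declared and declared.strip():
        # The author supplied LANG markup: take it as authoritative.
        return declared.strip().lower()
    # No declaration: only now is heuristic guessing consulted.
    return guess(text)


# Hypothetical guesser that always answers "it", to show the precedence.
always_italian = lambda text: "it"

print(effective_language("es", "casa problema sistema", always_italian))
print(effective_language(None, "casa problema sistema", always_italian))
```

The first call returns the declared value and never consults the guesser; the second has no declaration and falls back to the guess.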


Andreas Prilop wrote:

> I have three test pages that are marked as Italian, Spanish,
> Portuguese, resp. by
>
> Content-Language: it
> <html lang="it">
> <body lang="it">
>
> and the same for "es" and "pt".
>
> Yahoo regards all three pages as Italian:
> http://search.yahoo.com/search?p=%22...l=1&vl=lang_it
>
> Google regards one as English (What??) and two as Spanish:
> http://www.google.com/search?q=%22id...%22&lr=lang_en
> http://www.google.com/search?q=%22id...%22&lr=lang_es
>
> :-(

Shouldn't you have <meta lang="it"/> in the head rather than specifying
the language of elements?
--
am

laurus : rhodophyta : brethoneg : smalltalk : stargate

--
Posted via a free Usenet account from http://www.teranews.com

