Multilingual websites and web-crawlers

Question

I would like to have my website translated into several languages and take
advantage of language negotiation to let users choose their preferred
version of the site.

I read with great interest the invaluable information on
http://www.cs.tut.fi/~jkorpela/, http://ppewww.ph.gla.ac.uk/~flavell/www
and http://webtips.dan.info/, but I still have some questions (maybe I
overlooked the answers on these sites, though...).

First, my ISP runs an Apache server but as far as I can see, MultiViews
is not activated (I'm still waiting for some confirmation), so I decided
to use the type-map method: I associate with each of my pages a variant
file that directs Apache to the desired version of the page, which is in
turn sent back to the user agent. For example, cave.var points to
cave.fr.html and cave.en.html.

Then if the user agent asks for http://server/cave.var or
http://server/cave, it gets either cave.en.html or cave.fr.html according
to its language settings. Fine.

Now my cave.en.html and cave.fr.html contain links to other pages that
are themselves translated. The href attribute of the link is then a
generic one, say href="lascaux" instead of href="lascaux.var" (and
lascaux.var in turn points to lascaux.fr.html and lascaux.en.html).

Here is my problem: let's assume my home pages are cave.fr.html and
cave.en.html. I submit these pages to a web-crawler that is going to
analyze them, find a link to "lascaux" and try to scan a hypothetical
lascaux.html that doesn't exist, so that it will stop without indexing
either lascaux.fr.html or lascaux.en.html...

Am I wrong? If not, is there a workaround to tell a robot to scan all
pages even though there is no explicit reference to them in the HTML files
(apart from submitting all of them to the robot...)?

Thanks,

Vincent.
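
For readers unfamiliar with type maps, here is a minimal sketch of what a file such as cave.var might contain and how its records tie languages to files. The file contents are an assumption based on the record format described in Apache's mod_negotiation documentation (blank-line-separated records with `URI:` and `Content-language:` headers); they are not the poster's actual file. `parse_type_map` is a hypothetical helper written for illustration and is not part of Apache.

```python
# Hypothetical contents of cave.var (assumed, not taken from the thread).
CAVE_VAR = """\
URI: cave

URI: cave.en.html
Content-type: text/html
Content-language: en

URI: cave.fr.html
Content-type: text/html
Content-language: fr
"""


def parse_type_map(text):
    """Return {language: uri} for every record that declares a language."""
    variants = {}
    # Records are separated by blank lines; each line is "Header: value".
    for record in text.strip().split("\n\n"):
        headers = {}
        for line in record.splitlines():
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        if "content-language" in headers:
            variants[headers["content-language"]] = headers["uri"]
    return variants


print(parse_type_map(CAVE_VAR))
```

The point of the sketch is only that the generic name (cave) and the concrete files (cave.en.html, cave.fr.html) are tied together on the server side, so a client never needs to guess a file extension.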

Recommended answer

Vincent <vi************@wanadoo.fr> writes:
> Now my cave.en.html and cave.fr.html contain links to other pages that
> are themselves translated. The href attribute of the link is then a
> generic one, say href="lascaux" instead of href="lascaux.var" (and
> lascaux.var in turn points to lascaux.fr.html and lascaux.en.html).
>
> Here is my problem: let's assume my home pages are cave.fr.html and
> cave.en.html, I submit these pages to a web-crawler that is going to
> analyze them and find a link to "lascaux" and try to scan a
> hypothetical lascaux.html that doesn't exist so that it will stop
> without indexing lascaux.fr.html nor lascaux.en.html...

Hold on, why would a web crawler attempt to access lascaux.html on
finding a link to lascaux?

And why wouldn't lascaux->lascaux.var->lascaux.(en|fr).html work the
same for a web crawler as for any other user agent?

The only thing I can think of is if the crawler doesn't send any language
preferences, but shouldn't .var fall back to one of the others as
appropriate in that case, rather than arbitrarily redirecting to
lascaux.html?

Try it out with something like wget or telnet, where it's easy to set
custom headers for testing, but I don't think there should be a problem.

--
Chris
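
As an alternative to wget or telnet, the same experiment can be sketched in Python with the standard library. This is only a sketch under assumptions: `http://server` is the placeholder host from the question, the helper names are hypothetical, and the idea that the server reports the chosen variant in a `Content-Location` response header is something to verify against the real server rather than rely on.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def resolve_link(page_url, href):
    """Resolve an href relative to the page URL, as user agents do."""
    return urljoin(page_url, href)


def build_request(url, accept_language=None):
    """Build a GET request, optionally with an Accept-Language header."""
    headers = {}
    if accept_language is not None:
        headers["Accept-Language"] = accept_language
    return Request(url, headers=headers)


def fetch_content_location(request):
    """Send the request; return (status, Content-Location header or None)."""
    with urlopen(request) as response:
        return response.status, response.headers.get("Content-Location")


# The link href="lascaux" resolves to a URL with no ".html" appended.
target = resolve_link("http://server/cave.en.html", "lascaux")
print(target)

as_french_browser = build_request(target, "fr")
as_bare_crawler = build_request(target)  # no language preference at all

# Against a real host, compare the two responses:
# print(fetch_content_location(as_french_browser))
# print(fetch_content_location(as_bare_crawler))
```

Nothing is sent over the network unless the last two lines are uncommented and the placeholder host is replaced with a real one.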


On Mon, Sep 1, Vincent inscribed on the eternal scroll:
> First, my ISP runs an Apache server but as far as I can see, MultiViews
> is not activated (I'm still waiting for some confirmation) so I decided
> to use the type-map method:

I'm a bit puzzled as to why a server should be set up that allows you
to use typemaps but prevents you from using multiviews. But maybe
it came out that way without them really thinking about it - who
knows?

You could at least try sticking-in a .htaccess file to see what
happens. Here's a clue:

1. put some complete junk into a .htaccess file temporarily, and then
try accessing one of the pages that it controls. You should get a
server error. If you don't, then clearly the server is paying no
attention to the .htaccess, and there's nothing further you can do in
this direction.

2. take out the junk, and put in

  Options +MultiViews

instead. Try again. If you still get a server error, then it
suggests the server is configured to prohibit that directive in your
.htaccess. Too bad. If the server responds ok, on the other hand,
then you should be all set for MultiViews.
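
The two-step check above amounts to a small decision table, sketched here in Python. `diagnose_htaccess` is a hypothetical helper, not an Apache tool; it assumes a "server error" shows up as an HTTP 5xx status, which is the usual case but worth confirming on the server in question.

```python
def diagnose_htaccess(junk_status, multiviews_status=None):
    """Interpret the HTTP statuses observed in steps 1 and 2.

    junk_status:       status seen with junk in .htaccess (step 1)
    multiviews_status: status seen with "Options +MultiViews" (step 2),
                       or None if step 2 has not been run yet
    """
    if junk_status < 500:
        # Junk did not break anything, so the file is not being read.
        return "server ignores .htaccess"
    if multiviews_status is None:
        return ".htaccess is read; now try Options +MultiViews"
    if multiviews_status >= 500:
        return "Options directive is not permitted in .htaccess"
    return "MultiViews should be usable"


print(diagnose_htaccess(200))
print(diagnose_htaccess(500, 500))
print(diagnose_htaccess(500, 200))
```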
> I associate to each of my pages a variant
> file that directs Apache to the desired version of the page which is in
> turn sent back to the user agent, i.e. cave.var points to cave.fr.html
> and cave.en.html for example.

That's the idea, if you're using a typemap, yes.

> Then if the user agent asks for http://server/cave.var or
> http://serve/cave, it gets either cave.en.html or cave.fr.html according
> to its language settings. Fine.

Well, the user agent is only reasonably going to ask for the URLs
which you nominate in your links. (Or do you mean that there are
already other sites linking to your cave.fr.html etc. URLs
explicitly?)

> Now my cave.en.html and cave.fr.html contain links to other pages that
> are themselves translated. The href attribute of the link is then a
> generic one, say href="lascaux" instead of href="lascaux.var"
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which mechanism are you using to achieve that? MultiViews would do
that for you, but you're saying you aren't using it: mod_speling would
do it for you, I think, but by redirecting "lascaux" to "lascaux.var",
so you could save a network transaction by linking directly to
lascaux.var in the first place.

> (and lascaux.var in turn points to lascaux.fr.html and
> lascaux.en.html).

Sure...

> Here is my problem: let's assume my home pages are cave.fr.html and
> cave.en.html, I submit these pages to a web-crawler that is going to
> analyze them and find a link to "lascaux" and try to scan a hypothetical
> lascaux.html that doesn't exist so that it will stop without indexing
> lascaux.fr.html nor lascaux.en.html...





It would present the URL (which it got from your href="...", OK?) to
the server, and get whatever a browser would have got if it had
presented the same request.

Why do you suppose the indexer would append an unsolicited ".html"
to your URL? That would be improper of it!

The only issue is that the indexer might be making the request without
including Accept-language preferences. But surely you have a default
language set for such requests?

In any case, it's good practice to include explicit links to the other
language versions, so that readers can "switch" languages temporarily
if they want, without needing to reconfigure their browsers. So the
indexer should get to see links to all of the pages quite naturally as
it browses around your site.

good luck
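
To make the fallback point concrete, here is a deliberately simplified model of language negotiation in Python. It is an illustration only and not Apache's actual algorithm: the function names, the `VARIANTS` table and the choice of "en" as the default are all assumptions. It shows that a request without Accept-Language can still be answered with a real page, provided some default is defined.

```python
def parse_accept_language(header):
    """Return [(language_tag, q)] sorted by descending q; [] if no header."""
    prefs = []
    for part in (header or "").split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a tag without an explicit q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs.append((tag.strip().lower(), q))
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_variant(header, variants, default):
    """Pick the file for the best acceptable language, else the default."""
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        # Try the full tag first ("fr-fr"), then its primary part ("fr").
        for lang in (tag, tag.split("-")[0]):
            if lang in variants:
                return variants[lang]
    return variants[default]


# Hypothetical variant table for the lascaux page.
VARIANTS = {"en": "lascaux.en.html", "fr": "lascaux.fr.html"}

print(choose_variant("fr-FR,fr;q=0.9,en;q=0.5", VARIANTS, "en"))  # a browser
print(choose_variant(None, VARIANTS, "en"))  # a crawler sending no header
```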


Alan J. Flavell wrote:
> Options +MultiViews

This is what I tried, but I get the following message:
"Problem on restriction by .htaccess file
.htaccess file in this directory is not valid and cannot be interpreted
by the web server."

Maybe my ISP didn't AllowOverride for the Options directive? I sent
them a mail, but I'm still waiting for the answer...

If I remove the Options line, everything works fine...

>> Now my cave.en.html and cave.fr.html contain links to other pages that
>> are themselves translated. The href attribute of the link is then a
>> generic one, say href="lascaux" instead of href="lascaux.var"
>                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Which mechanism are you using to achieve that? MultiViews would do
> that for you, but you're saying you aren't using it





Well, I'm just using the type-map mechanism, i.e. I added to my
.htaccess file the line:
  AddHandler type-map .var
and then the result is the same whether I have a link with an
href="lascaux" or href="lascaux.var".
I don't know if this is the standard behaviour, but this is what I get...

After reading your explanations, I understand that I can safely use the
lascaux.var version of the link: what was unclear to me was the
behaviour of a web-crawler. From what you say, it's just another user
agent that gets the same results as a web browser would. Since the
lascaux.var version yields the correct result to my web browser, it will
also work with a robot: great.

> The only issue is that the indexer might be making the request without
> including Accept-language preferences. But surely you have a default
> language set for such requests?

This is achieved by using the DefaultLanguage directive, I guess?

> In any case, it's good practice to include explicit links to the other
> language versions, so that readers can "switch" languages temporarily
> if they want, without needing to reconfigure their browsers. So the
> indexer should get to see links to all of the pages quite naturally as
> it browses around your site.

Yes, but I have already read your site, so I knew this :-)

> good luck






Thanks

