How some sites with fake links show up in search engine results


Problem description


These days I come across several Google search results that contain sites with links that exactly match my search words. How is it possible for the sites to dynamically change their content, or rather, how are they fooling Google into indexing their pages for my keywords? I've read about content farms, but that doesn't seem to be the right answer. Can someone let me know what this technique is called? I'll try to understand more about it.

Solution

My understanding is that the only way to get on Google or any other indexing engine is to have the robot actually crawl your site and generate results. Obviously, Google can crawl dynamic sites; however, I find this to be an evolutionary rather than revolutionary change with regard to your question.

What I think is happening behind the scenes is the combination of these things:

  • Content index
  • Prepared index
  • User submitted content
  • Referrer search updates

I'll try to explain each of these on a fictional site that sells music - you have plenty of examples to compare the experience. It will of course be on the example.com domain.

Content index

Obviously, as a site that wants to offer something, you actually have some content. Usually, you group this content somehow. Let's assume our music site can group content by different categories:

  • Author
  • Music genre
  • User submitted
  • Content ratings

Each of these can be represented abstractly as a tag. For example, our site could choose to have example.com/tags/eagles to represent Eagles or example.com/tags/rock to represent all rock bands. Google would be able to index these, so any potential search could yield a link to our site.
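The tag scheme above can be sketched as a simple slug-to-URL mapping; `slugify` and `make_tag_url` are hypothetical helper names, not anything from a real site:

```python
import re

def slugify(name):
    """Lowercase a name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def make_tag_url(name, base="https://example.com/tags/"):
    """Turn any content grouping (author, genre, rating...) into a crawlable tag URL."""
    return base + slugify(name)

print(make_tag_url("Eagles"))      # https://example.com/tags/eagles
print(make_tag_url("Rock music"))  # https://example.com/tags/rock-music
```

Every grouping the site already has thus becomes one more indexable page for Googlebot to find.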

Prepared index

Prepared index is similar, but is a generic index instead of real content. This can be prepared in several ways, such as:

  • Take a dictionary and add all words
  • Crawl a few million pages from the Web (possibly using links provided by search engines!) and get often repeated phrases from there
  • Grab content from free forums
  • Use Wikipedia
  • Get text from freely available books, such as those from Project Gutenberg

Our site would, for example, get any words from texts that are related to music in any way and make tags similar to the previous ones. E.g. just by crawling the Rock music page on Wikipedia, you can get a lot of tags.
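The "often repeated phrases" idea can be sketched as a word-frequency pass over a crawled text; `prepared_tags` and its thresholds are made up for illustration:

```python
import re
from collections import Counter

def prepared_tags(text, min_count=2, min_len=4):
    """Extract frequently repeated words from a crawled corpus (e.g. a
    Wikipedia page) as candidate tags - no real content required."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    return {w for w, c in counts.items() if c >= min_count}

# A toy "crawled" snippet standing in for the Rock music article:
corpus = ("rock music is a broad genre of popular music; "
          "rock bands and rock guitar dominate the genre")
print(sorted(prepared_tags(corpus)))  # ['genre', 'music', 'rock']
```

Each surviving word gets its own example.com/tags/... page, even though the site never wrote a line of prose about it.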

User submitted content

This is something that usually comes after your site is up and running. Let's say that we put a search box on our site and then users come in and type "rock music". Doh, we already knew that, so nothing good from that search. However, let's say we go through our Web server logs and see some searches for langeleik. Now, that would be something we might not have indexed before. Cool, we just generated another tag on our site.

Obviously, Google doesn't know that - so we create an entry in our sitemap and it's there after another Googlebot crawl. When a user searches on Google for "langeleik", one of the links might be a link to example.com/tags/langeleik.
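Mining the server logs for unknown search terms might look like this sketch; the `/search?q=` endpoint and `mine_search_logs` are assumptions for the fictional site:

```python
import re

known_tags = {"rock", "music"}

def mine_search_logs(log_lines, known=known_tags):
    """Scan access-log lines for on-site searches (assuming a hypothetical
    /search?q= endpoint) and return query terms we have no tag for yet."""
    new = set()
    for line in log_lines:
        m = re.search(r"GET /search\?q=([\w+]+)", line)
        if m:
            new.update(t.lower() for t in m.group(1).split("+")
                       if t.lower() not in known)
    return new

logs = ['1.2.3.4 - - "GET /search?q=rock+music HTTP/1.1" 200',
        '5.6.7.8 - - "GET /search?q=langeleik HTTP/1.1" 200']
print(mine_search_logs(logs))  # {'langeleik'}
```

Anything the function returns becomes a new tag page and a new sitemap entry.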

There are other and possibly far more valuable forms of user input - comments, forum posts, etc. Hence the reason there are many generic forums that have no other purpose except hosting forums. It's a great data source and you get new content for free.

In the end, all this should go into your site's sitemap. You can have huge sitemaps.
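Feeding the collected tags to Googlebot then amounts to emitting a sitemap; this is a minimal sketch of the sitemaps.org XML format, with `build_sitemap` as a made-up helper:

```python
def build_sitemap(tags, base="https://example.com/tags/"):
    """Emit a minimal sitemap.xml with one <url> entry per tag page,
    following the sitemaps.org protocol."""
    entries = "\n".join(f"  <url><loc>{base}{t}</loc></url>"
                        for t in sorted(tags))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + entries + "\n</urlset>")

print(build_sitemap({"eagles", "rock", "langeleik"}))
```

Real sites would split this across many sitemap files listed in a sitemap index, since the protocol caps each file at 50,000 URLs.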

Referrals

The last thing is referrals. Again, after your site is up and running, some of the Google searches will come directly to you. That's when you can take advantage of the HTTP Referer header (yes, it's a misspelling - check it out on Wikipedia).

Note that Google search is both:

  • Incomplete
  • Fuzzy

Thus, you can search for "langeleik" above, but some of the links have a title of e.g. "Langeleik and Harpe". Nothing unusual, but note also the reverse - if you search for "langeleik and harpe", it will not only find all pages with both terms, but also pages with one or the other. If we know about harpe, but not about langeleik, and somebody searches for "langeleik and harpe", we will get through the HTTP Referer header a q parameter such as q=langeleik+harpe. Cool - we just got another word to add to our sitemap, if we want.
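Pulling the terms out of such a Referer header is a one-liner with the standard library; `terms_from_referer` is a hypothetical name for the sketch:

```python
from urllib.parse import urlparse, parse_qs

def terms_from_referer(referer):
    """Extract the search terms from the q parameter of a
    Google-style Referer URL ('+' decodes to a space)."""
    q = parse_qs(urlparse(referer).query).get("q", [""])[0]
    return q.split()

ref = "https://www.google.com/search?q=langeleik+harpe"
print(terms_from_referer(ref))  # ['langeleik', 'harpe']
```

Any term not yet in the tag set can then go straight into the sitemap, exactly as with the on-site search logs.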

As for fuzziness, note that when you search for "eagles", you can get everything from birds through NFL teams to a rock band. Thus, even though we are a music site, we might expand our horizon (if desired) to latest NFL news - something totally unrelated and very useful for some sites.

Conclusion - it's an illusion

I consider the combination of all these a very rich sitemap building source. You can very easily generate millions of unique tags using the above techniques. Thus, "anything" you type will be found on example.com/tags.

However, you have to note that this is just an illusion. For example, if you search for "ertfghedctgb" (easily typed on regular QWERTY keyboard - ert + fgh + edc + tgb), you will most likely not get anything from Google (I do not currently). It just was not common enough for anybody to put this in their sitemaps (or not common enough for search engines to index it).

