How do search engines find relevant content?

Views: 132 | Published: 2020/5/25 0:56:43 | Tags: php, parsing, screen-scraping, relevance

Question

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would they use to find the most relevant content on a web page?

My thought is that it would search for all paragraphs, order them by length, and then, from possible search strings and query params, work out a relevance percentage for each paragraph.

Let's say we have the following URL:

http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html

From that URL I would work out that the HTML file name is highly relevant, so I would then see how closely that string matches each of the paragraphs in the page.
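The file-name idea can be sketched roughly as follows. This is only an illustration in Python rather than PHP; the function names `slug_keywords` and `slug_overlap` are made up for this example, and plain word overlap is just one assumed way of measuring how close the file name is to a paragraph.

```python
import re
from urllib.parse import urlparse


def slug_keywords(url):
    """Split the last path segment of a URL into lowercase keywords."""
    path = urlparse(url).path
    filename = path.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]  # drop an extension such as .html
    return [w for w in re.split(r"[^a-z0-9]+", stem.lower()) if w]


def slug_overlap(url, paragraph):
    """Fraction of slug keywords that also occur in the paragraph (0.0 to 1.0)."""
    keywords = slug_keywords(url)
    if not keywords:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
    return sum(1 for k in keywords if k in words) / len(keywords)
```

Calling `slug_overlap(url, text)` for every paragraph gives one number per paragraph that can be used for ordering. A real system would probably drop stop words such as "the" first.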

A really good example of this is Facebook share. When you share a page, Facebook quickly sends a bot to the link and brings back images, content, and so on.

I was thinking that some sort of calculative method would be best, working out the percentage of relevancy from the surrounding elements and metadata.

Are there any books or other information on the best practices of content parsing that cover how to get the best content from a site, any algorithms that may be talked about, or any in-depth reply?

Some ideas that I have in mind are:

Find all paragraphs and order by plain text length
Somehow find the width and height of div containers and order by (W+H) - @Benoit
Check meta keywords, title, and description, and check their relevancy within the paragraphs
Find all image tags and order by largest, and by how many nodes away they are from the main paragraph
Check for object data, such as videos, and count the nodes from the largest paragraph / content div
Work out resemblances from previous pages parsed
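The first idea in the list (paragraphs ordered by plain text length) could look something like the sketch below. It uses Python's standard `html.parser` module as a stand-in for PHP's DOM library; the class name `ParagraphCollector` and the helper `paragraphs_by_length` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the plain text of every <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # an unclosed <p> ends where the next one starts
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def flush_paragraph(self):
        """Store the text gathered so far (whitespace collapsed) and reset."""
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._inside = False


def paragraphs_by_length(html):
    """Return paragraph texts, longest first."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a trailing <p> that was never closed
    return sorted(parser.paragraphs, key=len, reverse=True)
```

Text inside inline tags such as `<b>` is kept, because only `<p>` boundaries start or end a paragraph here. Length alone is a weak signal, so this would normally be just one input among the others in the list.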

The reason I need this information:

I'm building a website where webmasters send us links and we list their pages. I want the webmaster to submit a link, and then I crawl that page to find the following information:

An image (if applicable)
A snippet of fewer than 255 characters from the best piece of text
Keywords that would be used for our search engine (Stack Overflow style)
Metadata: keywords, description, all images, change log (for moderation and administration purposes)
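A minimal sketch of collecting that information from a fetched page is shown below, again in Python with the standard `html.parser` module instead of PHP's DOM library. `PageSummary` and `snippet` are hypothetical names, and cutting the snippet at the last space inside the limit is an assumption about what a "good" snippet is.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, named <meta> values and image URLs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}    # e.g. {"keywords": "...", "description": "..."}
        self.images = []  # src attribute of every <img> that has one
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def snippet(text, limit=255):
    """Return text shortened to fewer than `limit` characters, cut at the last space."""
    text = " ".join(text.split())
    if len(text) < limit:
        return text
    cut = text[:limit - 1]
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut
```

The `snippet` helper would be fed the text of whichever paragraph ends up ranked best; choosing that paragraph is the actual relevance problem and is not solved by this sketch.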

I hope you can understand that this is not for a search engine, but the way search engines tackle content discovery is in the same context as what I need it for.

I'm not asking for trade secrets; I'm asking what your personal approach to this would be.

Recommended answer

This is a very general question but a very nice topic! Definitely upvoted :) However, I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer on this.

The reason I am not satisfied is that the answers are basically all true (I especially like the answer of kovshenin (+1), which is very graph theory related...), but they are all either too specific on certain factors or too general.

It's like asking how to bake a cake and getting the following answers:

You make a cake and you put it in the oven.
You definitely need sugar in it!
What is a cake?
The cake is a lie!

You won't be satisfied, because you want to know what makes a good cake. And of course there are a lot of recipes.

Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.

For example, a search engine for discovering new independent music artists may put a malus (a penalty) on artists' websites that have a lot of external links pointing in.

A mainstream search engine will probably do the exact opposite to provide you with "relevant results".
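To make "weight them differently" concrete, here is a tiny sketch of a weighted sum of factors. The factor names, the factor values and both weight profiles are invented purely for illustration and do not describe any real search engine.

```python
def combined_score(factors, weights):
    """Weighted sum of factor values; factors without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in factors.items())


# Hypothetical factor values for one page, each scaled to the range 0..1.
page = {"keyword_match": 0.8, "inbound_links": 0.9}

# Two hypothetical engines looking at the same page.
mainstream = {"keyword_match": 1.0, "inbound_links": 1.0}
indie_music = {"keyword_match": 1.0, "inbound_links": -0.5}  # malus for many links in

mainstream_score = combined_score(page, mainstream)
indie_score = combined_score(page, indie_music)
```

The same page scores higher with the `mainstream` profile than with the `indie_music` profile, because the second one turns inbound links into a penalty.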

There are (as already said) over 200 factors that are published by Google, so webmasters know how to optimize their websites. There are very likely many more that the public is not aware of (in Google's case).

But under the very broad and abstract term SEO optimization, you can generally break the important ones apart into two groups:

How well does the answer match the question? Or: How well does the page's content match the search terms?

How popular/good is the answer? Or: What's the pagerank?

In both cases the important thing is that I am not talking about whole websites or domains; I am talking about single pages with a unique URL.

It's also important that pagerank doesn't represent all factors, only the ones that Google categorizes as popularity. And by "good" I mean other factors that have nothing to do with popularity.

In the case of Google, the official statement is that they want to give relevant results to the user, meaning that all algorithms will be optimized towards what the user wants.

So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider to be very important (at the moment):

Category 1 (How well does the answer match the question?)

You will notice that a lot comes down to the structure of the document!

The page primarily deals with the exact question.

Meaning: the question words appear in the page's title text or in headings or paragraphs. The same goes for the position of these keywords: the earlier in the page, the better. Being repeated often helps as well (if not too often, which goes under the name of keyword stuffing).
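As a rough sketch of how such on-page signals could be turned into a number: the function below rewards a keyword in the title and headings, an early first occurrence, and repetition up to a cap, and it penalises density above the cap. All weights, the `max_density` threshold and the function name `on_page_score` are assumptions made up for this example.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def on_page_score(keyword, title, headings, body, max_density=0.05):
    """Score one keyword against a page's title, headings and body text."""
    keyword = keyword.lower()
    body_words = tokenize(body)
    score = 0.0
    if keyword in tokenize(title):
        score += 2.0
    if any(keyword in tokenize(h) for h in headings):
        score += 1.0
    if keyword in body_words:
        first = body_words.index(keyword)
        score += 1.0 - first / len(body_words)  # the earlier, the better
        density = body_words.count(keyword) / len(body_words)
        if density <= max_density:
            score += density / max_density      # repetition helps up to the cap
        else:
            score -= 1.0                        # crude keyword-stuffing penalty
    return score
```

With these made-up weights, a page that mentions the keyword once early in a longer text scores higher than a page consisting of nothing but the keyword, even if both have the same title and headings.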

The whole website deals with the topic (keywords appear in the domain/subdomain).

The words are an important topic in this page (internal link anchor texts jump to positions of the keyword, or anchor texts / link texts contain the keyword).

The same goes if external links use the keywords in their link text to link to this page.

Category 2 (How important/popular is the page?)

You will notice that not all factors point towards this exact goal. Some are included (especially by Google) just to give a boost to pages that... well... just deserved or earned it.

Content is king!

The existence of unique content that can't be found, or can be found only very little, in the rest of the web gives a boost. This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are much more sophisticated methods as well.
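One simple way to approximate "words that are generally used very little" is inverse document frequency (IDF). The sketch below is my own choice of technique for illustration, not something the factor list prescribes; `idf_table` and `uniqueness` are hypothetical names, and each document is treated as an unordered set of words.

```python
import math
import re


def idf_table(corpus):
    """Inverse document frequency of each word across a list of documents."""
    doc_words = [set(re.findall(r"[a-z0-9]+", doc.lower())) for doc in corpus]
    n = len(doc_words)
    vocabulary = set().union(*doc_words)
    return {w: math.log(n / sum(1 for d in doc_words if w in d)) for w in vocabulary}


def uniqueness(document, idf):
    """Mean IDF of the distinct words in a document; higher means rarer vocabulary."""
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    known = [idf[w] for w in words if w in idf]
    return sum(known) / len(known) if known else 0.0
```

A word that appears in every document gets an IDF of zero, so a page made up only of common words scores low, while a page using words that few other pages use scores higher.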

Recency - newer is better

Historical change (how often the page has been updated in the past; changing is good)

External link popularity (how many links in?)

If a page links to another page, the link is worth more if the linking page itself has a high pagerank.
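The idea that a link is worth more when it comes from a highly ranked page is what the textbook PageRank iteration expresses. Below is a minimal sketch of that textbook formulation, not Google's production algorithm; the damping factor of 0.85 and the fixed iteration count are conventional assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank; `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link passes on an equal share of the page's rank.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

For a tiny graph such as `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page `c` ends up ranked highest because two pages link to it, and `a` outranks `b` because its single inbound link comes from the highly ranked `c`.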

External link diversity

Basically links from different root domains, but other factors play a role too, even factors like how far apart the web servers of the linking sites are geographically (according to their IP addresses).
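Counting distinct root domains among inbound links can be sketched as below. Note the loud assumption: taking the last two labels of the host name is a naive approximation that gets suffixes such as `.co.uk` wrong; a real implementation would consult the Public Suffix List. The function names are hypothetical.

```python
from urllib.parse import urlparse


def root_domain(url):
    """Naive root domain: the last two labels of the host name (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def link_diversity(inbound_urls):
    """Number of distinct (naive) root domains among inbound link URLs."""
    domains = {root_domain(u) for u in inbound_urls}
    domains.discard("")  # URLs without a host name do not count
    return len(domains)
```

Two links from `blog.example.com` and `www.example.com` count as one domain here, which matches the point that many links from a single site say less than links from many different sites.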

Trust rank

For example, if big, trusted, established sites with editorial content link to you, you get a trust rank. That's why a link from The New York Times is worth much more than one from some strange new website, even if that website's PageRank is higher!

Domain trust

Your whole website gives a boost to your content if your domain is trusted. Different factors count here: links from trusted sites to your domain, of course, but it will even do you good if you are in the same datacenter as important websites.

Topic-specific links

If websites that can be resolved to a topic link to you, and the query can be resolved to this topic as well, it's good.

Distribution of links over time

If you earned a lot of links in a short period of time, this will do you good at that time and in the near future afterwards, but not so good later on. If you earn links slowly and steadily, it will do you good for content that is "timeless".

Links from restricted domains

A link from a .gov domain is worth a lot.

User click behaviour

What's the click rate of your search result?

Time spent on site

Google Analytics tracking, etc. It's also tracked whether the user clicks back or clicks another result after opening yours.

Collected user data

Votes, ratings, etc., references in Gmail, etc.

Now I will introduce a third category. One or two points from above would go into this category, but I hadn't thought of that... The category is:

How important/good is your website in general?

All your pages will be ranked up a bit depending on the quality of your website.

Factors include:

Good site architecture (easy to navigate, structured, sitemaps, etc.)

How established it is (long-existing domains are worth more).

Hoster information (what other websites are hosted near you?)

How often your full name is searched for.

Last, but not least, I want to say that a lot of these factors can be enriched by semantic technology, and new ones can be introduced.

For example, someone may search for Titanic while you have a website about icebergs... the two can be correlated, and that correlation may be reflected in the results.

Newly introduced semantic identifiers: for example, OWL tags may have a huge impact in the future.

For example, a blog about the movie Titanic could put a marker on its page stating that it's the same content as the Wikipedia article about the same movie.

This kind of linking is currently under heavy development and establishment, and nobody knows how it will be used.

Maybe duplicate content is filtered, and only the most important instance of the same content is displayed? Or maybe the other way round: you get presented with a lot of pages that match your query, even if they don't contain your keywords?

Google even applies factors with different relevance depending on the topic of your search query!
