Google App Engine - SiteMap Creation for a social network


Question


I am creating a social tool - I want to allow search engines to pick up "public" user profiles - like Twitter and Facebook.

I have seen all the protocol info at http://www.sitemaps.org and I understand it, and how to build such a file - along with an index if I exceed the 50K limit.

Where I am struggling is with how I make this run.

The sitemap for my general site pages is simple: I can use a tool or a script to create the file, host the file, submit the file, and be done.

What I then need is a script that will create the sitemaps of user profiles. I assume this would be something like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
       <url>
          <loc>http://www.socialsite.com/profile/spidee</loc>
          <lastmod>2010-05-12</lastmod>
          <changefreq>???</changefreq>
          <priority>???</priority>
       </url>
       <url>
          <loc>http://www.socialsite.com/profile/webbsterisback</loc>
          <lastmod>2010-05-12</lastmod>
          <changefreq>???</changefreq>
          <priority>???</priority>
       </url>
    </urlset>
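A minimal generator for such a urlset, sketched with only the Python standard library - the domain and usernames come from the sample above, while the `weekly` changefreq and `0.5` priority are placeholder assumptions, not recommendations:

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_urlset(profiles):
    """Build a sitemap <urlset> string from (username, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for username, lastmod in profiles:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = (
            "http://www.socialsite.com/profile/%s" % username)
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()  # W3C date
        ET.SubElement(url, "changefreq").text = "weekly"  # placeholder value
        ET.SubElement(url, "priority").text = "0.5"       # placeholder value
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            + ET.tostring(urlset, encoding="unicode"))

xml_out = build_urlset([("spidee", date(2010, 5, 12)),
                        ("webbsterisback", date(2010, 5, 12))])
```

Note that `lastmod` must be a W3C date (`YYYY-MM-DD`), which `date.isoformat()` produces for free.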

I've added some ??? because I don't know how I should set these values for my profiles, based on the following:

When a new profile is created it must be added to a sitemap. If the profile is changed, or if "certain" properties are changed, then I don't know whether I update the entry in the map or do something else (updating would be a nightmare!).

Some users may change their profile. In terms of relevance to the search engine, the only way a Google or Yahoo search will find a user's profile (for my requirement) is by means of [user name] and [location]. So once the entry for the profile has been added to the map file, the only reasons to have the search bot re-index the profile would be if the user changed their user name (which they can't), changed their location, or set their settings so that their profile is "hidden" from search engines.

I assume my map creation will need to be dynamic. From what I have said above, I would imagine that creating a new profile, and possibly editing certain properties, could mark it as needing to be added to or updated in the sitemap.

Assuming I will have millions of profiles being added and edited, how can I manage this in a sensible manner?

I know I need a script that can append URLs as each profile is created. I know the script will probably be a task running at a set frequency; perhaps the profiles have a property like "indexed", and the task sets it to "true" when the profile is added to the map. I don't see the best way to store the map. Do I store it in the datastore, i.e.:

    model = sitemaps

    properties:
        key_name = sitemap_xml_1 (and for my map index, sitemap_index_xml)
        mapxml = blobstore (the raw XML map or ROR map)
        full = boolean (set true when the URL count reaches 50K)  # might need this, as a shard will tell us
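A plain-Python stand-in for that proposed entity (on App Engine this would be a datastore model with a blob property for the XML; the field names mirror the sketch above and are otherwise assumptions):

```python
from dataclasses import dataclass

MAX_URLS_PER_MAP = 50000  # per-file limit from sitemaps.org

@dataclass
class SitemapShard:
    """Stand-in for the proposed sitemap entity; on App Engine, map_xml
    would be a blob property and key_name the datastore key name."""
    key_name: str        # e.g. "sitemap_xml_1" or "sitemap_index_xml"
    map_xml: str = ""    # the raw XML map
    url_count: int = 0   # maintained by the shard counter

    @property
    def full(self):
        # the "full" flag: true once the URL count reaches the 50K limit
        return self.url_count >= MAX_URLS_PER_MAP

shard = SitemapShard("sitemap_xml_1", url_count=49950)
```

Deriving `full` from the counter, rather than storing it as a separate boolean, avoids the two values ever disagreeing - which matches the "a shard will tell us" comment above.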

To make this work, my thoughts are:

Memcache the current sitemap structure as "sitemap_xml" and keep a shard counter of the URL count. When my task executes:

1. Build the XML structure for, say, the first 100 URLs marked "index == false" (how many could you run at a time?).
2. Test whether the current memcached sitemap is full (shard counter + 100 > 50K).
3. a. If the map is near full, create a new map entry in the model as "sitemap_xml_2", update the map index file (also stored in my model as "sitemap_index"), and start a new shard, or reset it. b. If the map is not full, grab it from memcache.
4. Append the 100-URL XML structure.
5. Save / memcache the map.
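Those numbered steps could be sketched like this, with plain lists and dicts standing in for the pending-URL query, memcache and the datastore (all names here are hypothetical):

```python
BATCH_SIZE = 100
MAX_URLS_PER_MAP = 50000

def sitemap_task(pending_urls, shards, current_shard, url_count):
    """One execution of the periodic task: move up to BATCH_SIZE pending
    URLs into the current sitemap shard, rolling over to a new shard
    when the 50K limit would be exceeded."""
    batch = pending_urls[:BATCH_SIZE]                      # step 1
    entries = "".join("<url><loc>%s</loc></url>" % u for u in batch)
    if url_count + len(batch) > MAX_URLS_PER_MAP:          # steps 2 & 3a
        current_shard += 1                                 # start a new map
        url_count = 0                                      # reset the shard counter
    name = "sitemap_xml_%d" % current_shard                # step 3b
    shards[name] = shards.get(name, "") + entries          # step 4: append
    url_count += len(batch)                                # step 5: save state
    return pending_urls[BATCH_SIZE:], current_shard, url_count

shards = {}
remaining, shard_no, count = sitemap_task(
    ["http://a", "http://b"], shards, 1, 0)
```

In the real task, the returned state would go back into memcache/the datastore, and the processed profiles would have their "indexed" flag set to true in the same pass.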

I can now add a handler using a URL map/route like /sitemaps/*, take my * as the map name, and serve the maps from the blobstore/memcache on the fly.
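The serving side could then be a thin lookup; a sketch with a dict standing in for memcache (on App Engine this logic would live in a request handler mapped to the /sitemaps/.* route):

```python
def serve_sitemap(path, cache):
    """Resolve /sitemaps/<name> to a cached XML blob; returns a
    (status, body) pair the real handler would write to the response."""
    name = path.rstrip("/").rsplit("/", 1)[-1]  # the "*" part of the route
    xml = cache.get(name)
    if xml is None:
        return 404, "sitemap not found"
    return 200, xml

status, body = serve_sitemap(
    "/sitemaps/sitemap_xml_1", {"sitemap_xml_1": "<urlset/>"})
```

On a cache miss the real handler would fall back to the blobstore/datastore copy and repopulate memcache before responding.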

Now my question is: does this work? Is this the right way, or at least a good way to start? Will it handle making sure the search bots update when a user changes their profile, possibly by setting the change frequency correctly? Do I need a more advanced system :( ? Or have I re-invented the wheel?

I hope this is all clear and makes some form of sense :-)

Solution

Update frequency

Cache invalidation is a hard problem, see: Cache Invalidation - Is there a General Solution?

As far as I can see, you need to decide how often you want search bots to recrawl your site, rather than how often things actually change; if a user's page may contain information they want removed at short notice, then you want the search bot to re-crawl within a couple of days, even though profiles rarely change on average.

Keeping an up-to-date map

Since the speed of your website now figures in its Google PageRank, it's worth updating a static file ready to serve up to the spiders. Perhaps have one script that continually updates a db table with sitemap entries, and another that periodically regenerates the static file(s) from the db table. That way, there is always a static version available for the spiders and it can all happen asynchronously.
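That two-script split might look like this, with a callable and a dict standing in for the db query and the cached "static" copy (names are illustrative):

```python
import time

def refresh_sitemap(fetch_urls, cache):
    """Periodic job: rebuild the whole sitemap from the table and swap it
    into the cache in a single assignment, so spiders never see a
    half-built file while updates happen asynchronously."""
    body = "".join("<url><loc>%s</loc></url>" % u for u in fetch_urls())
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
           '%s</urlset>' % body)
    cache["sitemap_xml"] = xml
    cache["sitemap_refreshed_at"] = time.time()  # for staleness monitoring
    return xml

cache = {}
result = refresh_sitemap(
    lambda: ["http://www.socialsite.com/profile/spidee"], cache)
```

The writer script only ever touches the db table; this regenerator is the only thing that writes the served copy, which keeps the two concerns decoupled.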

Static pages on App Engine

I forgot that you can't have static page files on App Engine. According to this SO question, the best way is to generate your file and push it to memcache. Also see the documentation on using memcache with App Engine.
