How does a site like kayak.com aggregate content?


Question

Greetings, I've been toying with an idea for a new project and was wondering how a service like Kayak.com is able to aggregate data from so many sources so quickly and accurately. More specifically, do you think Kayak.com is interacting with APIs, or is it crawling/scraping airline and hotel websites to fulfill user requests? I know there isn't one right answer for this sort of thing, but I'm curious what others think would be a good way to go about it. If it helps, pretend you are going to create kayak.com tomorrow ... where is your data coming from?

Recommended answer

I work in the travel industry as a software architect / project lead on precisely the kind of project you describe. In our region we work with suppliers directly, but for outgoing travel we connect to several aggregators.

To answer your question... some data you have, some you get in various ways, and some you have to torture and twist until it confesses.

The questions you have to ask are... Do you want to sell advertising like Kayak, or do you take a cut like Expedia? Are you into search or into selling travel services? Do you target a niche (for example, just air travel) or everything (accommodation, airlines, rent-a-car, additional services like transport/sightseeing/conferences, etc.)? Do you target a region (the US or part of the US) or the world? How deep do you go: do you just show several sites on a single screen, or do you bundle different services together and package them dynamically?

If you're going with the Kayak business model, you technically don't need a site's permission... but a lot of sites have affiliate programs with IFrames or other simple ways to direct the customer to their site. On the plus side, you don't have to deal with payments, complaints, or the travelers themselves. As for the cons... if you want to compare prices yourself and present the cheapest option to the user, you'll have to integrate on a deeper level, and that means APIs and web scraping.
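A minimal sketch of that deeper integration, assuming each supplier (API or scraper) is wrapped in an async call that returns a list of offers. The `Offer` type, `fake_provider`, and the provider names are hypothetical stand-ins, not anything Kayak is known to use. The point is only the shape: query everyone concurrently, drop sources that fail or are too slow, and sort what is left by price.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str
    flight: str
    price_eur: float


async def fake_provider(name: str, delay: float,
                        offers: list[tuple[str, float]]) -> list[Offer]:
    """Hypothetical stand-in for a supplier call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return [Offer(name, flight, price) for flight, price in offers]


async def search_all(provider_calls, timeout: float = 1.0) -> list[Offer]:
    """Run all provider calls concurrently and merge the results, cheapest first."""
    guarded = [asyncio.wait_for(call, timeout) for call in provider_calls]
    results = await asyncio.gather(*guarded, return_exceptions=True)
    offers: list[Offer] = []
    for result in results:
        if isinstance(result, BaseException):
            continue  # a failed or timed-out source must not sink the whole search
        offers.extend(result)
    return sorted(offers, key=lambda offer: offer.price_eur)


def demo() -> list[Offer]:
    calls = [
        fake_provider("api_a", 0.01, [("A1", 120.0)]),
        fake_provider("api_b", 0.02, [("B1", 95.5), ("B2", 140.0)]),
        fake_provider("slow_scraper", 5.0, [("S1", 80.0)]),  # exceeds the timeout
    ]
    return asyncio.run(search_all(calls, timeout=0.2))


if __name__ == "__main__":
    for offer in demo():
        print(offer)
```

The slow source would have had the cheapest fare, which is exactly the trade-off between response time and completeness that an aggregator has to tune.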

As for web scraping... avoid it. It sucks. Really. Just don't do it. Trust me on this one. That said, some things, like low-cost carriers, you can't get without web scraping. Low-cost airlines live off value-added services. If the user doesn't see their website, they don't sell extra stuff, and they don't earn anything. Therefore, they don't have affiliates, they don't offer APIs, and they change their site layout almost constantly. However, there are companies that earn a living by scraping low-cost carriers' sites and wrapping them into nice APIs. If you can afford them, you can give your users a cost comparison of low-cost flights, and that's huge.

On the other hand, there are "normal" carriers which offer APIs. Getting to airlines isn't that big a problem since they're all united under IATA; basically, you buy from IATA, and IATA distributes the money to the carriers. However, you probably don't want to connect directly to a carrier network. They have web services and SOAP these days, but believe me when I say that some SOAP protocols are just an insanely thin wrapper around a text prompt through which you interact with a mainframe using an '80s-style protocol (think of a Unix prompt where you're billed per command, and it takes about 20 commands to do one search). That's why you probably want to connect to somebody a bit further down the food chain, with a better API.

Airlines are thus at both extremes of the Gaussian curve: on one side are individual suppliers, and on the other are highly centralized systems where you implement one API and can fly anywhere in the world. Accommodation and the rest of the travel products are in between. There are several big players that aggregate hotels, and a ton of small suppliers with a lot of aggregators that cover only part of the spectrum. For example, you can rent a lighthouse, and it's not even that expensive, but you won't be able to compare the prices of different lighthouses in one place.

If you're into the Kayak business model, you'll probably end up scraping websites. If you're into integrating different providers, you'll often work with APIs, some of which are pretty good and most of which are tolerable. I haven't worked with RSS, but there's not a lot of difference between RSS and web scraping. There is also a fourth option not mentioned in Jeff's answer... the one where you get your data nightly, for example as .CSV files over FTP and similar.
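To make the nightly-file option concrete, here is a rough sketch of loading such a feed once it has been downloaded. The column names (`hotel_id`, `city`, `room_type`, `price_eur`) and the sample rows are assumptions for illustration; real supplier feeds each have their own layout. Bad rows are collected rather than aborting the import, since one malformed line should not cost you a whole night's data.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class HotelRate:
    hotel_id: str
    city: str
    room_type: str
    price_eur: Decimal


def load_nightly_feed(text: str) -> tuple[list[HotelRate], list[str]]:
    """Parse a CSV feed; return the valid rates and a list of row-level errors."""
    rates: list[HotelRate] = []
    errors: list[str] = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            rates.append(HotelRate(
                hotel_id=row["hotel_id"].strip(),
                city=row["city"].strip(),
                room_type=row["room_type"].strip().lower(),
                price_eur=Decimal(row["price_eur"].strip()),
            ))
        except (KeyError, AttributeError, InvalidOperation) as exc:
            # Missing column, short row (value is None), or unparsable price.
            errors.append(f"line {line_no}: {exc!r}")
    return rates, errors


# Hypothetical feed contents; the last row has a price the supplier left blank-ish.
SAMPLE = "\n".join([
    "hotel_id,city,room_type,price_eur",
    "H1,Zagreb,Double,30.00",
    "H2,Split,single,20",
    "H3,Zagreb,double,n/a",
])

if __name__ == "__main__":
    good, bad = load_nightly_feed(SAMPLE)
    print(len(good), "rates loaded;", len(bad), "rows rejected")
```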

And then there's complexity. The more value you want to add, the more complexity you'll have to handle. Can you search for accommodation that allows pets? For a hostel located less than 5 km from the town center? Are you combining flights, and can you guarantee that the traveler will have enough time to get from one airport to another... can you sell the transport in advance? A famous cellist doesn't want to part with his precious 18th-century cello; can you sell him another seat for the cello (yep, not making this one up)?
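The flight-combination check can be sketched as below. Everything numeric here is an assumption: the 60-minute same-airport buffer, the 90-minute LHR/LGW transfer, and the `Leg` type are made up for illustration, and real minimum connection times come from the airports and carriers. All times are assumed to be timezone-aware UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Leg:
    origin: str
    destination: str
    departs: datetime  # timezone-aware, UTC
    arrives: datetime  # timezone-aware, UTC


# Hypothetical buffer for changing planes within one airport.
SAME_AIRPORT_MIN = timedelta(minutes=60)
# Hypothetical ground transfer times between airport pairs.
TRANSFER_TIMES = {frozenset({"LHR", "LGW"}): timedelta(minutes=90)}


def is_feasible_connection(first: Leg, second: Leg) -> bool:
    """True if the traveler has enough time between the two legs."""
    layover = second.departs - first.arrives
    if first.destination == second.origin:
        return layover >= SAME_AIRPORT_MIN
    transfer = TRANSFER_TIMES.get(frozenset({first.destination, second.origin}))
    if transfer is None:
        return False  # unknown airport pair: nothing we can guarantee
    return layover >= transfer + SAME_AIRPORT_MIN


def utc(hour: int, minute: int = 0) -> datetime:
    """Helper for building sample times on one arbitrary day."""
    return datetime(2024, 5, 1, hour, minute, tzinfo=timezone.utc)


if __name__ == "__main__":
    inbound = Leg("ZAG", "LHR", utc(8), utc(10))
    tight = Leg("LGW", "JFK", utc(12), utc(20))
    print(is_feasible_connection(inbound, tight))
```

Refusing unknown airport pairs is a deliberate choice in this sketch: if you cannot price or time the transfer, you cannot promise the connection.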

Want to compare prices? Sure, the room is EUR 30 per night. But you can either get one double for 30 and one single for 20, or you can get one extra bed in a double and get 70% off for the third person. But only if it's a child under 12 years of age; our extra beds are not for adults. And you don't get the price of the extra bed in search results, only when you calculate the final price.
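A worked version of that example for a party of three, as a sketch only. The text does not say what the 70% discount is taken off, so the undiscounted extra-bed price (`EXTRA_BED_BASE`, set equal to the single room) is an assumption, as is treating all prices as per night.

```python
from decimal import Decimal

DOUBLE = Decimal("30")           # EUR per night, from the example
SINGLE = Decimal("20")           # EUR per night, from the example
EXTRA_BED_BASE = Decimal("20")   # ASSUMPTION: undiscounted extra-bed price
CHILD_DISCOUNT = Decimal("0.70")
CHILD_AGE_LIMIT = 12             # extra bed only for children under this age


def cheapest_for_three(third_person_age: int, nights: int) -> tuple[str, Decimal]:
    """Return the cheapest room setup and its total price for two adults plus one."""
    options = {"double + single": (DOUBLE + SINGLE) * nights}
    if third_person_age < CHILD_AGE_LIMIT:
        extra_bed = EXTRA_BED_BASE * (1 - CHILD_DISCOUNT)
        options["double + extra bed"] = (DOUBLE + extra_bed) * nights
    best = min(options, key=options.get)
    return best, options[best]


if __name__ == "__main__":
    print(cheapest_for_three(third_person_age=8, nights=3))
    print(cheapest_for_three(third_person_age=30, nights=3))
```

Under these assumptions the same search for "three people, three nights" has two different right answers depending on one traveler's age, which is why the final price only appears after the full calculation.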

And don't even get me started on dynamic packaging. Want to sell accommodation + rent-a-car? No problem; integrate with two different providers, and off you go... manually updating the list of locations in the city (from the rent-a-car provider) to match them with hotels (from the accommodation provider, who gives you only the city for each hotel). Of course, that's provided you've already matched the lists of cities from the two, since there is no international standard for city codes.
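A first pass at matching two providers' city lists might look like the sketch below, using name normalization plus fuzzy matching from Python's standard `difflib`. The provider codes and city lists are invented, and the 0.85 cutoff is an arbitrary assumption. This only catches spelling and accent variants; it will not map "Wien" to "Vienna", so anything left unmatched still goes to a human.

```python
import difflib
import unicodedata
from typing import Optional


def normalize(name: str) -> str:
    """Strip accents, case, and extra whitespace so spelling variants line up."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def match_cities(hotel_cities: dict[str, str], car_cities: dict[str, str],
                 cutoff: float = 0.85) -> dict[str, Optional[str]]:
    """Map each hotel-provider city code to a car-provider code, or None."""
    index = {normalize(name): code for code, name in car_cities.items()}
    matches: dict[str, Optional[str]] = {}
    for code, name in hotel_cities.items():
        close = difflib.get_close_matches(normalize(name), list(index),
                                          n=1, cutoff=cutoff)
        matches[code] = index[close[0]] if close else None  # None: review by hand
    return matches


# Hypothetical code lists from two providers.
HOTEL_CITIES = {"H-ZAG": "Zagreb", "H-MUC": "München", "H-XXX": "Atlantis"}
CAR_CITIES = {"C1": "ZAGREB ", "C2": "Munchen", "C3": "Split"}

if __name__ == "__main__":
    for hotel_code, car_code in match_cities(HOTEL_CITIES, CAR_CITIES).items():
        print(hotel_code, "->", car_code)
```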

Unlike a lot of other industries that have many products, the travel industry has many very complex products. Amazon has it easy; selling books and selling potatoes is the same thing; you can even ship them in the same box. They combine easily and aren't assembled from many parts. :)

P.S. Linking to an interesting recent thread on Hacker News with some insider info regarding flights.

P.P.S. Recently stumbled on a great, albeit rather old, blog post on IATA's NDC protocol, with an overview of how the travel industry is connected and a history lesson on how this came to be.
