Crawling the Google Play store


Question

I want to crawl the Google Play store to download the web pages of all Android applications (all pages under the base URL https://play.google.com/store/apps/). I checked the Play Store's robots.txt file, and it appears to disallow crawling these URLs.

Also, when I browse the Google Play store, I can only see up to 3 pages of top applications for each category. How can I get the other application pages?

If anyone has tried crawling Google Play, please let me know the following:

a) Were you successful in crawling the Play Store? If so, how did you do it?
b) How can I crawl the hidden application pages that are not visible in the top apps of each category?
c) Is there a technique for downloading the applications themselves, and not just the web pages?

I have already searched around and found the following links:

a) https://code.google.com/p/android-market-api/ 
b) https://code.google.com/p/android-marketplace-crawler/source/checkout 
c) http://mohsin-junaid.blogspot.co.uk/2012/12/how-to-install-android-marketplace.html 
d) http://mohsin-junaid.blogspot.in/2012/12/how-to-download-multiple-android-apks.html

Thanks!

Recommended answer

First of all, Google Play's robots.txt does NOT disallow pages under the base path "/store/apps".
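A quick way to check a claim like this is to test a URL against the robots.txt rules with Python's standard `urllib.robotparser`. The sketch below is only an illustration: `SAMPLE_ROBOTS_TXT` is a made-up rule set, not the real contents of Google Play's file, and `is_allowed` is a helper name chosen for this example. For a real check you would fetch the current file from https://play.google.com/robots.txt and pass its text in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only. The real robots.txt must be
# fetched from https://play.google.com/robots.txt and may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /store/account
Allow: /store/apps
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    app_url = "https://play.google.com/store/apps/details?id=com.example.app"
    print(is_allowed(SAMPLE_ROBOTS_TXT, app_url))
```

Rules are matched by path prefix in the order they appear, so the first matching `Allow` or `Disallow` line decides the result.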

If you want to crawl Google Play, you need to develop your own web crawler, parse the HTML of each page, and extract the app metadata you need (e.g. title, description, price). This topic has been covered in another question. There are libraries that help with this, for instance Scrapy (see below).
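As a rough illustration of the "parse the HTML page and extract the metadata" step, here is a minimal sketch using only Python's standard `html.parser`. The markup in `SAMPLE_HTML` and the `price` meta tag are assumptions made up for this example; Google Play's actual page structure is different and changes over time, so the selectors would need to be adapted to whatever the live pages contain.

```python
from html.parser import HTMLParser

# Hypothetical page markup; the real Google Play HTML is not shown here.
SAMPLE_HTML = """
<html><head>
  <title>Example App</title>
  <meta name="description" content="A short description of the app.">
  <meta name="price" content="$1.15">
</head><body></body></html>
"""


class AppPageParser(HTMLParser):
    """Collects the <title> text and every <meta> name/property -> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return title, description and price found in the page (empty if absent)."""
    parser = AppPageParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "price": parser.meta.get("price", ""),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

A dedicated scraping library would normally replace the hand-written parser; the standard library is used here only to keep the sketch self-contained.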

The harder part is "finding" the app pages to crawl. You could either 1) use the Google Play Sitemap, or 2) follow the app links found in every page you crawl, as explained in the Link Extractor documentation (in case you plan to use Scrapy).
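Option 2 amounts to a breadth-first traversal: pull the app links out of each page, queue the ones not seen before, and repeat. The sketch below shows that idea with the standard library only, not with Scrapy's Link Extractor. The names `extract_app_ids`, `discover`, and `fetch_page` are invented for this example, and `fetch_page` is a caller-supplied function (here backed by fake pages) so that no real network request is made.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://play.google.com"
APP_PATH = "/store/apps/details"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_app_ids(html, page_url=BASE_URL):
    """Return the package ids of all app-detail links in the page, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    app_ids = []
    for href in collector.hrefs:
        parsed = urlparse(urljoin(page_url, href))
        if parsed.netloc == "play.google.com" and parsed.path == APP_PATH:
            ids = parse_qs(parsed.query).get("id")
            if ids and ids[0] not in app_ids:
                app_ids.append(ids[0])
    return app_ids


def discover(seed_ids, fetch_page, max_apps=100):
    """Breadth-first discovery; fetch_page(app_id) must return that page's HTML."""
    seen = set(seed_ids)
    queue = deque(seed_ids)
    visited = []
    while queue and len(visited) < max_apps:
        app_id = queue.popleft()
        visited.append(app_id)
        for found in extract_app_ids(fetch_page(app_id)):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return visited


if __name__ == "__main__":
    # Fake pages standing in for real HTTP responses.
    fake_pages = {
        "com.example.a": '<a href="/store/apps/details?id=com.example.b">B</a>',
        "com.example.b": '<a href="/store/apps/details?id=com.example.a">A</a>',
    }
    print(discover(["com.example.a"], lambda app_id: fake_pages.get(app_id, "")))
```

In a real crawler `fetch_page` would perform the HTTP request, and it is the natural place to add rate limiting and robots.txt checks.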

Another option is to use an open-source library based on ProtoBuf to fetch metadata about an app. Here is the link to the project: https://code.google.com/archive/p/android-market-api. This library fetches app metadata from Google Play on behalf of a valid Google account, but in this case you still need a crawler to "find" which apps are available and to schedule their metadata retrieval. This other open-source project can help you with that: https://code.google.com/archive/p/android-marketplace-crawler.

If you don't want to implement all this yourself, you could use a third-party managed service to access Android app metadata through a JSON-based API. For instance, 42matters.com (the company I work for) offers an API for both Android and iOS to retrieve app metadata. More details here:

https://42matters.com/app-market-data

To get the title, icon, description, and downloads for an app, you can use the "lookup" endpoint documented here:

https://42matters.com/docs/app-market-data/android/apps/lookup

This is an example of the JSON response for the "Angry Birds Space Premium" app:

{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "description": "Play over 300 interstellar levels across 10 planets...",
    "short_desc": "The #1 mobile game of all time blasts off into space!",
    "rating": 4.3046236038208,
    "category": "Arcade",
    "cat_key": "GAME_ARCADE",
    "cat_keys": [
        "GAME_ARCADE",
        "GAME",
        "FAMILY_EDUCATION",
        "FAMILY"
    ],
    "price": "$1.15",
    "downloads": "1,000,000 - 5,000,000",
    "version": "2.2.1",
    "content_rating": "Everyone",
    "promo_video": "https://www.youtube.com/embed/g6AL9YqRHaI?ps=play&vq=large&rel=0&autohide=1&showinfo=0&autoplay=1",
    "market_update": "2015-07-03T00:00:00+00:00",
    "screenshots": [
        "https://lh3.googleusercontent.com/ZmuBQzIy1G74coPrQ1R7fCeKdJmjTdpJhNrIHBOaFyM0N2EYdUPwZaQjnQUtiUDGmac=h310",
        "https://lh3.googleusercontent.com/Xg2Aq70ZH0SnNhtSKH7xg9jCfisWgmmq3C7xQbx6YMhTVAIRqlRJeH8GYtjxapb_qR4=h310",
        "https://lh3.googleusercontent.com/T4o5-2_UP82sj4fSSegbjrGmslNHlfvtEYuZacXMSOC55-7eyiKySw05lNF1QQGO2FeU=h310",
        "https://lh3.googleusercontent.com/f2ennaLdivFu5cQQaVPKsRcWxB8FS5T4Bkoy3l0iPW9-GDDnTVRhvR5kz6l4m8FL1c8=h310",
        "https://lh3.googleusercontent.com/H-9M03_-O9Df1nHr2-rUdjtk2aeBY3bAxnqSX3m2zh_aV8-K1t0qU1DxLXnK0GrDAw=h310"
    ],
    "created": "2012-03-22T08:24:00+00:00",
    "developer": "Rovio Entertainment Ltd.",
    "number_ratings": 20812,
    "price_currency": "$",
    "icon": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w300",
    "icon_72": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w72",
    "market_url": "https://play.google.com/store/apps/details?id=com.rovio.angrybirdsspace.premium&referrer=utm_source%3D42matters.com%26utm_medium%3Dapi"
}
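To show how a response like this might be consumed, here is a small sketch that parses a trimmed copy of the JSON above and normalizes a few fields. The field names come from the example response; the helper names `parse_download_range` and `summarize` are made up for this illustration, and the request to the lookup endpoint itself is deliberately left out because its parameters are not shown here.

```python
import json
from datetime import datetime

# A trimmed copy of the example response shown above.
SAMPLE_RESPONSE = """
{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "rating": 4.3046236038208,
    "price": "$1.15",
    "downloads": "1,000,000 - 5,000,000",
    "market_update": "2015-07-03T00:00:00+00:00"
}
"""


def parse_download_range(text):
    """Turn a string like '1,000,000 - 5,000,000' into a (low, high) int pair."""
    low, high = (int(part.strip().replace(",", "")) for part in text.split("-"))
    return low, high


def summarize(response_text):
    """Extract and normalize a few fields from a lookup-style JSON response."""
    app = json.loads(response_text)
    low, high = parse_download_range(app["downloads"])
    return {
        "package_name": app["package_name"],
        "title": app["title"],
        "rating": round(app["rating"], 1),
        "min_downloads": low,
        "max_downloads": high,
        "updated": datetime.fromisoformat(app["market_update"]).date().isoformat(),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
```

Converting the download range and the timestamp into numbers and dates up front makes the records easier to sort and filter later.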

I hope this helps; otherwise, feel free to get in touch with me. I know this topic quite well and can point you in the right direction.

Regards,

Andrea
