Crawling the Google Play store


Problem description

I want to crawl the Google Play store to download the web pages of all Android applications (all pages under the base URL https://play.google.com/store/apps/). I checked the Play store's robots.txt file, and it disallows crawling these URLs.

Also, when I browse the Google Play store, I can only see up to 3 pages of top applications for each category. How can I get the other application pages?

If anyone has tried crawling Google Play, please let me know the following:
a) Were you successful in crawling the Play store? If yes, how did you do it?
b) How can I crawl the hidden application pages that are not visible in the top apps for each category?
c) Is there a technique to download the applications themselves, not just the web pages?

I already searched around and found the following links:

a) https://code.google.com/p/android-market-api/ 
b) https://code.google.com/p/android-marketplace-crawler/source/checkout 
c) http://mohsin-junaid.blogspot.co.uk/2012/12/how-to-install-android-marketplace.html 
d) http://mohsin-junaid.blogspot.in/2012/12/how-to-download-multiple-android-apks.html

Thanks!

Solution

First of all, Google Play's robots.txt does NOT disallow the pages with base "/store/apps".
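
If you want to check this kind of claim yourself, Python's standard urllib.robotparser can evaluate a URL against a robots.txt file. The sketch below is only an illustration: SAMPLE_ROBOTS_TXT is a made-up rule set, not the real Google Play file, and is_allowed is a helper name introduced here. To test the real rules you would pass in the text of the actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; NOT the real Google Play robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /store/apps
"""

if __name__ == "__main__":
    url = "https://play.google.com/store/apps/details?id=com.example"
    print(is_allowed(SAMPLE_ROBOTS_TXT, url))
```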

If you want to crawl Google Play, you need to develop your own web crawler, parse the HTML pages, and extract the app meta-data you need (e.g. title, description, price). This topic has been covered in another question. Libraries such as Scrapy can help with that.
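
As a rough idea of the parsing step, here is a minimal sketch that uses only Python's standard html.parser. It assumes the title lives in the <title> tag and the description in a <meta name="description"> tag. That is an assumption about the markup, and SAMPLE_HTML is invented for the example, so you would have to adapt the extraction to whatever the real pages contain.

```python
from html.parser import HTMLParser


class AppMetaParser(HTMLParser):
    """Collect the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return the page title and meta description found in `html`."""
    parser = AppMetaParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
    }


# Invented markup for illustration; real app pages will look different.
SAMPLE_HTML = """<html><head>
<title>Example App - Apps on Google Play</title>
<meta name="description" content="A made-up description.">
</head><body></body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```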

The harder part is to "find" the app pages to crawl. You could 1) use the Google Play sitemap or 2) follow the app links you find in every page you crawl, as explained in the Link Extractor documentation (in case you plan to use Scrapy).
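
Option 2 can be sketched without any framework: collect the links on each page, keep those that point to /store/apps/details, and queue the package ids you have not seen yet. This is a simplified breadth-first version, not what Scrapy does internally. The fetch callable, the FAKE_PAGES data, and the function names are all hypothetical, and a real crawler would also need rate limiting and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse

BASE_URL = "https://play.google.com/store/apps/"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_app_ids(html, page_url=BASE_URL):
    """Return the package ids of all app-detail links found in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    ids = []
    for href in collector.hrefs:
        parts = urlparse(urljoin(page_url, href))  # resolve relative links
        if parts.netloc == "play.google.com" and parts.path == "/store/apps/details":
            for app_id in parse_qs(parts.query).get("id", []):
                if app_id not in ids:
                    ids.append(app_id)
    return ids


def crawl(seed_ids, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url) -> html` is supplied by the caller."""
    queue = deque(seed_ids)
    seen = set(seed_ids)
    visited = []
    while queue and len(visited) < max_pages:
        app_id = queue.popleft()
        url = f"{BASE_URL}details?id={app_id}"
        html = fetch(url)
        visited.append(app_id)
        for found in extract_app_ids(html, url):
            if found not in seen:
                seen.add(found)
                queue.append(found)
    return visited


# Made-up pages standing in for real HTTP responses.
FAKE_PAGES = {
    "com.example.a": '<a href="/store/apps/details?id=com.example.b">B</a>'
                     '<a href="https://example.org/x">other</a>',
    "com.example.b": '<a href="/store/apps/details?id=com.example.a">A</a>'
                     '<a href="details?id=com.example.c">C</a>',
    "com.example.c": "",
}


def fake_fetch(url):
    """Look up the fake page for the package id in `url`."""
    return FAKE_PAGES[parse_qs(urlparse(url).query)["id"][0]]


if __name__ == "__main__":
    print(crawl(["com.example.a"], fake_fetch))
```

Keeping fetch as a parameter lets you plug in any HTTP client and test the discovery logic offline.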

Another option is to use an open-source library based on ProtoBuf to fetch meta-data about an app; here is the link to the project: https://code.google.com/archive/p/android-market-api. This library fetches app meta-data from Google Play on behalf of a valid Google account, but in this case too you need a crawler to "find" which apps are available and schedule their meta-data retrieval. This other open-source project can help you with that: https://code.google.com/archive/p/android-marketplace-crawler.

If you don't want to implement all this yourself, you could use a third-party managed service to access Android app meta-data through a JSON-based API. For instance, 42matters.com (the company I work for) offers an API for both Android and iOS to retrieve apps' meta-data; more details here:

https://42matters.com/app-market-data

To get the title, icon, description, and downloads for an app, you can use the "lookup" endpoint, documented here:

https://42matters.com/docs/app-market-data/android/apps/lookup

This is an example of the JSON response for the "Angry Birds Space Premium" app:

{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "description": "Play over 300 interstellar levels across 10 planets...",
    "short_desc": "The #1 mobile game of all time blasts off into space!",
    "rating": 4.3046236038208,
    "category": "Arcade",
    "cat_key": "GAME_ARCADE",
    "cat_keys": [
        "GAME_ARCADE",
        "GAME",
        "FAMILY_EDUCATION",
        "FAMILY"
    ],
    "price": "$1.15",
    "downloads": "1,000,000 - 5,000,000",
    "version": "2.2.1",
    "content_rating": "Everyone",
    "promo_video": "https://www.youtube.com/embed/g6AL9YqRHaI?ps=play&vq=large&rel=0&autohide=1&showinfo=0&autoplay=1",
    "market_update": "2015-07-03T00:00:00+00:00",
    "screenshots": [
        "https://lh3.googleusercontent.com/ZmuBQzIy1G74coPrQ1R7fCeKdJmjTdpJhNrIHBOaFyM0N2EYdUPwZaQjnQUtiUDGmac=h310",
        "https://lh3.googleusercontent.com/Xg2Aq70ZH0SnNhtSKH7xg9jCfisWgmmq3C7xQbx6YMhTVAIRqlRJeH8GYtjxapb_qR4=h310",
        "https://lh3.googleusercontent.com/T4o5-2_UP82sj4fSSegbjrGmslNHlfvtEYuZacXMSOC55-7eyiKySw05lNF1QQGO2FeU=h310",
        "https://lh3.googleusercontent.com/f2ennaLdivFu5cQQaVPKsRcWxB8FS5T4Bkoy3l0iPW9-GDDnTVRhvR5kz6l4m8FL1c8=h310",
        "https://lh3.googleusercontent.com/H-9M03_-O9Df1nHr2-rUdjtk2aeBY3bAxnqSX3m2zh_aV8-K1t0qU1DxLXnK0GrDAw=h310"
    ],
    "created": "2012-03-22T08:24:00+00:00",
    "developer": "Rovio Entertainment Ltd.",
    "number_ratings": 20812,
    "price_currency": "$",
    "icon": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w300",
    "icon_72": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w72",
    "market_url": "https://play.google.com/store/apps/details?id=com.rovio.angrybirdsspace.premium&referrer=utm_source%3D42matters.com%26utm_medium%3Dapi"
}
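
Once you have a response like the one above, pulling out the fields is straightforward. The sketch below parses a trimmed copy of that response with Python's json module. The summarize and parse_download_range helpers are names made up for this example, and the "low - high" format of the downloads field is inferred from this single sample, so treat that part as an assumption.

```python
import json

# Trimmed copy of the example response shown above.
SAMPLE_RESPONSE = """
{
    "package_name": "com.rovio.angrybirdsspace.premium",
    "title": "Angry Birds Space Premium",
    "description": "Play over 300 interstellar levels across 10 planets...",
    "downloads": "1,000,000 - 5,000,000",
    "icon": "https://lh3.ggpht.com/aQaIEGrmba1ENSEgUtArdm3yhJUug7BRWlu_WaspoJusZyHv1rjlWtYqe_qRjE_Kmh1E=w300"
}
"""


def parse_download_range(text):
    """Turn '1,000,000 - 5,000,000' into (1000000, 5000000).

    Assumes the 'low - high' format seen in the sample response.
    """
    low, high = (int(part.strip().replace(",", "")) for part in text.split("-"))
    return low, high


def summarize(response_text):
    """Extract title, icon, description and download range from a lookup response."""
    app = json.loads(response_text)
    return {
        "title": app["title"],
        "icon": app["icon"],
        "description": app["description"],
        "downloads": parse_download_range(app["downloads"]),
    }


if __name__ == "__main__":
    summary = summarize(SAMPLE_RESPONSE)
    print(summary["title"], summary["downloads"])
```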

I hope this helps, otherwise feel free to get in touch with me. I know this topic quite well and can point you in the right direction.

Regards,

Andrea
