Scraping data from an HTTP & JavaScript site


Question

I currently want to scrape some data from an Amazon page and I'm kind of stuck.

For example, let's take this page.

https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1

I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.

There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.

For example, in asinToDimentionIndexMap we can see

"B01KWIUH5M":[0,0]

This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (same idea as before).

I want to scrape both the variationValues and the asinToDimentionIndexMap, so I can associate the index map numbers with the variationValues entries.

Another person on the site (thanks for the help, btw) suggested doing it this way.

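import json
import re

# Get the text of the first <script> element (Scrapy selector API).
script = response.xpath('//script/text()').extract_first()
# Capture the text between braces; the non-greedy match stops at the first "}".
data = re.findall(r'(\{.+?\})', script)

d = json.loads(data[0])
d['products'][0]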

I can sort of understand the first part. We get the 'script' content as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading about it didn't help much.

Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimentionIndexMap (maybe using some regular expressions in the middle to pull the data out of the big string)? Or could someone explain a little of what happens in the JSON part?

Thanks for the help!

EDIT: Added photo of variationValues and asinToDimensionIndexMap

Solution

I think you are close, Manuel!

The following code will turn your scraped source into easy-to-select boxes:

import json
d = json.loads(data[0])

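JSON is a text format for storing structured data that is independent of language and platform. In other words, json.loads parses a JSON string into ordinary Python objects (dictionaries, lists, strings, numbers), so you can index into the result instead of searching the raw string.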

https://www.w3schools.com/js/js_json_intro.asp
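As a small illustration of what json.loads returns, here is a minimal sketch. The string below is a hypothetical, shortened stand-in for the scraped data; only '8M US' and 'Teal' come from the question, and '9M US' is made up for the example.

```python
import json

# Hypothetical, shortened sample; the real page data is much larger.
text = '{"variationValues": {"size_name": ["8M US", "9M US"], "color_name": ["Teal"]}}'

d = json.loads(text)                      # JSON object -> Python dict
sizes = d["variationValues"]["size_name"]  # JSON array  -> Python list

print(type(d).__name__)      # dict
print(type(sizes).__name__)  # list
print(sizes[0])              # 8M US
```

Once the string is parsed, every "box" is either a dict (accessed by key) or a list (accessed by position).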

I'm assuming the part you may find challenging is running into errors when accessing a particular "box" inside your JSON object.

Your code format looks correct, but your access within "each box" may look different.

E.g., if your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):

d['products'][0]['asinToDimentionIndexMap']
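If that access raises a KeyError or IndexError, it can help to inspect one level at a time before indexing further. The sketch below assumes the nesting described above ('products' holding a list of objects); that structure is hypothetical and may not match the real page.

```python
import json

# Hypothetical structure matching the access path above; not the real page data.
d = json.loads('{"products": [{"asinToDimentionIndexMap": {"B01KWIUH5M": [0, 0]}}]}')

print(list(d.keys()))                # ['products']
print(type(d["products"]).__name__)  # list, so index it by position

first = d["products"][0]
print(list(first.keys()))            # ['asinToDimentionIndexMap']

# dict.get returns None instead of raising KeyError when a key is absent.
index_map = first.get("asinToDimentionIndexMap")
print(index_map["B01KWIUH5M"])       # [0, 0]
```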

I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see which boxes are nested inside one another, which is precisely what you need to know to access what you want.

JSON Object Viewer
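If you would rather discover the nesting from Python than from a viewer, a small recursive search can list the path to a key. This is only a sketch: find_key_paths is a hypothetical helper, and the sample dict is a made-up miniature of the structure.

```python
def find_key_paths(obj, key, path=()):
    """Yield the path (a tuple of keys and list indexes) to every occurrence of `key`."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield path + (k,)
            yield from find_key_paths(v, key, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key_paths(v, key, path + (i,))


# Hypothetical miniature of the parsed data, for illustration only.
d = {"updateDivLists": {"full": [{"divToUpdate": "companyCompliancePolicies_feature_div"}]}}

paths = list(find_key_paths(d, "divToUpdate"))
print(paths)  # [('updateDivLists', 'full', 0, 'divToUpdate')]
```

Each element of a path is the next index to apply, in order, to the parsed object.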

For example, the following would yield "companyCompliancePolicies_feature_div":

import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']

The person who helped you before outlined a general case, but you'll need to go in and look at the structure this way to find what you're looking for.
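To tie this back to the original goal, here is a rough sketch of pulling both objects out of the raw page source and joining them. Several things are assumptions: extract_json_value and label_variants are hypothetical helpers, sample_source is a made-up stand-in for the page source (only 'B01KWIUH5M', '8M US' and 'Teal' come from the question), and the index positions are assumed to follow the key order of variationValues. The key name must match the exact spelling in the page source; the question uses both 'asinToDimentionIndexMap' and 'asinToDimensionIndexMap'.

```python
import json
import re


def extract_json_value(source, key):
    """Return the parsed JSON value that follows '"key" :' in source, or None."""
    match = re.search(rf'"{re.escape(key)}"\s*:\s*', source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever text comes after it.
    value, _end = json.JSONDecoder().raw_decode(source, match.end())
    return value


def label_variants(variation_values, index_map):
    """Map each product code to {dimension: label}.

    Assumes each index list follows the key order of variation_values.
    """
    dimensions = list(variation_values)
    return {
        asin: {dim: variation_values[dim][i] for dim, i in zip(dimensions, indexes)}
        for asin, indexes in index_map.items()
    }


# Hypothetical stand-in for the page source; "9M US", "Black" and
# "B000000000" are invented for the example.
sample_source = '''
var dataToReturn = {
  "variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
  "asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B000000000":[1,1]},
};
'''

variation_values = extract_json_value(sample_source, "variationValues")
index_map = extract_json_value(sample_source, "asinToDimensionIndexMap")
variants = label_variants(variation_values, index_map)
print(variants["B01KWIUH5M"])  # {'size_name': '8M US', 'color_name': 'Teal'}
```

Searching for the key and decoding only the value that follows it avoids having to capture the whole surrounding script with one regular expression.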
