Scraping data from a website that uses Power BI: retrieving data from Power BI on a website


Question

I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, finding a way to scrape Power BI is hard, because everyone wants to scrape data using or into Power BI, not from it. The closest answer I found was this question, yet it is unrelated.

First, I used Apache Tika, but I soon realized that the table data is loaded after the page itself has loaded. I need the rendered version of the page.

Therefore, I used Selenium. I wanted to start by selecting everything (sending the Ctrl+A key combination), but it doesn't work. Maybe the page's event handlers block it (I also tried removing all the event handlers using the developer tools, yet Ctrl+A still doesn't work).

I also tried to read the HTML contents, but Power BI places div elements on the screen using position:absolute, so working out where a div sits in the table (both its row and its column) is laborious.

Since Power BI uses JSON, I tried to read the data from there. However, it is so complicated that I couldn't work out the rules. It seems to store keywords in one place and refer to them by index in the table.
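
Storing keywords once and referring to them by index is usually called dictionary encoding. The sketch below only illustrates that idea on a made-up payload. The field names `dictionary` and `rows`, the helper `decode_rows`, and all values are hypothetical; they are not Power BI's actual JSON schema, which would have to be inspected in the browser's network tab.

```python
import json

# Hypothetical payload: NOT the real Power BI schema, only an illustration
# of "keywords stored once, rows refer to them by index". Values are made up.
payload = json.loads("""
{
  "dictionary": ["Wheat", "Barley", "Maize"],
  "rows": [[0, 150.5], [2, 162.0], [1, 140.25]]
}
""")


def decode_rows(data):
    """Replace the index in the first column of each row with its keyword."""
    keywords = data["dictionary"]
    return [[keywords[row[0]]] + row[1:] for row in data["rows"]]


decoded = decode_rows(payload)
print(decoded)
```

If the real payload follows a similar pattern, the same lookup step would apply once the actual location of the keyword list and the index column has been identified.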

Note: I realized that not all of the data is loaded, or even shown, at the same time. A div of class scroll-bar-part-bar acts as a scroll bar, and moving it loads/shows other parts of the data.

The code I used to read the data is as follows. As mentioned, the order of the resulting data differs from what is rendered in the browser:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")

driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
values = [child.get_attribute('title') for child in children]

I would appreciate solutions to any of the above problems. What interests me most, though, is the convention Power BI uses to store its data in JSON format.

Recommended answer

Putting the scrolling and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is already done in the question):

parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')

Then sort them by their location on the screen:

import numpy as np

# Sort primarily by y (top to bottom), then by x (left to right)
x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
index = np.lexsort((x, y))
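
To make the sort order concrete, here is a small sketch that needs neither a browser nor NumPy. It uses made-up coordinates and plain `sorted` with a `(y, x)` key instead of `np.lexsort`; this is a swapped-in equivalent for illustration, since the last key passed to `lexsort` is the primary one.

```python
# Made-up (x, y) positions for the four cells of a 2x2 table, listed in
# the scrambled order a DOM query might return them.
x = [200, 10, 10, 200]
y = [50, 50, 20, 20]

# Sort indices by y first (top to bottom), then by x (left to right).
# This matches the ordering of np.lexsort((x, y)).
index = sorted(range(len(x)), key=lambda i: (y[i], x[i]))

print(index)  # → [2, 3, 1, 0]
```

The resulting indices walk the cells in reading order: the top row from left to right, then the next row.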

To split what we have read into separate rows, this code may help:

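rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line == y[i]:
        # Same y coordinate: the element belongs to the current row
        row.append(children[i].get_attribute('title'))
    else:
        # y changed: store the finished row and start a new one
        rows.append(row)
        row = [children[i].get_attribute('title')]
        last_line = y[i]
rows.append(row)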
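
The loop above compares y coordinates for exact equality. If cells in the same visual row turn out to differ by a pixel or so, a tolerance-based variant may be more robust. The following is a minimal sketch under that assumption: the function name `group_into_rows`, the `tolerance` parameter, and the sample cells are all hypothetical and work on plain `(x, y, text)` tuples rather than Selenium elements.

```python
def group_into_rows(cells, tolerance=2):
    """Group (x, y, text) cells into rows of text.

    A cell joins the current row if its y coordinate is within `tolerance`
    pixels of the first cell of that row; otherwise it starts a new row.
    """
    rows = []
    current = []
    row_y = None
    # Walk the cells from top to bottom
    for x, y, text in sorted(cells, key=lambda c: c[1]):
        if row_y is None or abs(y - row_y) > tolerance:
            # y moved past the tolerance: close the current row, start a new one
            if current:
                rows.append(current)
            current = []
            row_y = y
        current.append((x, text))
    if current:
        rows.append(current)
    # Order each row left to right and keep only the text
    return [[text for _, text in sorted(row, key=lambda c: c[0])] for row in rows]


# Made-up cells; the second row has a one-pixel vertical offset
cells = [(200, 51, "d"), (10, 20, "a"), (10, 50, "c"), (200, 20, "b")]
table = group_into_rows(cells)
print(table)  # → [['a', 'b'], ['c', 'd']]
```

Sorting by x only after the rows have been formed avoids a pitfall of a single `(y, x)` sort: with slightly uneven y values, cells within one row could otherwise end up out of left-to-right order.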
