Scraping data from a website which uses Power BI - retrieving data from Power BI on a website

Views: 296 | Published: 2020/5/30 2:30:07 | Tags: python, selenium, web-scraping, powerbi


Question

I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, it is hard to find a way to scrape Power BI, because everyone wants to scrape using/into Power BI, not from it. The closest answer I found was this question, but it is unrelated.

First, I used Apache Tika, and soon realized that the table data is loaded only after the page itself has loaded. I need the rendered version of the page.

Therefore, I used Selenium. I wanted to Select All at the beginning (sending the Ctrl+A key combination), but it doesn't work. Maybe it is blocked by the page's event handlers (I also tried removing all the event listeners using the developer tools, yet Ctrl+A still doesn't work).

I also tried to read the HTML contents, but Power BI places div elements on the screen using position:absolute, so working out where a div sits in the table (both row and column) is laborious.

Since Power BI uses JSON, I tried to read the data from there. However, it is so complicated that I couldn't work out the rules. It seems to store keywords somewhere and refer to them by index in the table.
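The "keywords stored somewhere, referenced by index" pattern described above is a form of dictionary encoding. The sketch below shows how such a payload could be decoded in principle. The payload layout, the key names (dictionaries, columns, rows) and the values are all made up for illustration; they are an assumption, not the actual Power BI response schema, which would have to be inspected in the browser's network tab.

```python
# Hypothetical, simplified payload: NOT the real Power BI response schema.
payload = {
    "dictionaries": {"D0": ["Wheat", "Barley", "Maize"]},
    "columns": [
        {"name": "Commodity", "dictionary": "D0"},  # cells are indices into D0
        {"name": "Price", "dictionary": None},      # cells are literal values
    ],
    "rows": [[0, 152.5], [1, 140.0], [0, 155.0], [2, 171.25]],
}


def decode_rows(payload):
    """Replace dictionary indices with the strings they point to."""
    dictionaries = payload["dictionaries"]
    columns = payload["columns"]
    decoded = []
    for raw_row in payload["rows"]:
        record = {}
        for column, cell in zip(columns, raw_row):
            # Columns without a dictionary keep their literal cell value.
            lookup = dictionaries.get(column["dictionary"])
            record[column["name"]] = lookup[cell] if lookup is not None else cell
        decoded.append(record)
    return decoded


table = decode_rows(payload)
```

If the real response turns out to follow a similar idea, the same lookup step would apply once the location of the keyword lists and the index columns has been identified.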

Note: I realized that not all of the data is loaded, or even shown, at the same time. A div of class scroll-bar-part-bar acts as a scroll bar, and moving it loads/shows other parts of the data.
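Because only part of the table is present at any moment, one possible approach is to read the visible rows, scroll, and repeat until nothing new appears. The sketch below covers only the accumulation logic. The two callables, read_visible_rows and scroll_down, are hypothetical placeholders for whatever Selenium code reads the table and moves the scroll-bar-part-bar element; here they are replaced by stand-ins over made-up data so the logic can be run without a browser.

```python
def collect_all_rows(read_visible_rows, scroll_down, max_steps=100):
    """Scroll step by step, keeping each row the first time it is seen."""
    seen = set()
    collected = []
    for _ in range(max_steps):
        new_rows = 0
        for row in read_visible_rows():
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                collected.append(list(row))
                new_rows += 1
        if new_rows == 0:
            break  # nothing new appeared, so assume the end was reached
        scroll_down()
    return collected


# Stand-in for a browser: a window of 3 rows sliding over 7 rows of made-up data.
data = [[f"row {n}", n] for n in range(7)]
position = {"top": 0}


def fake_read():
    return data[position["top"]:position["top"] + 3]


def fake_scroll():
    position["top"] = min(position["top"] + 2, len(data) - 3)


all_rows = collect_all_rows(fake_read, fake_scroll)
```

One limitation of this design: rows are deduplicated by content, so two genuinely identical rows in the table would be merged into one. Including a row key such as a date column in each row avoids that.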

The code I used to read the data is as follows. As mentioned, the order of the produced data differs from what is rendered in the browser:

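from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# executable_path is the Selenium 3 way to point at chromedriver;
# later Selenium 4 releases take a Service object instead.
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")

# Open the embedded Power BI report directly rather than the page that hosts it.
driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
# Container of the table visual; every descendant is read, in DOM order.
parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')
# The cell text is read from each element's title attribute.
values = [child.get_attribute('title') for child in children]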

I would appreciate solutions to any of the above problems. The most interesting one for me, though, is the convention Power BI uses to store data in JSON format.

Recommended answer

Putting the scrolling and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

from selenium.webdriver.common.by import By

parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')

Then sort them by their on-screen location:

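import numpy as np

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
# lexsort uses the last key as the primary one: order by y (row), then by x (column).
index = np.lexsort((x, y))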
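The key order passed to np.lexsort is easy to get backwards, so here is a small browser-free illustration. The titles and pixel coordinates are made up; they stand in for the title attributes and location values read from the page.

```python
import numpy as np

# Made-up pixel positions for six cells laid out as two rows of three.
titles = ["b2", "a1", "c1", "a2", "b1", "c2"]
x = [110, 10, 210, 10, 110, 210]
y = [40, 20, 20, 40, 20, 40]

# np.lexsort treats the LAST key as the primary one: sort by y, then by x.
index = np.lexsort((x, y))
reading_order = [titles[i] for i in index]
```

Passing (y, x) instead would sort column by column, which is the wrong order for reading a table row by row.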

To split the sorted elements into table rows, this code may help:

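rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line == y[i]:
        # Same y coordinate: the element belongs to the current row.
        row.append(children[i].get_attribute('title'))
    else:
        # The y coordinate changed: store the finished row and start a new one.
        rows.append(row)
        row = [children[i].get_attribute('title')]
        last_line = y[i]
rows.append(row)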
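The loop above starts a new row whenever y changes at all, and it keeps every descendant element, including those without a title. As a hedged variant, the sketch below groups cells whose y coordinates differ by no more than a small tolerance and drops elements with an empty title. The function name group_into_rows, the tolerance of 2 pixels, and the sample cells are assumptions for illustration, not part of the original answer.

```python
def group_into_rows(cells, tolerance=2):
    """Group (x, y, title) tuples into rows, top to bottom and left to right."""
    # Drop elements without a title, then order the rest from top to bottom.
    ordered = sorted((c for c in cells if c[2]), key=lambda c: c[1])
    groups = []
    row_y = None
    for cell in ordered:
        if row_y is None or abs(cell[1] - row_y) > tolerance:
            groups.append([])  # y moved by more than the tolerance: new row
            row_y = cell[1]
        groups[-1].append(cell)
    # Within each row, order the cells by x and keep only the titles.
    return [[title for _, _, title in sorted(group)] for group in groups]


# Made-up cells: one value is off by a pixel and one element has no title.
cells = [
    (110, 41, "152.5"),
    (10, 20, "Date"),
    (110, 20, "Price"),
    (10, 40, "2020-05-01"),
    (60, 30, ""),
]
table_rows = group_into_rows(cells)
```

With Selenium, the input could be built from the elements already read, for example as a list of (child.location['x'], child.location['y'], child.get_attribute('title')) tuples.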
