使用Python从PowerPivot模型中提取原始数据 [英] Extracting raw data from a PowerPivot model using Python

查看:289
本文介绍了使用Python从PowerPivot模型中提取原始数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我不得不使用Python从PowerPivot模型中读取一些数据时,似乎琐碎的任务变成了一场噩梦。我相信我在过去几天中对此进行了很好的研究,但是现在我遇到了麻烦,并且希望得到Python / SSAS / ADO社区的帮助。

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I've researched this very well over the last couple of days but now I hit a brick wall and would appreciate some help from the Python/SSAS/ADO community.

基本上,我要做的只是以编程方式访问PowerPivot模型中存储的原始数据-我的想法是通过以下列出的方法之一连接到基础PowerPivot(即MS Analysis Services)引擎,列出该模型中包含的表,然后使用简单的DAX查询(如 EVALUATE(table_name)之类的数据从每个表中提取原始数据)。轻轻松松吧?嗯,也许不是。

Basically, all I want to do is programmatically access raw data stored in PowerPivot models - my idea was to connect to the underlying PowerPivot (i.e. MS Analysis Services) engine via one of the methods listed below, list the tables contained in the model, then extract the raw data from each table using a simple DAX query (something like EVALUATE (table_name)). Easy peasy, right? Well, maybe not.

如您所见,我尝试了几种不同的方法。我将尝试尽可能仔细地记录所有内容,以便那些未使用PowerPivot功能的用户可以很好地了解我想做什么。

As you can see, I've tried several different approaches. I'll try to document everything as carefully as possible so that those uninitiated in PowerPivot functionality will have a good idea of what I'd like to do.

首先,有关以编程方式访问Analysis Services引擎的一些背景知识(它表示为2005 SQL Server,但所有内容仍应适用): SQL Server数据挖掘可编程性用于Analysis Services连接的数据提供程序

First of all, some background on programmatic access to Analysis Services engine (it says 2005 SQL Server, but all of it ought to still be applicable): SQL Server Data Mining Programmability and Data providers used for Analysis Services connections.

在下面的示例中将使用的示例Excel / PowerPivot文件可以在以下位置找到:用于Excel 2010的Microsoft PowerPivot和Excel 2013示例中的PowerPivot

The sample Excel/PowerPivot file I'll be using in the example below can be found here: Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples.

另外,请注意,我使用的是Excel 2010,因此我的某些代码是特定于版本的。例如。 wb.Connections [ PowerPivot数据] .OLEDBConnection.ADOConnection 应该是 wb.Model.DataModelConnection.ModelConnection.ADOConnection 如果您使用的是Excel 2013。

Also, note that I'm using Excel 2010, so some of my code is version-specific. E.g. wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection should be wb.Model.DataModelConnection.ModelConnection.ADOConnection if you're using Excel 2013.

此问题中将使用的连接字符串基于以下信息:使用以下方法连接到PowerPivot引擎C#。此外,某些方法显然需要在数据检索之前对PowerPivot模型进行某种初始化。请参阅此处:从VBA自动化PowerPivot刷新操作

The connection string I'll be using throughout this question is based on the information found here: Connect to PowerPivot engine with C#. Additionally, some of the methods apparently require some sort of initialization of the PowerPivot model prior to data retrieval. See here: Automating PowerPivot Refresh operation from VBA.

最后,这里有几个链接表明应该可以实现(但是请注意,这些链接主要引用C#,而不是Python):

Finally, here's a couple of links showing that this should be achievable (note however, that these links mainly refer to C#, not Python):

  • Made connection to PowerPivot DataModel, how can I fill a dataset with it?
  • Connecting to PowerPivot with C#
  • 2013 C# connection to PowerPivot DataModel
  • Connecting Tableau and PowerPivot. It just works. (showing that external apps can in fact read PowerPivot model data - note that the Tableau add-in installs Interop.ADODB.dll assembly, which I guess is what it uses to access the PowerPivot data)
import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()

在这里,出现的问题是PowerPivot模型尚未初始化:

Here, it appears the problem is that the PowerPivot model has not been initialized:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.



2。使用AMO



2. Using AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = AMO.Server()
Connection.Connect(ConnString)

相同故事,服务器未运行:

Same story, "the server is not running":

ConnectionException: A connection cannot be made. Ensure that the server is running.

请注意,AMO在技术上不用于查询数据,但我将其作为可能的方式之一连接到PowerPivot模型。

Note that AMO is technically not used for querying data, but I included it as one of the potential ways of connecting to the PowerPivot model.

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()

这类似于使用python或ironpython访问mssql的最简单方法是什么?。不幸的是,这也不起作用:

This is similar to What's the simplest way to access mssql with python or ironpython?. Unfortunately, this also doesn't work:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.



4。通过adodbapi模块使用ADO



4. Using ADO via adodbapi module

import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = adodbapi.connect(ConnString)

类似于 Python和MS Access VBA之间的OLEDB / ODBC的相反工作。我得到的错误是:

Similar to Opposite Workings of OLEDB/ODBC between Python and MS Access VBA. The error I get is:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...

与ADO.NET基本上是相同的问题

This is basically the same problem as with ADO.NET above.

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)

这种方法的想法来自这篇使用VBA的博客文章:使用VBA将表或DAX查询从Power Pivot导出到CSV 。请注意,此方法使用了显式的Refresh命令来初始化模型(即服务器)。错误消息如下:

The idea for this approach came from this blog post that uses VBA: Export a table or DAX query from Power Pivot to CSV using VBA. Note that this approach uses an explicit Refresh command that initializes the model (i.e. "server"). Here's the error message:

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)

但是, ADO连接已建立:

It appears, however, that the ADO connection has been established:


  • type(Connection)返回实例

  • print(Connection)返回 Provider = MSOLAP。 5;持久性安全信息=真;初始目录= Microsoft_SQLServer_AnalysisServices;数据源= $ Embedded $; MDX兼容性= 1;安全选项= 2; ConnectTo = 11.0; MDX缺少成员模式=错误;子查询= 2;优化响应= 3;单元格错误模式= TextValue

  • type(Connection) returns instance
  • print(Connection) returns Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue

似乎问题出在ADODB.Recordset对象的创建上。

It seems the problem lies in the creation of the ADODB.Recordset object.

from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)

类似于从Python访问访问的连接[重复] 在Win32平台中使用ADO查询查询(Python配方)。不幸的是,Python吐出的错误与以上两个示例相同:

Similar to Connection to Access from Python [duplicate] and Query access using ADO in Win32 platform (Python recipe). Unfortunately, the error Python spits out is the same as in the two examples above:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)



7。通过Excel / win32com使用ADO,直接使用ADODB.Connection和模型刷新



7. Using ADO via Excel/win32com, direct use of ADODB.Connection plus model refresh

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
                     Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
                     Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
                     Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)

我希望可以初始化Excel实例,然后初始化PowerPivot模型,然后使用Excel用于ember的内部连接字符串创建连接eded PowerPivot数据(类似于如何将powerpivot数据作为表格复制到excel工作簿中?-注意连接字符串与我在其他地方使用的字符串不同)。不幸的是,这是行不通的,我的猜测是Python在一个单独的实例中启动ADODB.Connection进程(因为在执行最后三行而没有首先初始化Excel等时出现相同的错误消息):

I was hoping I could initialize an instance of Excel, then initialize the PowerPivot model, and then create a connection using the internal connection string Excel uses for embedded PowerPivot data (similar to How do you copy the powerpivot data into the excel workbook as a table? - note that the connection string is different from the one I've used elsewhere). Unfortunately, this doesn't work and my guess is that Python starts the ADODB.Connection process in a separate instance (as I get the same error message when I execute the last three rows without first initializing Excel, etc.):

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)


推荐答案

Lo瞧,我终于设法解决了这个问题-事实证明,确实可以使用Python访问Power Pivot数据!以下是我所做工作的简短摘要-您可以在此处找到更详细的描述:分析服务( SSAS)。注意:代码既没有针对效率也没有进行优化。

Lo and behold, I finally managed to crack the problem - turns out that accessing Power Pivot data using Python is indeed possible! Below's a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.


  • 安装Microsoft Power BI Desktop(带有免费的Analysis Services服务器,因此没有需要昂贵的SQL Server许可证-但是,如果您拥有适当的许可证,同样的方法显然也可以使用。)

  • 首先创建msmdsrv.ini设置文件来启动AS引擎,然后从ABF文件还原数据库(使用AMO.NET),然后使用ADOMD.NET提取数据。

这是Python代码该图说明了AS引擎+ AMO.NET的各个部分:

Here's the Python code that illustrates the AS engine + AMO.NET parts:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

数据提取部分:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

然后通过类似这样的方式提取原始数据:

The raw data are then extracted via something like this:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)

这篇关于使用Python从PowerPivot模型中提取原始数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆