Constraint database


Problem description

I know the intuition behind constraint programming, but so far I have never really programmed with a constraint solver. Even so, I think being able to achieve what we would define as consistent data is a different kind of situation.

Context:

We have a set of rules to implement on an ETL server. These rules are either:


  • applied within a single row;

  • applied across rows, within a single run;

  • holding identically between runs (the constraint should hold over all data, or over the last n runs).

The third case is different from the second, as it holds whenever the second case holds, but over a well-defined number of runs. It might be applied to one single run (one file), or across runs (1 to n previous files, or all files).

Technically, as we conceived the ETL, it has no memory between two runs, i.e. between two files (but this is to be rethought).

For the application of the third kind of rule, the ETL needs to have memory (I think we would end up backing up data in the ETL), or the whole database would have to be re-checked repeatedly (a job) after some time window; so data ending up in the database does not necessarily satisfy the third kind of rule at the time it is stored.

Example:

While we have continuously flowing data, we apply constraints so as to have a whole constrained database. The next day we may receive a backup, or corrections for, say, one month of data; for this time window we would like the constraints to be satisfied only within this run (this time window), without worrying about the whole database. For future runs, all data should be constrained as before, without worrying about past data. You can imagine other rules that would fit temporal logic.

For now, we have only implemented the first kind of rules. The way I thought of it is to have a minified database (of any kind: MySQL, PostgreSQL, MongoDB, ...) that backs up all data (only the constrained columns, probably with hashed values), with flags describing consistency according to the earlier kinds of rules.
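
As a rough illustration of that idea (purely an assumed schema, not something from the question), the minified backup could be a single table keyed by a hash of the constrained columns, plus one flag per rule:

import hashlib
import sqlite3

import pandas as pd

def row_hash(row, columns):
    """Stable hash over the constrained columns only."""
    payload = '|'.join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()

# Hypothetical incoming batch; only ID and column1 are constrained.
batch = pd.DataFrame({'ID': [1, 2], 'column1': [5, -1]})
constrained = ['ID', 'column1']

con = sqlite3.connect('minified_backup.db')
con.execute("""CREATE TABLE IF NOT EXISTS backup (
                   hash     TEXT PRIMARY KEY,
                   rule1_ok INTEGER,
                   rule2_ok INTEGER)""")
for _, row in batch.iterrows():
    con.execute('INSERT OR REPLACE INTO backup VALUES (?, ?, ?)',
                (row_hash(row, constrained),
                 1,                               # rule1 flag: placeholder here
                 int(row['column1'] > 0)))        # rule2 flag: column1 > 0
con.commit()
con.close()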

Question: are there any solutions / design alternatives that would ease this process?

To illustrate, in a made-up programming language, an example of a set of rules and the actions that follow:

run1 : WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
       BACK-UP 
       FLAG tableA.rule1
AFTER run1 : LOG ('WARN')

run2 : WHEN tableA.column1 > 0
       DO NOT BACK-UP 
       FLAG tableA.rule2
AFTER run2 : LOG ('ERROR')
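
To make the pseudo-rules concrete, here is a minimal pandas sketch of the same two rules; tableA, tableB and the flag columns are hypothetical stand-ins for the names above, not part of any real system:

import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('etl-rules')

# Hypothetical tables matching the pseudo-code above.
tableA = pd.DataFrame({'ID': [1, 2, 3], 'column1': [5, -1, 9]})
tableB = pd.DataFrame({'ID': [1, 2, 3], 'column2': [3, 7, 2]})

# run1: cross-table condition -> back up the matching rows, flag, warn.
merged = tableA.merge(tableB, on='ID')
hit_ids = merged.loc[merged['column1'] > merged['column2'], 'ID']
tableA['rule1'] = tableA['ID'].isin(hit_ids)
backup = tableA[tableA['rule1']].copy()          # BACK-UP
if not backup.empty:
    log.warning('rule1 fired for IDs %s', list(backup['ID']))

# run2: single-row condition -> flag only, no back-up, error.
tableA['rule2'] = tableA['column1'] > 0
if tableA['rule2'].any():
    log.error('rule2 fired for IDs %s',
              list(tableA.loc[tableA['rule2'], 'ID']))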

Note: while constraint programming is, in theory, a paradigm for solving combinatorial problems, and in practice can speed up problem development and execution, I think this is different from a constraint-solving problem. The primary purpose here is not to optimize constraints before resolution, and probably not even to limit data domains; the main concern is to apply rules on data reception and execute some basic actions (reject a line, accept a line, logging, ...).

I really hope this is not too broad a question, and that it is in the right place.

Answer

I found a sophisticated solution that achieves more than I had in mind regarding checking data consistency. Apparently this is what we would call test-driven data analysis.

So with this implementation we are now bound to Python and pandas, but fortunately not only: we can even check data consistency in MySQL, PostgreSQL, ... tables.

The plus I did not think about is that we can infer rules from sample data, which could be helpful for setting the rules up. This is why there are both tdda.constraints.discover_df and tdda.constraints.verify_df.
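
For illustration, a minimal sketch of that discover/verify round trip, assuming a pandas DataFrame and the tdda package as documented (the file name and data are invented):

import pandas as pd
from tdda.constraints import discover_df, verify_df

df = pd.DataFrame({'ID': [1, 2, 3], 'column1': [10, 20, 30]})

# Infer constraints (types, min/max, nullability, uniqueness, ...) from sample data.
constraints = discover_df(df)
with open('tableA_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Later, verify a new batch against the saved constraints file.
verification = verify_df(df, 'tableA_constraints.tdda')
print(verification)                 # per-field pass/fail summary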

As far as I have read, it does not propose a solution for checking a (weaker) consistency over the last (n) files, something I thought of as batch-file consistency, which would only ensure a rule's satisfaction for some set of runs (the last n runs) and not for all data. It acts only on single files, so it needs higher-level wiring to be able to condition on (n) files arriving successively; see the sketch below.
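
A rough sketch of what that wiring could look like (entirely invented names; it simply re-verifies the concatenation of the last n files as one dataset):

from collections import deque

import pandas as pd
from tdda.constraints import verify_df

WINDOW = 3                        # the last n runs the weaker consistency covers
recent_files = deque(maxlen=WINDOW)

def on_new_file(path, constraints_path='tableA_constraints.tdda'):
    """Verify the last n files together, as one sliding-window dataset."""
    recent_files.append(path)
    window_df = pd.concat((pd.read_csv(p) for p in recent_files),
                          ignore_index=True)
    verification = verify_df(window_df, constraints_path)
    return verification.failures == 0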

More:
https://tdda.readthedocs.io/en/latest/constraints.html#module-tdda.constraints

assertCSVFilesCorrect checks a set of files in a directory; the same is possible for pandas DataFrames, etc.
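
As a hedged sketch of how that could be used (the method name comes from the sentence above; the exact argument layout is my assumption from the tdda reference-test docs, and the paths are invented):

import unittest
from tdda.referencetest import ReferenceTestCase

class TestEtlOutputs(ReferenceTestCase):
    def test_output_files(self):
        # Compare freshly generated CSVs against known-good reference copies.
        self.assertCSVFilesCorrect(
            ['out/run1.csv', 'out/run2.csv'],    # actual files (hypothetical)
            ['ref/run1.csv', 'ref/run2.csv'])    # reference files

if __name__ == '__main__':
    unittest.main()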

From the official documentation:


The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relational databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records.

PS: I am still open to other solutions; I imagine this is a use case for any ETL solution.

I have also set up a bounty to further enrich the answers.
