How to share data across an organization

Question

What are some good ways for an organization to share key data across many departments and applications?

To give an example, let's say there is one primary application and database to manage customer data. There are ten other applications and databases in the organization that read that data and relate it to their own data. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying information, web services, etc.

Are there any other good approaches for sharing data? And how do your approaches compare to the ones above with respect to concerns like:

  • duplicate data
  • error-prone data synchronization processes
  • tight vs. loose coupling (reducing dependencies/fragility/test coordination)
  • architectural simplification
  • security
  • performance
  • well-defined interfaces
  • other relevant concerns?

Keep in mind that the shared customer data is used in many ways, from simple single-record queries to complex multi-predicate, multi-sort joins with other organization data stored in different databases.

Thanks for your suggestions and advice...

Recommended answer

I'm sure you saw this coming: "It depends".

It depends on everything. And the solution for sharing customer data with department A may be completely different from the one for sharing customer data with department B.

My favorite concept that has risen up over the years is "Eventual Consistency". The term came from Amazon talking about distributed systems.

The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.

For example, when a customer record gets updated on system A, system B's copy of the customer data is now stale and no longer matches. But "eventually" the record from A will be sent to B through some process. So, eventually, the two instances will match.
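To make the A-to-B example concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: the `CustomerStore` and `ReplicatedPair` classes, the in-memory dictionaries standing in for the two databases, and the `sync()` method standing in for "some process". It is not a real replication mechanism, just the shape of one.

```python
from collections import deque


class CustomerStore:
    """A toy in-memory stand-in for one system's customer table."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # customer_id -> dict of fields

    def get(self, customer_id):
        return self.rows.get(customer_id)


class ReplicatedPair:
    """System A is the source of truth; B receives changes later via a queue."""

    def __init__(self):
        self.a = CustomerStore("A")
        self.b = CustomerStore("B")
        self.pending = deque()  # changes written to A but not yet applied to B

    def update_on_a(self, customer_id, **fields):
        # The write to A completes immediately; B is not touched here.
        row = dict(self.a.rows.get(customer_id, {}))
        row.update(fields)
        self.a.rows[customer_id] = row
        self.pending.append((customer_id, dict(row)))

    def sync(self):
        # "Some process" that eventually carries A's changes over to B.
        applied = 0
        while self.pending:
            customer_id, row = self.pending.popleft()
            self.b.rows[customer_id] = row
            applied += 1
        return applied

    def is_consistent(self):
        return self.a.rows == self.b.rows


pair = ReplicatedPair()
pair.update_on_a(42, name="Acme Ltd", city="Leeds")
stale = not pair.is_consistent()   # B has not seen the change yet
pair.sync()
caught_up = pair.is_consistent()   # the two copies match again
```

Between `update_on_a()` and `sync()`, B is simply behind. How long that window may be is the design question the rest of this answer is about.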

When you work with a single system, you don't have "EC"; rather, you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.

The more your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse (DW) used by sales. They use the DW to run their daily reports, but they don't run their reports until the early morning, and they always look at "yesterday's" (or earlier) data. So there's no real-time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the day's transactions and activities en masse in a large, single update operation.
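A close-of-business load like that could be sketched as follows. This is only an illustration under assumptions: the `orders` and `orders_fact` tables, their columns, and the `load_day_into_warehouse` function are all invented here, and two in-memory SQLite databases stand in for the operational system and the warehouse.

```python
import sqlite3
from datetime import date, timedelta


def load_day_into_warehouse(ops, dw, day):
    """Copy one business day's orders from the operational DB to the warehouse.

    The rows are written as one bulk insert inside a single warehouse
    transaction. INSERT OR REPLACE makes a re-run of the same day harmless.
    """
    rows = ops.execute(
        "SELECT id, customer_id, amount, order_date FROM orders "
        "WHERE order_date = ?",
        (day.isoformat(),),
    ).fetchall()
    with dw:  # commits on success, rolls back if an exception is raised
        dw.executemany(
            "INSERT OR REPLACE INTO orders_fact "
            "(id, customer_id, amount, order_date) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical schemas, created in memory purely for the demonstration.
ops = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")
ops.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "amount REAL, order_date TEXT)"
)
dw.execute(
    "CREATE TABLE orders_fact (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "amount REAL, order_date TEXT)"
)

today = date.today()
yesterday = today - timedelta(days=1)
ops.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 42, 19.99, yesterday.isoformat()), (2, 7, 5.00, today.isoformat())],
)
ops.commit()

loaded = load_day_into_warehouse(ops, dw, yesterday)
```

The warehouse only ever sees complete days, and the operational database is queried once per day for this purpose rather than continuously.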

You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, and no worry that a report's data is going to change in the middle of accumulating a statistic because the report made two separate queries to the live database. There's no need for high-detail chatter to suck up network and CPU capacity during the day.

Now, that's an extreme, simplified, and very coarse example of EC.

But consider a large system like Google. As consumers of Search, we have no idea when, or how long it takes, for a result that Google harvests to show up on a search page. 1ms? 1s? 10s? 10hrs? It's easy to imagine how, if you're hitting Google's West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by and large, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.

Consider email. A wants to send a message to B, but in the process the message is routed through systems C, D, and E. Each system accepts the message, assumes complete responsibility for it, and then hands it off to the next. The sender sees the email go on its way. The receiver doesn't really miss it because they don't necessarily know it's coming. So there is a big window of time that the message can take to move through the system without anyone concerned knowing or caring how fast it is.

On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"

Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B's inbox.

These delays, the acceptance of stale data, whether it's a day old or 1-5s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.

This is true down to the cores in your CPU. Modern multi-core, multi-threaded applications running on the same system can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data that is potentially inconsistent across threads, then happy day, it zips along. If not, you need to pay special attention to ensure your data is completely consistent, using techniques like volatile memory qualifiers, locking constructs, etc. All of which, in their way, cost performance.
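As a small illustration of the locking option, here is a sketch in Python. The `SharedCounter` class and the `run` helper are made up for this example. Python has no `volatile` qualifier, so only the lock-based approach is shown.

```python
import threading


class SharedCounter:
    """A counter that several threads may update; the lock serializes writes."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.value += 1


def run(workers=4, per_worker=10_000):
    """Start `workers` threads that each increment the counter `per_worker` times."""
    counter = SharedCounter()

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run()
```

Every increment pays for acquiring and releasing the lock. That is the performance cost of insisting that all threads see one fully consistent value.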

So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared, and how they are shared. It tells you what protocols and techniques are available to move the data, and how much it will cost in terms of processing to perform the transfer. Replication, load balancing, data shares, etc. are all based on this concept.

Edit, in response to the first comment.

Correct, exactly. The game here is, for example: if B can't change customer data, then what is the harm when the customer data changes on A? Can you "risk" it being out of date for a short time? Perhaps your customer data comes in slowly enough that you can replicate it from A to B immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1s). Even so, it would be "out of transaction" with the original change, and so there's a small window where A would have data that B does not.

Now the mind really starts spinning. What happens during that 1s of "lag"? What's the worst possible scenario? And can you engineer around it? If you can engineer around a 1s lag, you may be able to engineer around a 5s, 1m, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. It's hard to imagine anything more being necessary than simply a Customer ID and perhaps a name. Just something to grossly identify who the order is for while it's being assembled.

The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system that is perhaps more current with, especially, shipping information. So in the end the picking system needs hardly any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing later. As long as the Customer ID is correct (which will never change anyway) and the name is correct (which changes so rarely it's not worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
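Embedding the customer data in the pick order might look something like this sketch. The `CustomerSnapshot` and `PickOrder` types, the `create_pick_order` function, and the shape of the customer record are all assumptions made for illustration, not anything from a real picking system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerSnapshot:
    """The only customer fields the picking system keeps, copied at creation."""
    customer_id: int
    name: str


@dataclass
class PickOrder:
    order_id: int
    customer: CustomerSnapshot
    items: list  # (sku, quantity) pairs

    def pick_slip(self):
        header = (
            f"Order {self.order_id} for {self.customer.name} "
            f"(#{self.customer.customer_id})"
        )
        lines = [header] + [f"  {qty} x {sku}" for sku, qty in self.items]
        return "\n".join(lines)


def create_pick_order(order_id, customer_record, items):
    # Copy just the ID and name out of the master record; nothing else is
    # replicated, and the copy is never synchronized afterwards.
    snapshot = CustomerSnapshot(customer_record["id"], customer_record["name"])
    return PickOrder(order_id, snapshot, list(items))


master = {"id": 42, "name": "Acme Ltd", "phone": "555-0100", "credit_limit": 5000}
order = create_pick_order(1001, master, [("WIDGET-1", 3), ("BOLT-9", 12)])

# A later change on the master system does not touch the existing pick order.
master["phone"] = "555-0199"
slip = order.pick_slip()
```

The pick order carries its own copy of the two fields it needs, so the picking system never has to ask the customer system anything once the order exists.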

The trick is the mindset of breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they're from the relational data modeling world. And with good reason: it should be considered with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying it wholesale now. So you may as well be smarter about it.

All this can be mitigated through solid procedures and a thorough understanding of workflow. Identify the risks and work up policies and procedures to handle them.

But the hard part is breaking the chain to the central DB at the beginning, and instructing folks that they can't "have it all" like they may expect when you have a single, central, perfect store of information.
