Refactoring large data object

Question

What are common strategies for refactoring large "state-only" objects?

I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.

Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.

Currently we have something like the following, which uses a builder pattern approach to build up the contents of the record and enforce the known dependencies (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the below into an immutable form using a .build() type method.

final class OneMinuteEstimate {

  enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
  EstimateState state = EstimateState.INFANT; 

  // "header" stuff
  DateTime estimatedAtTime = null;
  DateTime stamp = null;
  EntityId id = null;

  // independent fields
  int status1 = -1;
  ...

  // dependent/complex fields...
  ... goes on for 40+ more fields... 

  void setHeaderFields(...)
  {
     if (!EstimateState.INFANT.equals(state)) {
        throw new IllegalStateException("Must be in INFANT state to set header");
     }

     ... 
  }

}
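
For readers who want something that compiles, here is a minimal sketch of the same idea: a builder whose setters check a stage enum and whose build() produces an immutable value. It is cut down to two stages, and every name in it (EstimateBuilder, Estimate, setIndependentFields, the String id, java.time.Instant in place of DateTime) is an assumption made for illustration, not the real class.

```java
import java.time.Instant;

// Hypothetical, cut-down stand-in for OneMinuteEstimate: field names and
// stages are invented for illustration only.
final class EstimateBuilder {
    enum Stage { INFANT, HEADER, INDEPENDENT }

    /** Immutable form handed out by build(). */
    static final class Estimate {
        final String id;
        final Instant stamp;
        final int status1;

        Estimate(String id, Instant stamp, int status1) {
            this.id = id;
            this.stamp = stamp;
            this.status1 = status1;
        }
    }

    private Stage stage = Stage.INFANT;
    private String id;
    private Instant stamp;
    private int status1 = -1;

    EstimateBuilder setHeaderFields(String id, Instant stamp) {
        require(Stage.INFANT, "set header");
        this.id = id;
        this.stamp = stamp;
        stage = Stage.HEADER;
        return this;
    }

    EstimateBuilder setIndependentFields(int status1) {
        require(Stage.HEADER, "set independent fields");
        this.status1 = status1;
        stage = Stage.INDEPENDENT;
        return this;
    }

    Estimate build() {
        require(Stage.INDEPENDENT, "build");
        return new Estimate(id, stamp, status1);
    }

    // Single place that enforces the ordering, so each setter stays short.
    private void require(Stage expected, String action) {
        if (stage != expected) {
            throw new IllegalStateException(
                "Must be in " + expected + " state to " + action + ", but was " + stage);
        }
    }
}
```

Calling the setters out of order, for example setIndependentFields before setHeaderFields, throws IllegalStateException, which is the "check against programmer error" described above.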

Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.

Problems:


  • It's a giant class, with way too many fields.
  • There is very little behavior encoded in the class; it's mostly a holder for data fields.
  • Maintaining the build() method is extremely cumbersome.
  • It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
  • There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
  • While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.

Out of the box ideas entertained, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get ahold of. Fair point, and we're working on that, but I think that's a side-effect of an R&D environment with a lot of contributors, and a lot of concurrent hypotheses in play.

(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)

Recommended answer

I had much of the same problem you did.

At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.

You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.

To put it simply, what I did was I designed a simple domain specific rule language for my system.

Very crude, contrived example:

D = 7
C = A + B
B = A / 5
A = 10
RULE 1: IF (C < 10) ALERT "C is less than 10"
RULE 2: IF (C > 5) ALERT "C is greater than 5"
RULE 3: IF (D > 10) ALERT "D is greater than 10"
MODULE 1: RULE 1
MODULE 2: RULE 3
MODULE 3: RULE 1, RULE 2

First, this is not representative of my syntax.

But you can see from the Modules that it is three simple rules.

The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.

So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:

public void module_1() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
}

Whereas if I created Module 2, all I would get is:

public void module_2() {
    int d = 7;
    if (d > 10) {
        alert("D is greater than 10");
    }
}


In Module 3 you see the "free" reuse:

public void module_3() {
    int a = 10;
    int b = a / 5;
    int c = a + b;
    if (c < 10) {
        alert("C is less than 10");
    }
    if (c > 5) {
        alert("C is greater than 5");
    }
}

So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.
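
As a rough sketch of what "shaking the tree" could look like, the snippet below walks everything reachable from a module and keeps only that. The READS table is transcribed by hand from the contrived rule base above (a real tool would derive it by parsing the rules), and the names TreeShaker, READS and keep are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TreeShaker {
    // What each name reads directly. Hand-transcribed from the contrived
    // example; an assumption standing in for a real rule parser.
    static final Map<String, List<String>> READS = Map.of(
        "A", List.of(),
        "B", List.of("A"),
        "C", List.of("A", "B"),
        "D", List.of(),
        "RULE 1", List.of("C"),
        "RULE 2", List.of("C"),
        "RULE 3", List.of("D"),
        "MODULE 1", List.of("RULE 1"),
        "MODULE 2", List.of("RULE 3"),
        "MODULE 3", List.of("RULE 1", "RULE 2"));

    /** Everything reachable from root: what is "left hanging" after the shake. */
    static Set<String> keep(String root) {
        Set<String> kept = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String name = pending.pop();
            if (kept.add(name)) {          // first visit only
                for (String dep : READS.get(name)) {
                    pending.push(dep);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (String module : List.of("MODULE 1", "MODULE 2", "MODULE 3")) {
            System.out.println(module + " keeps " + keep(module));
        }
    }
}
```

MODULE 2 ends up with only D and RULE 3, while MODULE 1 and MODULE 3 both pull in A, B and C without anyone listing those by hand.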

My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.

Simple topological sorting handled the dependency graph for me.
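
For anyone who has not written one, a depth-first topological sort is only a few lines. This is a generic sketch, not the code from my system; TopoSort, order and the reads map (each variable mapped to the variables its equation uses) are names assumed for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopoSort {
    /** Orders the names so that each one appears after everything it reads. */
    static List<String> order(Map<String, List<String>> reads) {
        List<String> sorted = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : reads.keySet()) {
            visit(name, reads, done, inProgress, sorted);
        }
        return sorted;
    }

    private static void visit(String name, Map<String, List<String>> reads,
                              Set<String> done, Set<String> inProgress,
                              List<String> sorted) {
        if (done.contains(name)) {
            return;
        }
        // Seeing a name again while it is still being expanded means a cycle.
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Circular definition involving " + name);
        }
        for (String dep : reads.getOrDefault(name, List.of())) {
            visit(dep, reads, done, inProgress, sorted);
        }
        inProgress.remove(name);
        done.add(name);
        sorted.add(name);   // emitted only after all of its inputs
    }

    public static void main(String[] args) {
        // Same declaration order as the contrived rule base: D, C, B, A.
        Map<String, List<String>> reads = new LinkedHashMap<>();
        reads.put("D", List.of());
        reads.put("C", List.of("A", "B"));
        reads.put("B", List.of("A"));
        reads.put("A", List.of());
        System.out.println(order(reads)); // prints [D, A, B, C]
    }
}
```

The rule base can list equations in any order; the sort puts A before B and B before C, which is the order the generated module_1 and module_3 need.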

So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.

What is also nice is that you can change an equation and not worry so much about the side effects. For example, if I change C to C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.

With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
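
An orphan check can be as small as the sketch below: collect everything the rules need, directly or indirectly, and report whatever is defined but never needed. It uses the contrived rule base after the C = A / 2 change; OrphanFinder, orphans and the two input shapes are assumptions for the example, not the tooling I actually had.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class OrphanFinder {
    /**
     * @param definitions each variable mapped to the variables its equation reads
     * @param ruleInputs  variables that some rule reads directly
     * @return defined variables that no rule needs, directly or indirectly
     */
    static Set<String> orphans(Map<String, List<String>> definitions,
                               Collection<String> ruleInputs) {
        Set<String> needed = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(ruleInputs);
        while (!pending.isEmpty()) {
            String name = pending.pop();
            if (needed.add(name)) {
                pending.addAll(definitions.getOrDefault(name, List.of()));
            }
        }
        Set<String> orphaned = new TreeSet<>(definitions.keySet());
        orphaned.removeAll(needed);
        return orphaned;
    }

    public static void main(String[] args) {
        // The contrived rule base after changing C to "C = A / 2".
        Map<String, List<String>> definitions = Map.of(
            "A", List.of(),
            "B", List.of("A"),
            "C", List.of("A"),
            "D", List.of());
        // RULE 1 and RULE 2 read C, RULE 3 reads D.
        System.out.println(orphans(definitions, List.of("C", "C", "D"))); // prints [B]
    }
}
```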

By generating source code, it's going to run as fast as you want.

In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.

I was even able to do some simple peephole optimization and eliminate variables.

It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full-boat Yacc or ANTLR on it if you don't want to. I'll put in a plug for S-expressions: no grammar needed, brain-dead parsing.
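
To give a feel for how little parsing S-expressions need, here is a minimal reader, assuming whitespace-separated atoms and no quoted strings (so an alert message containing spaces would need extra handling). The class name SExpr and the prefix notation in main are invented for the example; they are not my actual rule syntax.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class SExpr {
    /** Parses one expression: atoms become Strings, parenthesised forms become Lists. */
    static Object parse(String source) {
        // Pad the parentheses so a plain whitespace split is the whole tokenizer.
        String spaced = source.replace("(", " ( ").replace(")", " ) ").trim();
        if (spaced.isEmpty()) {
            throw new IllegalArgumentException("Empty input");
        }
        Deque<String> tokens = new ArrayDeque<>(Arrays.asList(spaced.split("\\s+")));
        Object result = read(tokens);
        if (!tokens.isEmpty()) {
            throw new IllegalArgumentException("Trailing input: " + tokens.peek());
        }
        return result;
    }

    private static Object read(Deque<String> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        String token = tokens.poll();
        if (token.equals(")")) {
            throw new IllegalArgumentException("Unexpected )");
        }
        if (!token.equals("(")) {
            return token;                  // plain atom
        }
        List<Object> form = new ArrayList<>();
        while (!")".equals(tokens.peek())) {
            form.add(read(tokens));        // throws if input ends before ')'
        }
        tokens.poll();                     // consume ')'
        return form;
    }

    public static void main(String[] args) {
        System.out.println(parse("(= c (+ a b))")); // prints [=, c, [+, a, b]]
    }
}
```

Once a rule is a nested list, pulling out the variables it reads is a straightforward recursive walk over the atoms.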

Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.

You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.

Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).
