计算UID可能性的大小 [英] Computing the size of UID possibilities

查看:82
本文介绍了计算UID可能性的大小的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

每个DICOM规范都通过以下方式定义UID: 9.1 UID编码规则.换句话说,以下是有效的DICOM UID:

Per DICOM specification, a UID is defined by: 9.1 UID Encoding Rules. In other words the following are valid DICOM UIDs:

  • "1.2.3.4.5"
  • "1.3.6.1.4.35045.103501438824148998807202626810810788788999"
  • "1.2.826.0.1.3680043.2.1143.5028470438645158236649541857909059554"

以下是非法的DICOM UID:

while the following are illegal DICOM UIDs:

  • .1.2.3.4.5"
  • "1..2.3.4.5"
  • "1.2.3.4.5."
  • "1.2.3.4.05"
  • "12345"
  • "1.2.826.0.1.3680043.2.1143.50284704386451582366495418579090595540"

因此,我知道该字符串最多为64个字节,并且应与以下正则表达式[0-9\.]+相匹配.但是,此正则表达式实际上是一个超集,因为它的可能性远小于(10+1)^64 (=4457915684525902395869512133369841539490161434991526715513934826241L).

Therefore I know that the string is at most 64 bytes, and should match the following regex [0-9\.]+. However this regex is really a superset, since there are a lot less than (10+1)^64 (=4457915684525902395869512133369841539490161434991526715513934826241L) possibilities.

如何精确计算遵守DICOM UID规则的可能性?

How would one computes precisely the number of possibilities to respect the DICOM UID rules ?

读取org root/后缀规则明确表明我至少需要一个点('.').在这种情况下,组合至少为3个字节(字符),格式为[0-9].[0-9].在这种情况下,长度为3的UID有10x10=100个可能性.

Reading the org root / suffix rule clearly indicates that I need at least one dot ('.'). In which case the combination is at least 3 bytes (char) in the form: [0-9].[0-9]. In which case there are 10x10=100 possibilities for UID of length 3.

看第一个答案,似乎有些不清楚:

Looking at the first answer, there seems to be something unclear about:

每个部分的第一位数字不得为零,除非 组件是一个数字.

The first digit of each component shall not be zero unless the component is a single digit.

这是什么意思:

  • "0.0"有效
  • "00.0"或"1.01"无效

因此,我想说一个正确的表达是:

Thus I would say a proper expression would be:

(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+

使用简单的C代码,我可以找到:

Using a simple C code, I could find:

  • f(0)= 0
  • f(1)= 0
  • f(2)= 0
  • f(3)= 100
  • f(4)= 1800
  • f(5)= 27100
  • f(6)= 369000
  • f(7)= 4753000
  • f(8)= 59049000

对根UID部分的验证不在此问题的范围内.第二个验证步骤可以避免拒绝某些无法注册的OID(例如,有人提到对第一弧和第二弧的限制).为简单起见,我们将接受所有可能的(有效)根UID.

The validation of the Root UID part is outside the scope of this question. A second validation step could take care of rejecting some OID that cannot possibly be registered (some people mention restriction on first and second arc for example). For simplicity we'll accept all possible (valid) Root UID.

推荐答案

虽然我的其他答案很好地照顾了这个特定的应用程序,但这是一种更通用的方法.它会处理您有不同的正则表达式描述所讨论语言的情况.它也允许相当长的字符串长度,因为它只需要O(log n )算术运算即可计算长度为 n 的字符串的组合数.在这种情况下,字符串的数量增长如此之快,以至于这些算术运算的成本将急剧增长,但是在其他情况下,情况可能并非如此.

While my other answer takes good care of this specific application, here is a more generic approach. It takes care of situations where you have a different regular expression describing the language in question. It also allows for considerably longer string lengths, since it only requires O(log n) arithmetic operations to compute the number of combinations for strings of length up to n. In this case the number of strings grows so quickly that the cost of these arithmetic operations will grow dramatically, but that may not be the case for other, otherwise similar situations.

从有关您的语言的正则表达式描述开始.将正则表达式转换为有限状态自动机.在您的情况下,正则表达式可以表示为

Start with a regular expression description of your language in question. Translate that regular expression into a finite state automaton. In your case the regular expression can be given as

(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+

自动机可能看起来像这样:

The automaton could look like this:

此自动机通常包含ε转换(即不与任何输入字符相对应的状态转换).删除那些,使一个过渡对应于输入的一个字符.然后将ε过渡添加到接受状态.如果接受状态还有其他传出转换,请不要向其中添加ε循环,而是将ε转换添加到没有传出边缘的接受状态,然后将循环添加到该状态.这可以看作是在输入的末尾用ε填充输入,而中间不允许ε填充.综上所述,此转换可确保准确执行 n 个状态转换对应于处理 n 个字符或更少的输入.修改后的自动机可能看起来像这样:

This automaton usually contains ε-transitions (i.e. state transitions which do not correspond to any input character). Remove those, so that one transition corresponds to one character of input. Then add an ε-transition to the accepting state(s). If the accepting states have other outgoing transitions, don't add ε-loops to them, but instead add an ε-transition to an accepting state with no outgoing edges and then add the loop to that. This can be seen as padding the input with ε at its end, without allowing ε in the middle. Taken together, this transformation ensures that performing exactly n state transitions corresponds to processing an input of n characters or less. The modified automaton might look like this:

请注意,根据正则表达式构造第一个自动机消除ε过渡可以自动执行(也许甚至只需一步即可 .产生的自动机可能比我在这里构建的自动机更复杂手动,但是原理是相同的.

Note that both the construction of the first automaton from the regular expression and the elimination of ε-transitions can be performed automatically (and perhaps even in a single step. The resulting automata might be more complicated than what I constructed here manually, but the principle is the same.

您不必使自动机确定性,因为对于每种组合源状态和输入字符的组合只有一个目标状态.在我的手动构建工具中也不是这种情况.但是您必须确保每个完整的输入都只有一条可能的路径进入接受状态,因为实际上您将在计算路径. 使自动机具有确定性也会确保这种较弱的属性,因此除非您能确保没有这个的独特路径.在我的示例中,每个组件的长度明确规定了要使用的路径,因此我没有确定性.但是我在本文的结尾处提供了一个确定性方法的例子.

You don't have to make the automaton deterministic in the sense that for every combination of source state and input character there is only one target state. That's not the case in my manually constructed one either. But you have to make sure that every complete input has only one possible path to the accepting state, since you'll essentially be counting paths. Making the automaton deterministic would ensure this weaker property, too, so go for that unless you can ensure unique paths without this. In my example the length of each component clearly dictates which path to use, so I didn't make it deterministic. But I've included an example with a deterministic approach at the end of this post.

接下来,写下过渡矩阵.将行和列与您的状态相关联(在我的示例中,按 a,b,c,d,e,f 的顺序).对于自动机中的每个箭头,在与该箭头的源状态关联的列和与该箭头的目标状态关联的行中,写入该箭头的标签中包含的字符数.

Next, write down the transition matrix. Associate the rows and columns with your states (in order a, b, c, d, e, f in my example). For each arrow in your automaton, write the number of characters included in the label of that arrow in the column associated with the source state and the row associated with the target state of that arrow.

⎛ 0  0  0  0  0  0⎞
⎜ 9 10  0  0  0  0⎟
⎜10 10  0 10 10  0⎟
⎜ 0  0  1  0  0  0⎟
⎜ 0  0  0  9 10  0⎟
⎝ 0  0  0 10 10  1⎠

从该矩阵读取结果

现在将这个矩阵与列向量一起应用一次具有以下含义:如果在输入向量中编码了到达给定状态的可能方式的数目,则输出向量将为您提供稍后过渡的方式数目.取该矩阵的64次幂,集中在第一列上(因为ste start情况被编码为(1,0,0,0,0,0),这意味着只有一种以开始状态结束的方式)并进行汇总对应于接受状态的所有条目(在这种情况下,只有最后一个).该矩阵的64次幂的左下角元素是

Read result off that matrix

Now applying this matrix with a column vector once has the following meaning: if the number of possible ways to arrive in a given state is encoded in the input vector, the output vector gives you the number of ways one transition later. Take the 64th power of that matrix, concentrate on the first column (since ste start situation is encoded as (1,0,0,0,0,0), meaning only one way to end up in the start state) and sum up all the entries that correspond to accepting states (only the last one in this case). The bottom left element of the 64th power of this matrix is

1474472506836676237371358967075549167865631190000000000000000000000

证实了我的其他答案.

为了实际计算该矩阵的64次方,最简单的方法是重复平方:对矩阵平方6次后,您的指数为2 6 = 64.在您的指数(即最大字符串长度)不是2的幂的情况下,您仍然可以执行通过平方进行幂运算根据指数的位模式乘以相关的平方.这就是使这种方法采用O(log n )算术运算来计算字符串长度 n 的结果的方法,假定状态的数量固定,因此每个矩阵的成本固定平方.

In order to actually compute the 64th power of that matrix, the easiest approach would be repeated squaring: after squaring the matrix 6 times you have an exponent of 26 = 64. If in some other scenario your exponent (i.e. maximal string length) is not a power of two, you can still perform exponentiation by squaring by multiplying the relevant squares according to the bit pattern of the exponent. This is what makes this approach take O(log n) arithmetic operations to compute the result for string length n, assuming a fixed number of states and therefore fixed cost for each matrix squaring.

如果您要使用通常的动力装置构造使我的自动机具有确定性,那么您最终会得到

If you were to make my automaton deterministic using the usual powerset construction, you'd end up with

并将状态分类为 a bc c d cf cef f 一个将获得转换矩阵

and sorting the states as a, bc, c, d, cf, cef, f one would get the transition matrix

⎛ 0  0  0  0  0  0  0⎞
⎜ 9 10  0  0  0  0  0⎟
⎜ 1  0  0  0  0  0  0⎟
⎜ 0  1  1  0  1  1  0⎟
⎜ 0  0  0  1  0  0  0⎟
⎜ 0  0  0  9  0 10  0⎟
⎝ 0  0  0  0  1  1  1⎠

并可以将其第64次幂的第一列的最后三个元素相加,以获得与上述相同的结果.

and could sum the last three elements of the first column of its 64th power to obtain the same result as above.

这篇关于计算UID可能性的大小的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆