Japanese COBOL Code: rules for G literals and identifiers?

Question

We are processing IBM Enterprise Japanese COBOL source code.

The rules that describe exactly what is allowed in G type literals, and what is allowed for identifiers, are unclear.

The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.

Revised/extended below for clarity:

Does anyone know the exact rules of G literal formation, and how they (don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):

#token non_numeric_literal_quote_g [STRING]
  "<G><squote><ShiftOut> (  
     (<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)  
     (<NotLineOrParagraphSeparator>|<squote><squote>)

     | <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
                   <ShiftIn>|<ShiftOut>)

     | <squote><squote>

 )* <ShiftIn><squote>"

where <name> is a macro that is another regular expression. Presumably they are named well enough so you can guess what they contain.

Here is the IBM Enterprise COBOL Reference. Chapter 3 "Character Strings", subheading "DBCS literals", page 32, is relevant reading. I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means when it says "one or more characters in the range X'00'...X'FF' for either byte". How can DBCS-characters be anything but pairs of 8-bit character codes? The existing RE matches 3 types of pairs of characters if you examine it.
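To make the "pairs of bytes" reading concrete, here is a minimal Python sketch of a byte-level pattern for the form described above: `G`, a quote, SHIFT-OUT, zero or more byte pairs, SHIFT-IN, then the same quote. It assumes the source is available as raw bytes in an ASCII-compatible encoding with SO = 0x0E and SI = 0x0F. The names `G_LITERAL` and `NAIVE` are illustrative, and this is one reading of the rule, not the manual's grammar.

```python
import re

# Assumptions: SHIFT-OUT is 0x0E, SHIFT-IN is 0x0F, and the file is read as bytes.
# Content between SO and SI is consumed strictly two bytes at a time.
G_LITERAL = re.compile(rb"G(['\"])\x0e((?:[\x00-\xff]{2})*?)\x0f\1")

# For contrast: a pattern that ignores pairing and stops at the first SI + quote.
NAIVE = re.compile(rb"G'\x0e[\x00-\xff]*?\x0f'")

# Two byte pairs: (0x41, 0x0F) and (0x27, 0x41). The first pair happens to end
# in the SI byte and the second begins with the apostrophe byte.
sample = b"G'\x0e\x41\x0f\x27\x41\x0f'"

m = G_LITERAL.match(sample)
naive = NAIVE.match(sample)

print(m.group(0) == sample)   # pair-aware pattern consumes the whole literal
print(len(naive.group(0)))    # naive pattern stops early, inside the literal
```

The contrived sample shows why pair counting matters if any byte value really is allowed in either position: a scanner that cannot see pair boundaries can mistake bytes inside a pair for the closing SHIFT-IN and quote.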

One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that means the RE would only reject literal strings containing single <squote>s. I don't believe that's the problem we are having as we seem to trip over every instance of a G literal.

Similarly, COBOL identifiers can apparently be composed with DBCS characters. What is allowed for an identifier, exactly? Again a regular expression would be ideal.

I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text. Our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. Likely what is happening is that the DBCS data is getting translated as if it were Shift-JIS, and that would muck up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
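The loss of the two-byte structure is easy to reproduce. The following Python sketch decodes one valid Shift-JIS byte pair (0x83 0x41, the katakana "ア", chosen here purely for illustration) and compares the byte count with the character count after decoding.

```python
# One Shift-JIS double-byte character: katakana "ア" is encoded as 0x83 0x41.
raw = bytes([0x83, 0x41])

# Decoding the whole stream up front merges the pair into one code point.
text = raw.decode("shift_jis")

print(len(raw))    # 2 bytes in the original data
print(len(text))   # 1 character after decoding
print(hex(ord(text)))
```

Once the stream has been decoded this way, a lexer working on the Unicode text sees one character where the raw data had two bytes, so any rule phrased in terms of byte pairs can no longer be applied directly.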

Recommended answer

Try allowing a single quote in your rule, to see whether the literals pass, by making this change:

  <squote><squote> => <squote>{1,2}

If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.

I thought you had all the other DBCS literals working and were only having issues with G strings, so I just pointed out the difference between N and G. Now I have taken a closer look at your RE. It has problems. In the COBOL I used, you can mix ASCII with Japanese, for example,

  G"ABC<ヲァィ>" <> are Shift-out/shift-in

Your RE assumes DBCS only. I would loosen this restriction and try again.

I don't think it's possible to handle G literals entirely in a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
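As a rough illustration of handling the literal by hand, here is a small Python scanner that works on raw bytes. Everything in it is an assumption made for the sketch: the helper name `scan_g_literal`, SO = 0x0E and SI = 0x0F, an ASCII-compatible source encoding, single-byte text allowed outside SO..SI, and byte pairs inside. It does not handle doubled quotes or continuation lines.

```python
SO, SI = 0x0E, 0x0F  # assumed SHIFT-OUT / SHIFT-IN byte values


def scan_g_literal(data: bytes, start: int = 0):
    """Return the index just past a G literal beginning at `start`, or None.

    Hypothetical helper: outside SO..SI each byte is one character and the
    opening quote character closes the literal; inside SO..SI bytes are
    consumed two at a time without being interpreted.
    """
    if start + 1 >= len(data) or data[start] not in b"Gg":
        return None
    quote = data[start + 1]
    if quote not in b"'\"":
        return None
    i = start + 2
    shifted = False
    while i < len(data):
        b = data[i]
        if shifted:
            if b == SI:              # SI is only tested at a pair boundary
                shifted = False
                i += 1
            elif i + 1 < len(data):
                i += 2               # one DBCS character = one byte pair
            else:
                return None          # odd trailing byte: malformed
        elif b == SO:
            shifted = True
            i += 1
        elif b == quote:
            return i + 1             # closing quote outside SO..SI
        elif b in b"\r\n":
            return None              # literal not closed on this line
        else:
            i += 1                   # ordinary single-byte character
    return None


mixed = b'G"ABC\x0e\x83\x41\x83\x27\x0f" rest'
print(scan_g_literal(mixed))
```

Because the scanner keeps its own "shifted" flag, a quote byte that appears as half of a DBCS pair is skipped rather than treated as the end of the literal, and a literal that mixes single-byte text with a shifted run is accepted.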

You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16; treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
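Since the shift codes may not be the ones the lexer expects, a quick check on the raw bytes can help before any lexing is attempted. The Python sketch below is only a heuristic under stated assumptions: the function name `guess_shift_codes` is made up, it considers just the two conventions mentioned here (0x0E/0x0F and 0x1E/0x1F), and it trusts a pair only when both bytes occur equally often.

```python
def guess_shift_codes(data: bytes):
    """Guess which (SO, SI) byte values a source file uses, or None.

    Hypothetical heuristic: among the candidate pairs, pick the one whose
    two bytes occur the same number of times and most frequently.
    """
    candidates = [(0x0E, 0x0F), (0x1E, 0x1F)]
    best = None
    best_count = 0
    for so, si in candidates:
        n_so = data.count(so)   # bytes.count accepts an int byte value
        n_si = data.count(si)
        if n_so == n_si and n_so > best_count:
            best, best_count = (so, si), n_so
    return best


print(guess_shift_codes(b"MOVE G'\x1e\x83\x41\x1f' TO X"))
```

A result of `None` suggests the file either has no shifted runs or uses some other convention, which is worth knowing before blaming the regular expression.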

I am just trying to help you shoot in the dark without seeing the actual code :)
