Decoding algorithm wanted
Problem description
I receive encoded PDF files regularly. The encoding works like this:
- the PDFs display correctly in Acrobat Reader
- select all and copy the text via Acrobat Reader
- paste it into a text editor
- the pasted content turns out to be encoded
For example:
13579 -> 3579;
hello -> jgnnq
It's basically an offset (maybe a swap) of ASCII characters.
The question is: how can I find the offset automatically when I have access to only a few samples? I cannot be sure whether the encoding offset changes. All I know is that some text will usually (if not always) show up inside the PDF, e.g. "Name:", "Summary:", "Total:".
Thank you!
Edit: thanks for the feedback. I'll try to break the question into smaller questions:
Part 1: How to detect identical part(s) inside a string?
You need to brute-force it.
If the pattern is as simple as in your examples, where every character code is shifted by +2, each character maps like this (original, +1, +2):
h i j
e f g
l m n
l m n
o p q
1 2 3
3 4 5
5 6 7
7 8 9
9 : ;
You could easily implement it like this to check against known words:
>>> text = 'jgnnq'
>>> knowns = ['hello', '13579']
>>>
>>> for i in range(-5, 6):  # check char-code shifts from -5 to +5
...     rot = ''.join(chr(ord(j) + i) for j in text)
...     for x in knowns:
...         if x in rot:
...             print(rot)
...
hello
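The same idea could be wrapped in a function that reports the offset itself, so it can be run on each new file in case the offset changes. This is only a sketch: `find_shift` and `max_shift` are made-up names, and it assumes one constant shift applied to every character, which the two samples suggest but do not prove.

```python
def find_shift(encoded, known_words, max_shift=25):
    """Try every shift in [-max_shift, max_shift] and return (shift, decoded)
    for the first decoding that contains one of the known words, else None.
    Assumes a single constant character-code offset (an assumption)."""
    for shift in range(-max_shift, max_shift + 1):
        try:
            # Undo the shift: encoded char code minus the candidate offset.
            decoded = ''.join(chr(ord(c) - shift) for c in encoded)
        except ValueError:
            # chr() rejects negative code points; skip this candidate.
            continue
        if any(word in decoded for word in known_words):
            return shift, decoded
    return None


# Build a +2 encoded sample from text that contains the expected labels.
sample = ''.join(chr(ord(c) + 2) for c in 'Name: Bob Total: 42')
print(find_shift(sample, ['Name:', 'Summary:', 'Total:']))
```

If the function returns None for a file, either the offset is outside the tested range or the encoding is not a plain shift (for example a character swap), and a different approach would be needed.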