How to filter only printable characters in a file on Bash (Linux) or Python?

Question

I want to convert a file that contains non-printable characters into one that contains only printable characters. I think this problem is related to ASCII control characters, but I could not find a solution, and I also could not work out the meaning of .[16D (an ASCII control sequence?) in the following file.

Hex dump of the input file:

00000000: 4845 4c4c 4f20 5448 4953 2049 5320 5448 HELLO THIS IS TH
00000010: 4520 5445 5354 1b5b 3136 4420 2020 2020 E TEST.[16D
00000020: 2020 2020 2020 2020 2020 201b 5b31 3644            .[16D
00000030: 2020

When I ran cat on that file in Bash, I got just "HELLO ". I think this is because cat, by default, interprets those ASCII control sequences, the two .[16D strings.

Why do the two .[16D strings make cat FILE print just "HELLO "? And how can I make the file contain only printable characters, i.e., "HELLO "?

Recommended answer

The hexdump shows that the dot in .[16D is actually an escape character, \x1b.
Esc[nD is the ANSI escape code that moves the cursor back (left) by n columns. So Esc[16D tells the terminal to move the cursor back 16 columns, and the spaces that follow overwrite the text that was there, which explains the cat output.
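
To see why the output collapses to "HELLO", here is a minimal sketch of how a terminal treats that data on a single line. It is an illustration only: the render_line helper is hypothetical, it understands just plain characters and the Esc[nD (cursor back) sequence, and it ignores everything else a real terminal does.

```python
import re

# Matches ESC [ <digits> D, the "cursor back" sequence; nothing else is handled.
CURSOR_BACK = re.compile(r"\x1b\[(\d*)D")

def render_line(data):
    """Replay text plus ESC[nD sequences onto a one-line buffer (hypothetical helper)."""
    cells = []   # characters currently shown on the line
    col = 0      # cursor column
    pos = 0
    while pos < len(data):
        match = CURSOR_BACK.match(data, pos)
        if match:
            # An omitted count defaults to 1; the cursor stops at column 0.
            count = int(match.group(1) or "1")
            col = max(0, col - count)
            pos = match.end()
            continue
        # Ordinary character: overwrite the cell under the cursor, then advance.
        if col < len(cells):
            cells[col] = data[pos]
        else:
            cells.append(data[pos])
        col += 1
        pos += 1
    return "".join(cells)

data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + "  "
rendered = render_line(data)
print(repr(rendered.rstrip()))  # 'HELLO'
```

The first Esc[16D puts the cursor just after "HELLO ", the 16 spaces then overwrite "THIS IS THE TEST", and the second Esc[16D followed by two spaces changes nothing visible.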

There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g., using sed, as in Anubhava's answer) or Python.
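
As a rough sketch of the Python route, the snippet below strips CSI escape sequences with a regular expression and then drops any remaining non-printable characters. The pattern and the strip_ansi helper name are assumptions for illustration, not code taken from either answer, and the pattern covers only the common "Esc[ ... final letter" form rather than every escape sequence a terminal accepts.

```python
import re
import string

# CSI sequences: ESC [ then parameter bytes, intermediate bytes, and one final byte.
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_ansi(text):
    """Remove CSI escape sequences, then keep only characters in string.printable."""
    without_csi = CSI_SEQUENCE.sub("", text)
    return "".join(ch for ch in without_csi if ch in string.printable)

data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + "  "
cleaned = strip_ansi(data)
print(repr(cleaned.rstrip()))  # 'HELLO THIS IS THE TEST'
```

Note that this leaves "HELLO THIS IS THE TEST" followed by spaces, not "HELLO": stripping the sequences discards the cursor movement instead of applying it.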

However, in cases like this, it may be better to run the file through a terminal emulator to interpret any existing editing control sequences in the file, so you get the result the file's author intended after they applied those editing sequences.

One way to do that in Python is to use pyte, a Python module that implements a simple VTXXX-compatible terminal emulator. You can install it using pip, and its documentation is hosted on readthedocs.

Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. pyte is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.

#!/usr/bin/env python

''' pyte VTxxx terminal emulator demo

    Interpret a byte string containing text and ANSI / VTxxx control sequences

    Code adapted from the demo script in the pyte tutorial at
    http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial

    Posted to http://stackoverflow.com/a/30571342/4014959 

    Written by PM 2Ring 2015.06.02
'''

import pyte


#hex dump of data
#00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
#00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
#00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
#00000030  20 20                                             |  |

data = 'HELLO THIS IS THE TEST\x1b[16D                \x1b[16D  '

#Create a default sized screen that tracks changed lines
screen = pyte.DiffScreen(80, 24)
screen.dirty.clear()
stream = pyte.ByteStream()
stream.attach(screen)
stream.feed(data)

#Get index of last line containing text
last = max(screen.dirty)

#Gather lines, stripping trailing whitespace
lines = [screen.display[i].rstrip() for i in range(last + 1)]

print '\n'.join(lines)

Output

HELLO

Hex dump of the output

00000000  48 45 4c 4c 4f 0a                                 |HELLO.|
