Bash: Remove headers from HTTP response
Problem description
If I have some text containing HTTP headers and a body, e.g.:
HTTP/1.1 200 OK
Cache-Control: public, max-age=38
Content-Type: text/html; charset=utf-8
Expires: Fri, 22 Nov 2013 06:15:01 GMT
Last-Modified: Fri, 22 Nov 2013 06:14:01 GMT
Vary: *
X-Frame-Options: SAMEORIGIN
Date: Fri, 22 Nov 2013 06:14:22 GMT

<!DOCTYPE html>
<html>
<head>
<title>My website</title>
</head>
<body>
Hello world!
</body>
</html>
and this text is being piped in from a command, how can I remove the headers to leave just the body?
(Within the headers, \r\n is used as the line break. \r\n\r\n marks the end of the headers and the start of the body.)
Here's what I've tried (... indicates any command, such as cat or curl, which will output some HTTP headers and a body to stdout):
sed
My first idea was to do a substitution with sed, to remove everything up to and including the first occurrence of \r\n\r\n:
... | sed 's|^.*?\r\n\r\n||'
But this doesn't work, mainly because sed processes its input one line at a time, so a pattern can't match across the \n at the end of a line. (In addition, it doesn't support the non-greedy ? operator.)
grep
I also thought of using grep with a positive lookbehind for \r\n\r\n:
... | grep -oP '(?<=\r\n\r\n).*'
But this doesn't work either (mainly because grep only operates on individual lines).
pcregrep has a multiline mode (-M), but pcregrep is often not available (it's not installed by default on Ubuntu 12.04, Mac OS X 10.7, etc.), and I'd like a solution which doesn't require any non-standard tools.
perl
I then thought of doing a substitution with perl, using the /s modifier so that . matches line breaks:
... | perl -pe 's/^.*?\r\n\r\n//s'
I think this is closer to a working solution. However, I think Perl's Input Record Separator ($/) is \n by default, and needs to be changed to \r\n, so that . can match \r\n. The -0 option can be used to set $/ to a single character, but not multiple characters. I've tried this, but I don't think it's correct:
... | perl -pe '$/ = "\r\n"; s/^.*?\r\n\r\n//s'
Also, I think ^ is matching "start of line", but needs to match "start of file".
Offset and substring
I had an idea of getting the offset of \r\n\r\n using:
BodyOffset=$(expr index "$MyHttpText" "\r\n\r\n")
and then extracting the body as a substring using:
HttpBody=${MyHttpText:BodyOffset}
Unfortunately, the Mac OS X version of expr doesn't support index. Also, if possible, I'd like a solution which doesn't require the creation of variables.
Parameter substitution
One other idea I had was to use parameter substitution, where # means "Remove from $MyHttpText the shortest part of *\r\n\r\n that matches the front end of $MyHttpText":
HttpBody=${MyHttpText#*\r\n\r\n}
But I'm not sure how to use this in a piped sequence of commands, and again I'd prefer a solution which doesn't require variables.
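As a rough sketch of how parameter substitution could sit inside a pipeline: a small function can read stdin into a variable and print the result, so the variable stays local to that step. Two assumptions are worth flagging. Written literally in a pattern, \r\n\r\n is not turned into control characters, so the separator is built with printf here (in bash, $'\r\n\r\n' would also do). Shell variables also cannot hold NUL bytes, so this only suits text bodies. The names body_from_response, sep and response are invented for the example.

```shell
# Hypothetical helper: buffer stdin, then strip the shortest prefix that
# ends in CRLF CRLF using ${var#pattern}.
body_from_response() {
  # Build the separator; the trailing "x" keeps command substitution
  # from stripping the final newline, and is removed again afterwards.
  sep=$(printf '\r\n\r\nx'); sep=${sep%x}
  # Same trick to keep any trailing newlines of the response itself.
  response=$(cat; printf x); response=${response%x}
  # Quoting "$sep" makes it match literally rather than as a glob pattern.
  printf '%s' "${response#*"$sep"}"
}

printf 'HTTP/1.1 200 OK\r\nVary: *\r\n\r\nHello world!\n' | body_from_response
```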
sed can do this:
sed '1,/^$/d' data.txt
This command deletes everything from line 1 through the first empty line (^$), inclusive. This works if your lines end in \n. If they end in \r\n, you can use dos2unix and unix2dos to convert them back and forth, or you can add the \r character to the sed regex:
sed '1,/^\r$/d' data.txt
However, that last command only works if your lines end in \r\n. To make it work with both kinds of line ending, you can use:
sed '1,/^\r\{0,1\}$/d' data.txt
Here we are looking for a line that contains nothing except zero or one \r characters.
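To try this in a pipeline rather than on data.txt, a small self-contained sketch follows. It assumes a sed that understands \r inside a regex (GNU sed does; other implementations may need a literal carriage return in its place), and it uses printf to fake a response instead of a real curl call. The function name strip_http_headers is made up for the example.

```shell
# Hypothetical wrapper around the last sed command, reading from stdin.
strip_http_headers() {
  sed '1,/^\r\{0,1\}$/d'
}

# CRLF-terminated headers, as an HTTP server would send them.
printf 'HTTP/1.1 200 OK\r\nVary: *\r\n\r\nHello world!\n' | strip_http_headers

# LF-only input is handled by the same expression.
printf 'HTTP/1.1 200 OK\nVary: *\n\nHello world!\n' | strip_http_headers
```

Because the range ends at the first blank line, blank lines that occur later in the body are left alone.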