Processing text inside variable before writing it into file

Question

I'm using the Perl WWW::Mechanize package to fetch and process data from some websites. My usual procedure is as follows:

  1. Fetch a webpage

$mech->get("$url");

  2. Save the webpage contents in a variable (BTW, I'm not sure whether it's right to keep this much text in a scalar, which, as far as I know, is supposed to hold a single value)

my $list = $mech->content();

  3. Use a subroutine that I've created to write the contents of the variable to a text file. (The writeToFile subroutine includes a few more features, like path and existing-file validation.)

writeToFile("$filename.tmp","$path",$list);

  4. Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then delete the initial temporary file).

What I wonder is whether it is possible to perform the processing before storing the text in a file, directly inside the $list variable. The whole process works as expected, but I don't really like the logic behind it, and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.

Just to give a bit more information about what I'm actually after when I process the variable contents: the data I fetch from the website in this case is a list of items separated by blank lines, and the first line is irrelevant to me. So while processing this data I do two things:

  1. Remove blank (CRLF) lines
  2. Remove the first line, which contains specific text.

Ideally I want to save the processed list (no blank lines and the first line removed) in a file without creating any additional files along the way. To save the file I would like to use the writeToFile sub (which I wrote), since it also validates whether such a file already exists (if a file is saved before the final processing, writeToFile will always rewrite the existing file).

Hope that makes sense.

Recommended answer

You're looking for split. The pattern depends on what you need: use (?<=\n) to split after each newline character and keep it on the line. If keeping it doesn't matter, use \R to match all sorts of line breaks.

foreach my $line (split qr/\R/, $mech->content) {
    …
}
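
To tie this back to the two processing steps in the question, here is a minimal sketch of doing the whole cleanup in memory. The clean_list helper and the sample data are made up for illustration and are not from the original post; it assumes Perl 5.10 or later for \R, and that the unwanted line is always the first non-blank line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes the page text as one
# string and returns the cleaned list as one string.
sub clean_list {
    my ($text) = @_;

    # Split on any kind of line break (\R covers LF, CRLF, CR, ...).
    my @lines = split /\R/, $text;

    # Step 1 from the question: drop blank (empty or whitespace-only) lines.
    @lines = grep { /\S/ } @lines;

    # Step 2 from the question: drop the first remaining line.
    shift @lines;

    # Rejoin, ending every remaining item with a single newline.
    return join '', map { "$_\n" } @lines;
}

# Made-up sample input standing in for $mech->content.
my $list    = "Header to discard\r\n\r\nitem one\r\n\r\nitem two\r\n";
my $cleaned = clean_list($list);
print $cleaned;
```

The cleaned string can then be handed to writeToFile once, in the same way $list is passed in the question, so no temporary file is needed.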

Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.
