Finding lines in a text file matching a regular expression


Question

Can anyone explain how I could use regular expressions in Ruby to return only the matches for a string?

For example, if the code reads in a .txt file with a series of names in it:

John Smith
James Jones
David Brown
Tom Davidson
etc etc

...and the word to match is typed in as 'ohn', it would then return just 'John Smith', but none of the other names.

Solution

Note: In modern Rubies, use IO.foreach (or File.foreach, which File inherits from IO) rather than File.each_line, which is not defined as a class method. For instance:

[1] pry(main)> IO.foreach('./.bashrc') do |l|
[1] pry(main)*   puts l
[1] pry(main)* end
export PATH=~/bin:$PATH
export EDITOR='vi'
export VISUAL=$EDITOR

Progress happens and things change.


Here are some different ways to get where you're going.

Notice first I'm using a more idiomatic way of writing the code for reading lines from a file. Ruby's IO and File libraries make it very easy to open, read and close the file in a nice neat package.

# Print every line that contains the substring 'ohn'.
File.foreach('file.txt') do |li|
  puts li if li['ohn']
end

That looks for 'ohn' anywhere in the line, but doesn't bother with a regular expression.

# Same search, expressed as a regular expression.
File.foreach('file.txt') do |li|
  puts li if li[/ohn/]
end

That looks for the same string, only it uses a regex to get there. Functionally it's the same as the first example.
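Since the question mentions that the word to match is typed in, here is a minimal sketch of building the pattern from that input. The helper name `lines_matching` is hypothetical, and the Tempfile only stands in for the real names file so the example runs on its own. `Regexp.escape` is used on the assumption that the typed word should be matched literally, not interpreted as a pattern.

```ruby
require 'tempfile'

# Hypothetical helper: return the lines of the file at `path` that contain `word`.
def lines_matching(path, word)
  pattern = Regexp.new(Regexp.escape(word)) # treat the typed word literally
  matches = []
  File.foreach(path) do |li|
    matches << li.chomp if li[pattern]
  end
  matches
end

# Stand-in for the names file from the question.
names = Tempfile.new('names')
names.write("John Smith\nJames Jones\nDavid Brown\nTom Davidson\n")
names.close

found = lines_matching(names.path, 'ohn')
puts found
```

If the typed text is meant to be a real regular expression, dropping the `Regexp.escape` call would pass it through unchanged.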

# Only match 'ohn' when it sits at the end of a word.
File.foreach('file.txt') do |li|
  puts li if li[/ohn\b/]
end

This is a slightly smarter way of looking for names that end with 'ohn'. It uses a regex, but also specifies that the pattern has to occur at the end of a word, so it matches "John" but not "Johnson". \b means "word boundary".
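To see what the word boundary changes, here is a small sketch comparing the two patterns on an in-memory list. The name 'Johnson Lee' is not in the original file; it is added here only to show a case where the patterns disagree.

```ruby
# 'Johnson Lee' is a hypothetical extra name used to show the difference.
names = ['John Smith', 'Johnson Lee', 'James Jones']

plain    = names.select { |n| n[/ohn/] }   # 'ohn' anywhere in the string
boundary = names.select { |n| n[/ohn\b/] } # 'ohn' only at the end of a word

p plain
p boundary
```

The plain pattern keeps both 'John Smith' and 'Johnson Lee', while the version with \b keeps only 'John Smith', because in 'Johnson' the 'ohn' is followed by another word character.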

Also, when reading files, it's important to always think ahead about whether the file being read could ever exceed the RAM available to your app. It's easy to read an entire file into memory in one pass, then process it from RAM, but you can cripple or kill your app or machine if you exceed the physical RAM available to you.


"Do you know if the code shown by the other answers is in fact loading the entire file into RAM or is somehow optimized by streaming from the readlines function to the select function?"

From the IO.readlines documentation:

"Reads the entire file specified by name as individual lines, and returns those lines in an array. Lines are separated by sep."
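To make the difference concrete, here is a minimal sketch, assuming the other answers used readlines followed by select as the comment describes. The Tempfile and its contents are stand-ins so the example runs on its own.

```ruby
require 'tempfile'

# Stand-in for the names file from the question.
file = Tempfile.new('names')
file.write("John Smith\nJames Jones\nDavid Brown\nTom Davidson\n")
file.close

# Slurping: readlines builds an array of every line before select runs.
slurped = File.readlines(file.path).select { |li| li[/ohn/] }

# Streaming: without a block, foreach returns an enumerator, and grep
# pulls one line at a time, keeping only the lines that match.
streamed = File.foreach(file.path).grep(/ohn/)

puts slurped
puts streamed
```

Both forms return the same matching lines. The difference is that the first holds the whole file in an array at once, while the second only accumulates the matches.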

An additional consideration is memory allocation during a large, bulk read. Even if you have sufficient RAM, you can run into situations where a language chokes as it reads in the data, finds it hasn't allocated enough memory to the variable, and has to pause as it grabs more. That cycle repeats until the entire file is loaded.

I became sensitive to this many years ago when I was loading a very big data file into a Perl app on HP's biggest minicomputer, which I managed. The app would pause for a couple of seconds periodically and I couldn't figure out why. I dropped into the debugger and couldn't find the problem. Finally, by tracing the run using old-school print statements, I isolated the pauses to a file "slurp". I had plenty of RAM and plenty of processing power, but Perl wasn't allocating enough memory. I switched to reading line by line and the app flew through its processing. Ruby, like Perl, has good I/O, and can read a big file very quickly when it's reading line by line. I have never found a good reason for slurping a text file, except when the content I want might be spread across several lines, but that is not a common occurrence.
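For that less common multi-line case, a rough sketch of why slurping helps might look like this. The file contents are made up for illustration, with one name wrapped across two lines.

```ruby
require 'tempfile'

# Hypothetical data: "John Smith" is split across a line break.
file = Tempfile.new('wrapped')
file.write("Contact: John\nSmith, Sales\nContact: James Jones, Support\n")
file.close

pattern = /John\s+Smith/ # \s also matches the newline between the two words

line_hits = File.foreach(file.path).grep(pattern) # sees one line at a time
slurp_hit = File.read(file.path)[pattern]         # sees the whole text at once

p line_hits
p slurp_hit
```

Reading line by line finds nothing here because no single line contains both words, while the slurped text matches across the line break. This only makes sense when the file is known to fit comfortably in memory.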
