How to make exceptions for certain words in regex


Question

I'm very new to programming and regex, so I apologise if this has been asked before (I couldn't find it, though).

I want to use Python to summarise word frequencies in a piece of text. Let's assume the text is formatted like this:

Chapter 1
blah blah blah

Chapter 2
blah blah blah
....

I read the text into a string and want to use re.findall to get every word in it, so my code is:

wordlist = re.findall(r'\b\w+\b', text)

The problem is that this also matches the word Chapter in every chapter title, which I don't want to include in my stats. So I want to ignore anything that matches Chapter\s*\d+. What should I do?

Thanks in advance, guys.

Recommended answer

Solution

You could first remove every occurrence of "Chapter" followed by optional whitespace and digits:

wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
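As a minimal sketch of how the two-step approach fits the original goal of counting word frequencies, the word list can be fed to collections.Counter. The sample text and the helper name word_frequencies are made up for illustration and are not part of the original answer.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count words after stripping 'Chapter <number>' headings (hypothetical helper)."""
    # Step 1: remove every chapter heading, e.g. "Chapter 12".
    body = re.sub(r'Chapter\s*\d+\s*', '', text)
    # Step 2: collect the remaining words and count them.
    return Counter(re.findall(r'\b\w+\b', body))

# Made-up sample in the format described in the question.
sample = "Chapter 1\nblah blah foo\n\nChapter 2\nblah bar\n"
freq = word_frequencies(sample)
print(freq.most_common(1))  # [('blah', 3)]
```

Counter also offers most_common(n), which gives the n most frequent words directly.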

If you want to use just one search, you can use a negative lookahead to find any word that does not begin a "Chapter X" heading and does not start with a digit:

wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
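To see what this pattern keeps and drops, here is a small sketch on a made-up sample string. Note the side effect: because every match must start with a letter, any token that begins with a digit is skipped everywhere, not only in chapter headings, while the word "Chapter" itself is kept when no number follows it.

```python
import re

# Same pattern as above, compiled once for reuse.
pattern = re.compile(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b')

# Made-up sample: one heading, then a line that mentions "Chapter" and a year.
sample = "Chapter 1\nThe Chapter house burned in 1666\n"
words = pattern.findall(sample)
print(words)  # ['The', 'Chapter', 'house', 'burned', 'in']
```

If numbers in the body text should be counted as words, the two-step version is probably the safer choice.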

If performance is an issue, loading a huge string and parsing it with a single regex isn't the right approach anyway. Just read the file line by line, discard any line that matches r'^Chapter\s*\d+', and parse each remaining line separately with r'\b\w+\b':

import re

# Read all lines of the file into a list.
with open("huge_file.txt", "r") as f:
    lines = f.readlines()

wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')  # chapter heading lines to skip
words = re.compile(r'\b\w+\b')           # a single word
for line in lines:
    if not chapter.match(line):
        wordlist.extend(words.findall(line))

print(len(wordlist))

Performance

I wrote a small Ruby script to generate a huge file:

all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten

File.open('huge_file.txt','w+') do |txt|
  newline=true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand<0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand<0.10
      txt.puts
      newline = true
    end
  end
end

The resulting file has more than 50 million words and is about 483MB in size:

Chapter 154
schoolyard trashcan's holly's continuations

Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's

Chapter 142
spender's
vests
Ladoga

Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's

On average, the two-step process took 12.2s to extract the word list, the lookahead method took 13.5s, and Wiktor's answer also took 13.5s. The first version of the lookahead method I wrote used re.IGNORECASE and took around 18s.

When reading the whole file, there's basically no difference in performance between the regex methods.
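For anyone who wants to repeat this kind of comparison, a rough timing harness might look like the sketch below. It is not the script used for the figures above: the helper average_seconds, the function names, and the tiny synthetic input are all assumptions for illustration, and absolute timings will differ by machine and input size.

```python
import re
import time

def two_step(text):
    # Remove chapter headings first, then collect words.
    return re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*', '', text))

def lookahead(text):
    # Single search with a negative lookahead.
    return re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', text)

def average_seconds(func, text, repeats=3):
    """Average wall-clock time of func(text) over a few runs (hypothetical helper)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(text)
    return (time.perf_counter() - start) / repeats

# Small synthetic input; the figures above came from a ~483MB file instead.
text = "Chapter 7\nalpha beta gamma\n\n" * 10000
for func in (two_step, lookahead):
    print(func.__name__, round(average_seconds(func, text), 4))
```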

What surprised me, though, is that the readlines script took around 20.5s and didn't use much less memory than the other scripts. If you have any ideas on how to improve the script, please comment!
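One likely contributor to the memory use is that readlines() builds a list containing every line, so the whole file still ends up in memory, and the full word list is kept as well. A possible variation, sketched here under the assumption that only the frequencies are needed, iterates over the file object lazily and updates a Counter as it goes. The function name count_words and the temporary demo file are made up for illustration; whether this is actually faster has not been measured here.

```python
import os
import re
import tempfile
from collections import Counter

CHAPTER = re.compile(r'^Chapter\s*\d+')
WORD = re.compile(r'\b\w+\b')

def count_words(path):
    """Stream a file line by line and count words outside chapter headings."""
    counts = Counter()
    with open(path, "r") as handle:
        # Iterating over the file object yields one line at a time.
        for line in handle:
            if not CHAPTER.match(line):
                counts.update(WORD.findall(line))
    return counts

# Small demo on a temporary file (a stand-in for huge_file.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Chapter 1\nblah blah foo\n\nChapter 2\nblah bar\n")
counts = count_words(tmp.name)
os.remove(tmp.name)
print(sum(counts.values()), counts["blah"])  # 5 3
```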
