How to extract character n-grams from sentences? - Python

Question

The following expression, the core of the word2ngrams function, extracts character 3-grams from a word:

>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']

This post shows character n-gram extraction for a single word: Quick implementation of character n-grams using python.

But what if I have sentences and want to extract their character n-grams? Is there a faster method than iteratively calling word2ngrams()?

What would a regex version that produces the same word2ngrams and sent2ngrams output look like? Would it be faster?

I've tried:

import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    """ Split the sentence on whitespace and collect the ngrams of each word. """
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    """ Slide over the whole sentence and skip windows that contain a space. """
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

# Generate 100 random sentences, each with 100 words of 10 characters.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

print(x == y)

[out]:

0.0205280780792
0.0271739959717
True
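
A single time.time() measurement like the one above is sensitive to noise, so the gap between the two functions may vary from run to run. As a rough sketch, the same comparison can be repeated with the standard timeit module, taking the best of several runs. The helpers make_sentences and best_time are hypothetical names introduced here for illustration, and the fixed seed is an assumption made only so the test data is reproducible.

```python
import random
import string
import timeit
from itertools import chain


def word2ngrams(text, n=3):
    """Convert a word into character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def sent2ngrams(text, n=3):
    """Collect the n-grams of every whitespace-separated word."""
    return list(chain.from_iterable(word2ngrams(w, n) for w in text.lower().split()))


def sent2ngrams_simple(text, n=3):
    """Slide over the whole sentence, skipping windows that contain a space."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1) if " " not in text[i:i + n]]


def make_sentences(count=100, words=100, word_len=10, seed=0):
    """Hypothetical helper: build reproducible random test sentences."""
    rng = random.Random(seed)
    return [
        " ".join(
            "".join(rng.choice(string.ascii_uppercase) for _ in range(word_len))
            for _ in range(words)
        )
        for _ in range(count)
    ]


def best_time(func, sents, repeat=5):
    """Hypothetical helper: best wall-clock time over several full passes."""
    return min(timeit.repeat(lambda: [func(s) for s in sents], number=1, repeat=repeat))


sents = make_sentences()
for func in (sent2ngrams, sent2ngrams_simple):
    print(func.__name__, best_time(func, sents))
```

The absolute numbers depend on the machine and Python version, so only the relative ordering on your own setup is meaningful.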

EDITED

The regex method looks elegant, but it is slower than iteratively calling word2ngrams():

import string, random, time, re
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    """ Split the sentence on whitespace and collect the ngrams of each word. """
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    """ Slide over the whole sentence and skip windows that contain a space. """
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    """ Capture n non-whitespace characters inside a lookahead at every position. """
    rgx = r'(?=(' + r'\S'*n + r'))'
    # Lowercase here as well; otherwise the result cannot equal the other two.
    return re.findall(rgx, text.lower())

# Generate 100 random sentences, each with 100 words of 10 characters.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

start = time.time()
z = [sent2ngrams_regex(i) for i in sents]
print(time.time() - start)

print(x == y == z)

[out]:

0.0211708545685
0.0284190177917
0.0303599834442
True
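
One variation that might be worth measuring is compiling the pattern once per value of n instead of rebuilding the pattern string on every call. The sketch below is only an assumption about where some of the overhead could come from, not a confirmed speed-up: the re module already keeps its own cache of recently compiled patterns, so any gain may be small. The names ngram_pattern and sent2ngrams_regex_cached are hypothetical.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def ngram_pattern(n):
    """Hypothetical helper: compile the lookahead pattern for a given n once."""
    # \S{n} is equivalent to writing \S n times in a row.
    return re.compile(r"(?=(\S{%d}))" % n)


def sent2ngrams_regex_cached(text, n=3):
    """Same output as a lowercasing regex version, reusing the compiled pattern."""
    return ngram_pattern(n).findall(text.lower())


print(sent2ngrams_regex_cached("Foo bar"))  # ['foo', 'bar']
```

Whether this beats the list-comprehension versions would have to be checked with the same benchmark on the same data.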

Recommended answer

Why not just (?=(...))

edit: Same thing, but excluding whitespace: (?=(\S\S\S))
edit2: You can also restrict it to exactly the characters you want, e.g. alphanumeric characters only: (?=([^\W_]{3}))
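
To make the difference between the three patterns concrete, here is a small sketch that runs each of them over the same string. The sample string and the dictionary labels are made up for illustration.

```python
import re

# Made-up sample containing an underscore, a space and a digit.
text = "ab_c d1e"

patterns = {
    "any character": r"(?=(...))",
    "non-whitespace": r"(?=(\S\S\S))",
    "alphanumeric only": r"(?=([^\W_]{3}))",
}

results = {label: re.findall(pattern, text) for label, pattern in patterns.items()}
for label, ngrams in results.items():
    print(label, ngrams)

# any character      ['ab_', 'b_c', '_c ', 'c d', ' d1', 'd1e']
# non-whitespace     ['ab_', 'b_c', 'd1e']
# alphanumeric only  ['d1e']
```

The character class [^\W_] reads as "not a non-word character and not an underscore", which leaves letters and digits.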

This uses a lookahead to capture 3 characters. Because the lookahead itself consumes nothing, the engine advances the position by one character after each match and then captures the next 3.
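
The effect of the lookahead can be seen by comparing it with a plain pattern in Python's re module. This is a minimal sketch; the variable names are illustrative only.

```python
import re

word = "foobar"

# Without a lookahead each match consumes its 3 characters,
# so matches cannot overlap.
plain = re.findall(r"...", word)
print(plain)  # ['foo', 'bar']

# Inside a lookahead the match itself is empty (start == end);
# only capture group 1 holds the 3 characters seen ahead.
spans = [(m.start(), m.end(), m.group(1)) for m in re.finditer(r"(?=(...))", word)]
for start, end, ngram in spans:
    print(start, end, ngram)
# 0 0 foo
# 1 1 oob
# 2 2 oba
# 3 3 bar
```

At positions 4 and later fewer than 3 characters remain, so the lookahead fails and no further matches are reported.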

The result for foobar is:
foo
oob
oba
bar

 # Compressed regex
 #  (?=(...))

 # Expanded regex
 (?=                   # Start Lookahead assertion
      (                     # Capture group 1 start
           .                     # dot - metachar, matches any character except newline
           .                     # dot - metachar
           .                     # dot - metachar
      )                     # Capture group 1 end
 )                     # End Lookahead assertion
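
In Python, an expanded pattern of this kind can be used directly by compiling it with the re.VERBOSE flag, which ignores whitespace in the pattern and treats # as the start of a comment. A minimal sketch, with NGRAM3 as a hypothetical name:

```python
import re

# Hypothetical constant name; the pattern is the expanded form of (?=(...)).
NGRAM3 = re.compile(
    r"""
    (?=          # start lookahead assertion
        (        # capture group 1 start
            .    # any character except newline
            .
            .
        )        # capture group 1 end
    )            # end lookahead assertion
    """,
    re.VERBOSE,
)

print(NGRAM3.findall("foobar"))  # ['foo', 'oob', 'oba', 'bar']
```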
