How can I keep track of character positions after I remove elements from a string?

Question

Let us say I have the following string:

 "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog"  
  1234567890123456789012345678901234567890123456789012345678901 <-- char pos

Now, I have written a regular expression to remove certain elements from the string above, in this example, all whitespace, all periods, and all commas.

I am left with the following transformed string:

 "mydogjumpsandheisaverygooddog"

Now, I want to construct k-grams of this string. If I were to take 5-grams of the above string, they would look like:

  mydog ydogj dogju ogjum gjump jumps umpsa ...

The problem I have is that for each k-gram, I want to keep track of its original character position in the first source text I listed.

So "mydog" would have a start position of "0" and an end position of "11". However, I have no mapping between the source text and the modified text, so I have no idea where a particular k-gram starts and ends relative to the original, unmodified text. My program needs to keep track of this.

I am creating a list of k-grams like this:

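public class Kgram
{
    public int start;    // start position in the source text
    public int end;      // end position in the source text
    public String text;  // text of the k-gram after the modifications
}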

where start and end are positions in the source text (the first string above) and text is the k-gram's text after the modifications.

Can anyone point me in the right direction for the best way to solve this problem?

Recommended answer

Don't use a regular expression 'replace' API to do your replacing. Use regexps only to find the places you want to modify, make the modification yourself, and maintain an offset mapping. One form I've used is an array of ints as big as the original string, storing 'n chars deleted here' values, but there are a host of other possibilities.
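
As a rough illustration of that idea, here is a minimal Java sketch. It is not the answerer's code: the class and method names (OffsetMapSketch, kgrams, copyKept) are made up for this example, the removal pattern is assumed to be whitespace, periods, and commas as in the question, and the language is a guess since the question's snippet reads as either Java or C#. It uses a variant of the int array: rather than storing deleted counts, it records the source index of every kept character. Positions are 0-based with an exclusive end, which reproduces the 0 and 11 given for "mydog".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; all names here are illustrative assumptions.
class OffsetMapSketch {

    // Assumed removal rule from the question: whitespace, periods, commas.
    private static final Pattern REMOVE = Pattern.compile("[\\s.,]+");

    static final class Kgram {
        final int start;    // 0-based source index of the first kept character
        final int end;      // 0-based source index one past the last kept character
        final String text;  // k-gram text taken from the cleaned string

        Kgram(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Copy source[from, to) into cleaned, remembering where each char came from.
    private static void copyKept(String source, int from, int to,
                                 StringBuilder cleaned, int[] sourceIndex) {
        for (int i = from; i < to; i++) {
            sourceIndex[cleaned.length()] = i;
            cleaned.append(source.charAt(i));
        }
    }

    static List<Kgram> kgrams(String source, int k) {
        StringBuilder cleaned = new StringBuilder();
        // sourceIndex[i] = position in source of the i-th kept character.
        int[] sourceIndex = new int[source.length()];

        // The regex only locates the runs to drop; the copying is done by hand.
        Matcher m = REMOVE.matcher(source);
        int pos = 0;
        while (m.find()) {
            copyKept(source, pos, m.start(), cleaned, sourceIndex);
            pos = m.end();
        }
        copyKept(source, pos, source.length(), cleaned, sourceIndex);

        List<Kgram> result = new ArrayList<>();
        for (int i = 0; i + k <= cleaned.length(); i++) {
            int start = sourceIndex[i];
            int end = sourceIndex[i + k - 1] + 1;
            result.add(new Kgram(start, end, cleaned.substring(i, i + k)));
        }
        return result;
    }

    public static void main(String[] args) {
        String src = "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog";
        Kgram first = kgrams(src, 5).get(0);
        System.out.println(first.text + " [" + first.start + ", " + first.end + ")");
        // prints: mydog [0, 11)
    }
}
```

The array costs one int per source character, which is usually acceptable and makes every lookup a single array read.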

The basic data structure here is an array of pairs. Each pair contains an offset and a correction. Depending on time/space tradeoffs, you may prefer to spread the information out over a data structure as large as the original string.
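
The pair-based form might look something like the following Java sketch. Again, this is an assumption-laden illustration rather than the answerer's implementation: the class name CorrectionPairsSketch, the method toSource, and the removal pattern are all hypothetical. Each pair holds an offset in the cleaned string and the total number of characters deleted before that offset; a binary search finds the last pair at or before a given cleaned index, and adding its correction gives the 0-based source index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; all names here are illustrative assumptions.
class CorrectionPairsSketch {

    // Assumed removal rule from the question: whitespace, periods, commas.
    private static final Pattern REMOVE = Pattern.compile("[\\s.,]+");

    final String cleaned;

    // Each entry is {offset in cleaned text, chars deleted before that offset}.
    // Offsets are strictly increasing, so the list can be binary searched.
    private final List<int[]> pairs = new ArrayList<>();

    CorrectionPairsSketch(String source) {
        StringBuilder out = new StringBuilder();
        Matcher m = REMOVE.matcher(source);
        int pos = 0;
        int deleted = 0;
        while (m.find()) {
            out.append(source, pos, m.start());   // keep the text before the run
            deleted += m.end() - m.start();       // the run itself is dropped
            pairs.add(new int[] { out.length(), deleted });
            pos = m.end();
        }
        out.append(source, pos, source.length());
        cleaned = out.toString();
    }

    // Map a 0-based index in the cleaned text to a 0-based index in the source.
    int toSource(int cleanedIndex) {
        int lo = 0;
        int hi = pairs.size() - 1;
        int correction = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (pairs.get(mid)[0] <= cleanedIndex) {
                correction = pairs.get(mid)[1];
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return cleanedIndex + correction;
    }

    public static void main(String[] args) {
        CorrectionPairsSketch map = new CorrectionPairsSketch(
            "my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog");
        int k = 5;
        // The first 5-gram covers cleaned indices [0, 5).
        System.out.println(map.toSource(0) + " " + (map.toSource(k - 1) + 1));
        // prints: 0 11
    }
}
```

This stores one pair per deleted run instead of one int per character, at the cost of a logarithmic lookup; which side of that trade-off is better depends on how sparse the deletions are.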
