Text replacement efficiency

Question

An extension to my previous question:
Text cleaning and replacement: delete \n from a text in Java

I am cleaning this incoming text, which comes from a database with irregular text. That means there's no standard or rules. Some entries contain HTML entities like &reg, &trade, &lt, and others come in numeric form: &#8221, &#8211, etc. Other times I just get HTML tags with < and >.

I am using String.replace to replace the entities with the characters they stand for (this should be fine since I'm using UTF-8, right?), and replaceAll() with a regular expression to remove the HTML tags.
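For reference, the approach described above might look roughly like the following. This is only a sketch: the class name NaiveCleaner, the tag pattern, and the entity table are illustrative assumptions, and the table assumes the semicolon-terminated form of each entity.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the approach from the question.
final class NaiveCleaner {
    // Compiled once and reused; a rough pattern for anything that looks like a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Assumed sample of entities; a real table would be much longer.
    private static final String[][] ENTITIES = {
        {"&reg;", "\u00AE"},    // registered sign
        {"&trade;", "\u2122"},  // trade mark sign
        {"&#8221;", "\u201D"},  // right double quotation mark
        {"&#8211;", "\u2013"},  // en dash
        {"&lt;", "<"},
    };

    static String clean(String input) {
        // Strip tags first so that a decoded "&lt;" is not later mistaken for a tag.
        String text = TAG.matcher(input).replaceAll("");
        // One String.replace call per entity: simple, but one pass over the text each time.
        for (String[] pair : ENTITIES) {
            text = text.replace(pair[0], pair[1]);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>Acme&reg; 2001&#8211;2009</p>"));
    }
}
```

Java strings hold UTF-16 code units internally, so String.replace itself does not depend on UTF-8; the encoding matters when the text is read from or written to the database.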

Other than one call to the replace() function for each replacement, and compiling the HTML tags regular expression, is there any recommendation to make this replacement efficient?

Recommended answer

My first suggestion is to measure the performance of the simplest way of doing it (which is probably multiple replace/replaceAll calls). Yes, it's potentially inefficient. Quite often the simplest way of doing this is inefficient. You need to ask yourself: how much do you care?

Do you have sample data and a threshold at which point the performance is acceptable? If you don't, that's the first port of call. Then test the naive implementation, and see whether it really is a problem. (Bear in mind that string replacement is almost certainly only part of what you're doing. As you're fetching the text from a database to start with, that may well end up being the bottleneck.)
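A rough way to test the naive implementation against a threshold could look like this. It is a sketch only: the class ReplaceTiming, the sample string, the round count, and the 50 microsecond threshold are all made-up assumptions, and a dedicated harness such as JMH would give more trustworthy numbers than a hand-rolled loop.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical timing helper; not a substitute for a proper benchmark harness.
final class ReplaceTiming {
    // Written to so the JIT cannot discard the cleaning work as unused.
    static volatile long sink;

    // Returns the average time in nanoseconds to clean one sample.
    static long averageNanos(UnaryOperator<String> cleaner, List<String> samples, int rounds) {
        if (samples.isEmpty() || rounds <= 0) {
            throw new IllegalArgumentException("need at least one sample and one round");
        }
        // Warm-up pass so the measured pass runs mostly JIT-compiled code.
        for (int r = 0; r < rounds; r++) {
            for (String s : samples) {
                sink += cleaner.apply(s).length();
            }
        }
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String s : samples) {
                sink += cleaner.apply(s).length();
            }
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / ((long) rounds * samples.size());
    }

    public static void main(String[] args) {
        // Assumed sample data; real measurements should use text from the actual database.
        List<String> samples = List.of("<p>Acme&reg; 2001&#8211;2009 &lt;draft&gt;</p>");
        UnaryOperator<String> naive = s -> s.replaceAll("<[^>]+>", "")
                .replace("&reg;", "\u00AE")
                .replace("&#8211;", "\u2013");
        long perSample = averageNanos(naive, samples, 100_000);
        long thresholdNanos = 50_000; // assumed acceptable cost per row
        System.out.println(perSample + " ns per sample, acceptable: " + (perSample <= thresholdNanos));
    }
}
```

If the per-row figure is already small next to the time spent fetching the row from the database, there is little to gain from optimising the replacement.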

Once you've determined that the replacement really is the bottleneck, it's worth performing some tests to see which bits of the replacement are causing the biggest problem - it sounds like you're doing several different kinds of replacement. The more you can narrow it down, the better: you may find that the real bottleneck in the simplest code is caused by something which is easy to make efficient in a reasonably simple way, whereas trying to optimise everything would be a lot harder.
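One way to see which kinds of replacement cost the most is to time each step separately over the same samples. The sketch below assumes a hypothetical StepProfiler class and three made-up steps; the calls to System.nanoTime add overhead of their own, so the totals are only useful for comparing steps against each other.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical profiler that accumulates elapsed time per named cleaning step.
final class StepProfiler {
    // Runs the steps in insertion order on every sample and returns total nanoseconds per step.
    static Map<String, Long> profile(Map<String, UnaryOperator<String>> steps,
                                     List<String> samples, int rounds) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String name : steps.keySet()) {
            totals.put(name, 0L);
        }
        for (int r = 0; r < rounds; r++) {
            for (String sample : samples) {
                String text = sample;
                for (Map.Entry<String, UnaryOperator<String>> step : steps.entrySet()) {
                    long start = System.nanoTime();
                    text = step.getValue().apply(text);
                    totals.merge(step.getKey(), System.nanoTime() - start, Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Pattern tag = Pattern.compile("<[^>]+>");
        // Assumed breakdown of the cleaning work; adjust to match the real code.
        Map<String, UnaryOperator<String>> steps = new LinkedHashMap<>();
        steps.put("strip tags", s -> tag.matcher(s).replaceAll(""));
        steps.put("named entities", s -> s.replace("&reg;", "\u00AE").replace("&trade;", "\u2122"));
        steps.put("numeric entities", s -> s.replace("&#8221;", "\u201D").replace("&#8211;", "\u2013"));

        List<String> samples = List.of("<p>Acme&reg; &#8211; <b>new</b> &trade;</p>");
        profile(steps, samples, 10_000).forEach((name, nanos) ->
                System.out.println(name + ": " + nanos / 1_000_000.0 + " ms"));
    }
}
```

Whichever step dominates is the one worth rewriting first; the others can stay as plain replace calls.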
