Efficiently reading two comma-separated floats in brackets from a string without being affected by the global locale

Problem description

I am a developer of a library, and our old code uses sscanf() and sprintf() to read and write a variety of internal types from and to strings. We have had issues with users whose locale differed from the one our XML files are based on (the "C" locale). In our case this resulted in incorrect values being parsed from those XML files and from strings submitted at run time. The locale may be changed by the user directly, but it can also change without the user's knowledge. This can happen if the locale change occurs inside another library, such as GTK, which was the "perpetrator" in one bug report. We therefore want to remove every dependency on the locale to free ourselves from these issues permanently.

I have already read other questions and answers about parsing float/double/int/... values, especially ones separated by a character or enclosed in brackets, but so far the proposed solutions I found did not satisfy us. Our requirements are:

  1. No dependencies on libraries other than the standard library. Using anything from boost is therefore, for example, not an option.

  2. Must be thread-safe. This is meant specifically with regard to the locale, which can be changed globally. This is really awful for us, because a thread of our library can be affected by another thread in the user's program, which may be running code of a completely different library. Anything affected directly by setlocale() is therefore not an option. Setting the locale before reading/writing and restoring the original value afterwards is not a solution either, because of race conditions between threads.

  3. While efficiency is not the topmost priority (#1 and #2 are), it is still definitely a concern, as strings may be read and written quite frequently at run time, depending on the user's program. The faster, the better.

As an additional note: boost::lexical_cast is not guaranteed to be unaffected by the locale (source: Locale invariant guarantee of boost::lexical_cast<>). So that would not be a solution even without requirement #1.

I gathered the following information so far:

  • First of all, what I saw suggested a lot is boost's lexical_cast, but unfortunately this is not an option for us at all, as we can't require all users to also link to boost (and because of the lack of locale safety, see above). I looked at the code to see whether we could extract anything from it, but I found it difficult to understand and too long, and most likely the big performance gains come from locale-dependent functions anyway.
  • Many functions introduced in C++11, such as std::to_string, std::stod, std::stof, etc. depend on the global locale just the way sscanf and sprintf do, which is extremely unfortunate and to me not understandable, considering that std::thread has been added.
  • std::stringstream seems to be a solution in general, since it is thread-safe in the context of the locale, but also in general if guarded right. However, if it is constructed freshly every time it can be slow (good comparison: http://www.boost.org/doc/libs/1_55_0/doc/html/boost_lexical_cast/performance.html). I assume this can be solved by having one such stream per thread configured and available, clearing it each time after usage. However, a problem is that it doesn't solve formats as easily as sscanf() does, for example: " { %g , %g } ".
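The per-thread stream idea from the last bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `ClassicInput`, `expect_char` and `parse_pair` are hypothetical names, not part of any library, and the stream is pinned to the "C" locale with `imbue(std::locale::classic())` so that neither `setlocale()` nor `std::locale::global()` influences the parsing.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: skip leading whitespace, then require `expected`.
// Sets failbit on a mismatch so that later extractions become no-ops.
inline std::istream& expect_char(std::istream& in, char expected) {
    char c = 0;
    if (in >> c && c != expected) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Hypothetical wrapper: a stream that is imbued with the classic locale once.
struct ClassicInput {
    std::istringstream stream;
    ClassicInput() { stream.imbue(std::locale::classic()); }
};

// Parses text shaped like " { %g , %g } ". Returns false and leaves x and y
// untouched if the text does not match.
bool parse_pair(const std::string& text, float& x, float& y) {
    thread_local ClassicInput reusable;   // one configured stream per thread
    std::istringstream& in = reusable.stream;
    in.clear();                           // drop error flags from the last use
    in.str(text);                         // replace the buffer contents

    float a = 0.0f, b = 0.0f;
    expect_char(in, '{');
    in >> a;
    expect_char(in, ',');
    in >> b;
    expect_char(in, '}');
    if (in.fail()) {
        return false;
    }
    x = a;
    y = b;
    return true;
}
```

Because the stream is `thread_local`, no locking is needed, and the cost of constructing and imbuing it is paid once per thread rather than once per call.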

The sscanf() patterns are:

  • " { %g , %g }"
  • " { { %g , %g } , { %g , %g } }"
  • " { top: { %g , %g } , left: { %g , %g } , bottom: { %g , %g } , right: { %g , %g }"

Writing these with stringstreams seems no big deal, but reading them seems problematic, especially considering the whitespaces.
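For the writing direction, a minimal sketch might look like this. `format_pair` is a hypothetical name; the only important step is imbuing the classic locale, since a freshly constructed stream otherwise copies the current global C++ locale.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: writes two floats in the "{ %g , %g }" layout.
std::string format_pair(float x, float y) {
    std::ostringstream out;
    out.imbue(std::locale::classic());  // always '.' as decimal point, no grouping
    out << "{ " << x << " , " << y << " }";
    return out.str();
}
```

With the default stream flags, floating-point output follows the same general format as `%g` with a precision of 6.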

Should we use std::regex in this context, or is that overkill? Are stringstreams a good solution for this task, or is there a better way given the requirements above? Also, are there any other problems regarding thread safety and locales that I have not considered in my question, especially regarding the use of std::stringstream?

Recommended answer

In your case the stringstream seems to be the best approach, as you can control its locale independently of the global locale that was set. But it's true that formatted reading is not as easy as with sscanf().

From the point of view of performance, stream input with regex is overkill for this kind of simple comma-separated reading: in an informal benchmark it was more than 10 times slower than scanf().

You can easily write a small auxiliary class to facilitate reading formats like the ones you have enumerated. The general idea is described in another SO answer. Its use can be as easy as:

sst >> mandatory_input(" { ") >> x >> mandatory_input(" , ") >> y >> mandatory_input(" } ");
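To make the one-liner above concrete, here is a hedged sketch of what such a helper might look like. It is a hypothetical reconstruction of the idea, not the answer author's actual class, and it assumes sscanf-like semantics: whitespace in the pattern matches any amount of input whitespace (including none), and every other character must match exactly. `read_pair` is likewise a made-up wrapper for illustration.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical stand-in for the helper named in the answer.
struct mandatory_input {
    const char* pattern;
    explicit mandatory_input(const char* p) : pattern(p) {}
};

inline std::istream& operator>>(std::istream& in, const mandatory_input& m) {
    const std::locale& c_locale = std::locale::classic();
    for (const char* p = m.pattern; *p != '\0' && in; ++p) {
        if (std::isspace(*p, c_locale)) {
            in >> std::ws;  // pattern whitespace: skip zero or more blanks
        } else if (in.get() != std::char_traits<char>::to_int_type(*p)) {
            in.setstate(std::ios_base::failbit);  // literal did not match
        }
    }
    return in;
}

// Hypothetical usage wrapper. Returns false on any mismatch; x and y may
// have been modified in that case.
bool read_pair(const std::string& text, double& x, double& y) {
    std::istringstream sst(text);
    sst.imbue(std::locale::classic());  // independent of the global locale
    sst >> mandatory_input(" { ") >> x >> mandatory_input(" , ") >> y
        >> mandatory_input(" } ");
    return !sst.fail();
}
```

The whitespace test deliberately uses the two-argument `std::isspace` from `<locale>` with the classic locale rather than the `<cctype>` version, because the latter depends on the global C locale.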

If you're interested, I wrote one some time ago; the full article provides examples and explanations as well as the source code. The class is 70 lines of code, but most of them provide error-processing functions in case these are needed. It has acceptable performance, but is still slower than scanf().