Converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2


Problem description


Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ...]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float, etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

Recommended answers

On May 18, 6:07 pm, py_genetic <conor.robin...@gmail.com> wrote:

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ...]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float, etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?




Given the string s:

    try:
        integerValue = int(s)
    except ValueError, e:
        try:
            floatValue = float(s)
        except ValueError:
            pass
        else:
            s = floatValue
    else:
        s = integerValue

I believe it will automatically identify base 8 and base 16 integers
(but not base 8/16 floats).
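
For illustration, here is a minimal sketch of the same test wrapped up as a reusable helper (the helper name and the sample row are mine, not from the post; commas are not handled, so '100,000,000,000' would stay a string):

    def coerce_value(s):
        """Return int(s) if it parses, else float(s), else s unchanged."""
        try:
            return int(s)
        except ValueError:
            try:
                return float(s)
            except ValueError:
                return s

    row = ['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']
    converted = [coerce_value(v) for v in row]
    # converted == [8, 2.33, 'A', 'BB', 'hello there', '100,000,000,000']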


2. Is there anything out there already which deals with this issue?

Thanks,
Conor


py_genetic wrote:

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ...]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float, etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor




This is untested, but here is an outline to do what you want.

First convert rows to columns:

    columns = zip(*rows)

Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:

    def make_int(astr):
        if not astr:
            return 0
        else:
            return int(astr.replace(',', ''))

    def make_float(astr):
        if not astr:
            return 0.0
        else:
            return float(astr.replace(',', ''))

    make_str = lambda s: s

Now you can put the converters in a list, remembering to order them.

    converters = [make_int, make_float, make_str]

Now, go down the columns checking, moving to the next, less restrictive,
converter when a particular converter fails. We assume that the make_str
identity operator will never fail. We could leave it out and have a
flag, etc., for efficiency, but that is left as an exercise.

    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except:
                continue
        new_columns.append(new_column)

For no reason at all, convert back to rows:

    new_rows = zip(*new_columns)

You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, eliminating
the clumsy make_str). But adding tests to improve efficiency would cloud
the logic.

James
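
As a quick usage sketch of the outline above (it assumes the make_int, make_float and make_str converters and the converters list from that outline are already defined; the sample row is the one from the question, and the bare except is narrowed to ValueError here):

    rows = [['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']]

    columns = zip(*rows)
    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except ValueError:
                continue
        new_columns.append(new_column)

    # new_columns ends up roughly as:
    # [[8], [2.33], ['A'], ['BB'], ['hello there'], [100000000000]]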


On 19/05/2007 10:04 AM, James Stroud wrote:

py_genetic wrote:

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database/CSV file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert each attribute column into for further
processing and efficient storage...

Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ...]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float, etc.

1. What is the most efficient way in Python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor




This is untested, but here is an outline to do what you want.

First convert rows to columns:

    columns = zip(*rows)

Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:

    def make_int(astr):
        if not astr:
            return 0
        else:
            return int(astr.replace(',', ''))

    def make_float(astr):
        if not astr:
            return 0.0
        else:
            return float(astr.replace(',', ''))

    make_str = lambda s: s

Now you can put the converters in a list, remembering to order them.

    converters = [make_int, make_float, make_str]

Now, go down the columns checking, moving to the next, less restrictive,
converter when a particular converter fails. We assume that the make_str
identity operator will never fail. We could leave it out and have a
flag, etc., for efficiency, but that is left as an exercise.

    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except:
                continue
        new_columns.append(new_column)

For no reason at all, convert back to rows:

    new_rows = zip(*new_columns)

You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, eliminating
the clumsy make_str). But adding tests to improve efficiency would cloud
the logic.




[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are
actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99,
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
before you allow for dates in three different orders (dmy, mdy, ymd).
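
To make that concrete, a small illustration using only the standard time module (the formats shown are examples picked for this note, not an exhaustive list):

    import time

    # '311299' converts cleanly to an int, so a numeric test alone cannot
    # rule out a date, and an ambiguous string accepts more than one field order.
    as_int = int('311299')                            # 311299
    dmy = time.strptime('01/02/99', '%d/%m/%y')[:3]   # (1999, 2, 1) -- 1 Feb
    mdy = time.strptime('01/02/99', '%m/%d/%y')[:3]   # (1999, 1, 2) -- 2 Jan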

Constraint (2) just falls apart -- with user-supplied data, there seem
to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).

For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.

Example: file of 1,079,000 records, with 15 columns, ultimately
diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order),
and [no kidding] 1 x date (ymd order). Using N==101 took about 15
seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes
about 900 seconds. The "converter" function for dates is written in C.

Cheers,
John
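
A rough sketch of that column-scoring idea (the names, the candidate list and the tiny comma-stripping converters are illustrative; dates are omitted, and the 90% threshold and every-Nth-record sampling follow the description above):

    def make_int(s):
        return int(s.replace(',', ''))

    def make_float(s):
        return float(s.replace(',', ''))

    def success_rate(values, convert):
        """Fraction of values that convert without raising ValueError."""
        ok = 0
        for v in values:
            try:
                convert(v)
                ok += 1
            except ValueError:
                pass
        return float(ok) / len(values)

    def diagnose_column(values, n=101, threshold=0.9):
        """Guess a column's type from every n-th value; default to text."""
        sample = values[::n]
        if not sample:
            return 'text'
        best_name, best_rate = 'text', 0.0
        for name, convert in [('int', make_int), ('float', make_float)]:
            rate = success_rate(sample, convert)
            if rate > best_rate:
                best_name, best_rate = name, rate
        return best_name if best_rate >= threshold else 'text'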


