The best way to mark (split?) dataset in each string
Problem description
I have a dataset containing 485k strings (1.1 GB).
Each string contains about 700 characters encoding about 250 variables (1-16 characters per variable), but it has no delimiters. The length of each variable is known. What is the best way to split the data and mark the variable boundaries with a comma (,)?
For example, I have strings like:
0123456789012...
1234567890123...
and an array of lengths:
5,3,1,4,...
Then I should get this:
01234,567,8,9012,...
12345,678,9,0123,...
Could anyone help me with this? Python or R tools are preferred.
pandas can load this using read_fwf:
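In [321]:
import io
import pandas as pd
t="""0123456789012..."""
# dtype=str keeps leading zeros (otherwise 01234 is parsed as the integer 1234)
pd.read_fwf(io.StringIO(t), widths=[5,3,1,4], header=None, dtype=str)
Out[321]:
       0    1  2     3
0  01234  567  8  9012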
This gives you a DataFrame in which each variable is its own column, so you can access any of them individually for whatever purpose you require.
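If the goal is only to insert the commas rather than to work with a DataFrame, plain string slicing may be enough. The following is a minimal sketch, not a tuned solution for the full 1.1 GB file; the helper names split_fixed and mark_lines are made up for illustration, and the widths list is the shortened one from the question's example rather than the real 250 lengths.

```python
from itertools import accumulate

def split_fixed(line, widths):
    """Cut `line` into consecutive fields with the given widths."""
    ends = list(accumulate(widths))   # running end offsets: 5, 8, 9, 13
    starts = [0] + ends[:-1]          # matching start offsets: 0, 5, 8, 9
    return [line[s:e] for s, e in zip(starts, ends)]

def mark_lines(lines, widths, sep=","):
    """Yield each input line with `sep` inserted between its fields."""
    for line in lines:
        yield sep.join(split_fixed(line.rstrip("\n"), widths))

# Example data from the question (widths shortened for illustration).
widths = [5, 3, 1, 4]
rows = ["0123456789012\n", "1234567890123\n"]

marked = list(mark_lines(rows, widths))
for m in marked:
    print(m)
```

Because mark_lines is a generator that accepts any iterable of lines, it can be fed an open file object and the results written out one line at a time, so the whole dataset never has to sit in memory. The fields stay as strings, so leading zeros are preserved.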