Given a string of a million numbers, return all repeating 3 digit numbers


Problem description

I had an interview with a hedge fund company in New York a few months ago and, unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)

I pretty much screwed up on the first interview problem...

Question: Given a string of a million digits (Pi, for example), write a function/program that returns all repeating 3-digit numbers along with how many times each occurs (counts greater than 1).

For example: if the string was: 123412345123456 then the function/program would return:

123 - 3 times
234 - 3 times
345 - 2 times

They did not give me the solution after I failed the interview, but they did tell me that the time complexity of the solution was a constant 1000, since all the possible outcomes are between:

000 --> 999

Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?

Solution

You got off lightly; you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)

There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.

Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.

It appears to me you could have impressed them in a number of ways.

First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.

Second, by showing your elite skills by providing Pythonic code such as:

inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print([(num, freq[num]) for num in range(1000) if freq[num] > 1])

This outputs:

[(123, 3), (234, 3), (345, 2)]

though you could, of course, modify the output format to anything you desire.
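As one illustration of a different output format, here is a minimal sketch that wraps the same idea in a function and uses `collections.Counter` from the standard library. The function name `repeated_triples` is made up for this example, and keeping the keys as strings is an assumption on my part: it preserves leading zeros (so "007" is not reported as 7), which the integer-indexed table above does not show.

```python
from collections import Counter


def repeated_triples(digits):
    """Return (triple, count) pairs for every 3-digit window seen more than once."""
    # One pass over the string: O(n) windows, at most 1000 distinct keys.
    counts = Counter(digits[pos:pos + 3] for pos in range(len(digits) - 2))
    # Keys stay as strings so that windows such as "007" keep their leading zeros.
    return sorted((triple, n) for triple, n in counts.items() if n > 1)


print(repeated_triples("123412345123456"))
# [('123', 3), ('234', 3), ('345', 2)]
```

A string shorter than three characters simply yields an empty list, since the range of window positions is empty.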

And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
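If you want to check the scaling on your own machine, something like the following rough harness would do. It is only a sketch: the helper names `count_triples` and `time_count` are invented here, the input is random digits rather than Pi, and the timings you get will differ from the figures quoted above depending on hardware and Python version.

```python
import random
import time


def count_triples(digits):
    """Frequency table indexed by the integer value of each 3-digit window."""
    freq = [0] * 1000
    for pos in range(len(digits) - 2):
        freq[int(digits[pos:pos + 3])] += 1
    return freq


def time_count(n_digits, seed=0):
    """Return the seconds taken to count all windows of a random n_digits string."""
    rng = random.Random(seed)
    digits = "".join(rng.choices("0123456789", k=n_digits))
    # Time only the counting, not the generation of the test data.
    start = time.perf_counter()
    count_triples(digits)
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} digits: {time_count(n):.3f} s")
```

If the algorithm really is O(n), each tenfold increase in input size should take roughly ten times as long.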

And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.

Not with threads inside a single Python interpreter, of course, because the GIL stops them from running Python bytecode in parallel, but you could split the string into something like this (the overlap indicated by vv is required to allow proper processing of the boundary areas):

    vv
123412  vv
    123451
        5123456

You can farm these out to separate workers and combine the results afterwards.

The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
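The split, farm-out and combine steps could be sketched roughly as below, using `concurrent.futures` from the standard library. Treat it as an assumption-laden outline, not a tuned implementation: the names `count_chunk`, `split_with_overlap` and `count_parallel` are made up for this example, and the chunks here are cut so that each one shares two characters with the next, which gives different boundaries from the diagram above but the same totals.

```python
import concurrent.futures


def count_chunk(chunk):
    """Count every 3-digit window inside one chunk."""
    freq = [0] * 1000
    for pos in range(len(chunk) - 2):
        freq[int(chunk[pos:pos + 3])] += 1
    return freq


def split_with_overlap(digits, parts):
    """Split digits into about `parts` chunks, each sharing 2 characters with the next."""
    # Divide the window start positions evenly (ceiling division).
    step = max(1, -(-(len(digits) - 2) // parts))
    # Each chunk holds `step` window starts plus the 2 trailing overlap characters.
    return [digits[start:start + step + 2]
            for start in range(0, len(digits) - 2, step)]


def count_parallel(digits, parts=4,
                   executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Count windows chunk by chunk in separate workers, then sum the tables."""
    chunks = split_with_overlap(digits, parts)
    total = [0] * 1000
    if not chunks:
        return total
    with executor_cls(max_workers=parts) as pool:
        for freq in pool.map(count_chunk, chunks):
            total = [a + b for a, b in zip(total, freq)]
    return total
```

Because every window start belongs to exactly one chunk, nothing is counted twice. On platforms that start worker processes by spawning a fresh interpreter, the call to `count_parallel` needs to sit under an `if __name__ == "__main__":` guard. Sending each chunk to a worker also means copying it, which is part of the overhead mentioned above.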


This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.

For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code took to process one million. In other words, much faster:

#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.

    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.

    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.

    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.

        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.

    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
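
For comparison, the rolling-value trick used in the C loop (drop the hundreds digit, shift, add the new digit as units) can be written in Python as well. This is just a port for illustration, with the made-up name `count_triples_rolling`; I have not measured it, and in CPython it may well be no faster than the slicing version, so the "measure, don't guess" advice applies again.

```python
def count_triples_rolling(digits):
    """Count 3-digit windows by updating a running value one digit at a time."""
    freq = [0] * 1000
    # Need at least three digits to do anything useful.
    if len(digits) < 3:
        return freq
    # Seed the running value with the first two digits.
    val = int(digits[0]) * 10 + int(digits[1])
    for ch in digits[2:]:
        # Drop the hundreds digit, shift left, append the new units digit.
        val = (val % 100) * 10 + int(ch)
        freq[val] += 1
    return freq


freq = count_triples_rolling("123412345123456")
print([(num, freq[num]) for num in range(1000) if freq[num] > 1])
# [(123, 3), (234, 3), (345, 2)]
```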
