Python: PyCharm runtimes


Problem Description

I am witnessing some strange run time issues with PyCharm that are explained below. The code has been run on a machine with 20 cores and 256 GB RAM and there is sufficient memory to spare. I am not showing any of the real functions as it is a reasonably large project, but am more than happy to add details upon request.

In short, I have a .py file project with the following structure:

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    # The flat result list interleaves four fields per trip;
    # step-4 slices unpick them into columns.
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df, two_trips_df, time_dist_arr = st.run()

    # Prepare the dataframe to sample start times. Create groups from the input dataframe
    temp_df1 = st.prepare_two_trips_more_df(temp_df, two_trips_df)
    validation.logger.info("Dataframe prepared for multiprocessing")

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):  ### problem lies here in runtimes
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" %len(grp_list))

################ PARALLEL CODE BELOW #################

The for loop runs on a dataframe of 10.5 million rows and 18 columns. In its current form it takes about 25 minutes to create the list of groups (2.8 million groups). These groups are then fed to a multiprocessing pool, the code for which is not shown.

The 25 minutes is quite long, because I have also done the following test run, which takes only 7 minutes. Essentially, I saved temp_df1 to a CSV, read the pre-saved file back in, and ran the same for loop as before.
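One concrete difference between the two runs is worth noting (it turns out to matter, per the answer below): a CSV round-trip does not preserve pandas dtypes, so a `category` column written out by an upstream function comes back from `read_csv` as a plain dtype. A minimal sketch, with the column name borrowed from the question:

```python
import io
import pandas as pd

# Illustrative only: CSV stores no dtype metadata, so a `category`
# column comes back from read_csv as a plain inferred dtype.
df = pd.DataFrame({"tour_id": pd.Categorical([1, 2, 2, 3])})
print(df["tour_id"].dtype)   # category

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2["tour_id"].dtype)  # int64
```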

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

def collect_results(result_list):
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)

    temp_df1 = pd.read_csv(r"c:\...\temp_df1.csv")
    time_dist = pd.read_csv(r"c:\...\start_time_distribution_treso_1.csv")
    time_dist_arr = np.array(time_dist.to_records())

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" %len(grp_list))

QUESTION: So what is causing the code to run 3 times faster when I simply read the file in, versus when the file is created by a function further upstream?

Thanks in advance and please let me know how I can further clarify.

Answer

I am answering my own question, as I stumbled upon the answer while doing a bunch of tests, and thankfully, when I googled for a solution, someone else had run into the same issue: https://stackoverflow.com/questions/48042952/pandas-dataframe-aggregate-on-column-whos-dtype-category-leads-to-slow-perf/51164942#51164942. The explanation of why grouping on a categorical column is a bad idea can be found at that link, so I am not going to repeat it here. Thanks.
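To make the linked explanation concrete, here is a small sketch with sizes scaled down (`tour_id` as in the question): by default, grouping on a categorical column makes pandas materialize one group per *category*, including categories not present in the frame, unless `observed=True` is passed or the categorical dtype is dropped first.

```python
import numpy as np
import pandas as pd

n_rows = 1000
df = pd.DataFrame({
    "tour_id": np.arange(n_rows) // 4,   # 250 distinct tours actually present
    "trip_id": np.arange(n_rows),
})

# Simulate the upstream output: a categorical whose category set is far
# larger than the set of values actually present in this frame.
df["tour_id"] = df["tour_id"].astype(pd.CategoricalDtype(categories=range(20_000)))

# Slow path: one (mostly empty) group per category.
slow_groups = list(df.groupby("tour_id", observed=False))

# Fix 1: drop the categorical dtype before grouping.
fast_groups = list(df.assign(tour_id=df["tour_id"].astype("int64")).groupby("tour_id"))

# Fix 2: keep the dtype but only materialize observed categories.
observed_groups = list(df.groupby("tour_id", observed=True))

print(len(slow_groups), len(fast_groups), len(observed_groups))
```

At the question's scale (2.8 million real groups inside a much larger category set), the empty groups dominate, which accounts for the 25-minute run; the CSV round-trip in the test run silently dropped the categorical dtype and thereby avoided the problem.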
