Automatically multiprocessing a 'function apply' on a dataframe column

Views: 67 | Published: 2020/5/13 20:17:44 | Tags: python performance python-2.7 pandas multiprocessing

Question

I have a simple dataframe with two columns.

+---------+-------+
| subject | score |
+---------+-------+
| wow     | 0     |
+---------+-------+
| cool    | 0     |
+---------+-------+
| hey     | 0     |
+---------+-------+
| there   | 0     |
+---------+-------+
| come on | 0     |
+---------+-------+
| welcome | 0     |
+---------+-------+

For every record in the 'subject' column, I call a function and store the result in the 'score' column:

df['score'] = df['subject'].apply(find_score)

Here, find_score is a function that processes a string and returns a score:

def find_score (row):
    # Imports the Google Cloud client library
    from google.cloud import language

    # Instantiates a client
    language_client = language.Client()

    import re
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score) 

    return sentiment.score

This works as expected, but it's quite slow because it processes the records one by one.

Is there a way to parallelise this without manually splitting the dataframe into smaller chunks? Is there a library that does this automatically?

Cheers

Recommended answer

The instantiation of language.Client every time you call the find_score function is likely a major bottleneck. You don't need to create a new client instance for every use of the function, so try creating it outside the function, before you call it:

import re

# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client once, outside the function, so every call reuses it
language_client = language.Client()

def find_score(row):
    # Strip HTML tags, then replace non-word characters with spaces
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)

    document = language_client.document_from_text(text)

    # Detects the sentiment of the text
    sentiment = document.analyze_sentiment().sentiment

    print("Sentiment score - %f " % sentiment.score)

    return sentiment.score

df['score'] = df['subject'].apply(find_score)
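To see why moving the client out of the function matters, here is a minimal sketch that counts how often a client gets built in each style. SlowClient and both find_score_* helpers are hypothetical stand-ins made up for this illustration (the stub just counts words); they are not part of the Google Cloud library.

```python
class SlowClient(object):
    """Hypothetical stand-in for language.Client(); counts how often it is built."""
    instances = 0

    def __init__(self):
        # Imagine credential loading and connection setup happening here
        SlowClient.instances += 1

    def score(self, text):
        # Placeholder scoring: number of whitespace-separated words
        return float(len(text.split()))


def find_score_per_call(text):
    client = SlowClient()  # rebuilt for every row
    return client.score(text)


shared_client = None  # assigned once further down


def find_score_shared(text):
    return shared_client.score(text)  # reuses the module-level client


subjects = ['wow', 'cool', 'hey', 'there', 'come on', 'welcome']

SlowClient.instances = 0
per_call = [find_score_per_call(s) for s in subjects]
builds_per_call = SlowClient.instances

SlowClient.instances = 0
shared_client = SlowClient()
shared = [find_score_shared(s) for s in subjects]
builds_shared = SlowClient.instances

print("%d builds per call vs %d shared" % (builds_per_call, builds_shared))
```

Both styles return the same scores; only the number of client constructions differs, which is the cost the answer suggests avoiding.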

If you insist, you can use multiprocessing like this:

from multiprocessing import Pool
# <Define functions and datasets here>
pool = Pool(processes = 8) # or some number of your choice
df['score'] = pool.map(find_score, df['subject'])
pool.terminate()
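A slightly fuller sketch of the same idea, assuming you want each worker process to build its own client once rather than sharing one across processes. StubClient, init_worker and score_all are hypothetical names introduced here, and the stub scores text by word count so the example runs without Google Cloud credentials. Pool's initializer argument and the `if __name__ == '__main__':` guard are standard multiprocessing features.

```python
import re
from multiprocessing import Pool

_client = None  # one client per worker process, set by init_worker


class StubClient(object):
    """Hypothetical stand-in for language.Client(); scores text by word count."""

    def score(self, text):
        return float(len(text.split()))


def init_worker():
    # Runs once in each worker process, so the client is built once per
    # worker rather than once per row.
    global _client
    _client = StubClient()


def find_score(row):
    # Same cleaning steps as in the question
    pre_text = re.sub('<[^>]*>', '', row)
    text = re.sub(r'[^\w]', ' ', pre_text)
    return _client.score(text)


def score_all(values, processes=4):
    pool = Pool(processes=processes, initializer=init_worker)
    try:
        # map() blocks until all results are back and preserves input order
        return pool.map(find_score, list(values))
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to exit


if __name__ == '__main__':
    subjects = ['wow', 'cool', 'hey', 'there', 'come on', 'welcome']
    print(score_all(subjects))
    # With a DataFrame: df['score'] = score_all(df['subject'])
```

Because pool.map returns results in input order, the list can be assigned straight back to the column. The guard around the driver code matters on platforms that start workers by re-importing the main module.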
