Convert a column in a dask dataframe to a TaggedDocument for Doc2Vec

Problem Description

Currently I am trying to use dask in concert with gensim to do NLP document computation, and I'm running into an issue when converting my corpus into a TaggedDocument.

Because I've tried so many different ways to wrangle this problem, I'll list my attempts.

Each attempt at dealing with this problem is met with slightly different woes.

df.info()
<class 'dask.dataframe.core.DataFrame'>
Columns: 5 entries, claim_no to litigation
dtypes: object(2), int64(3)


  claim_no   claim_txt                                           CL  ICC lit
0 8697278-17 battery comprising interior battery active ele... 106    2   0


Desired Output


>>tagged_document[0]
>>TaggedDocument(words=['battery', 'comprising', 'interior', 'battery', 'active', 'elements', 'battery', 'cell', 'casing', 'said', 'cell', 'casing', 'comprising', 'first', 'casing', 'element', 'first', 'contact', 'surface', 'second', 'casing', 'element', 'second', 'contact', 'surface', 'wherein', 'assembled', 'position', 'first', 'second', 'contact', 'surfaces', 'contact', 'first', 'second', 'casing', 'elements', 'encase', 'active', 'materials', 'battery', 'cell', 'interior', 'space', 'wherein', 'least', 'one', 'gas', 'tight', 'seal', 'layer', 'arranged', 'first', 'second', 'contact', 'surfaces', 'seal', 'interior', 'space', 'characterized', 'one', 'first', 'second', 'contact', 'surfaces', 'comprises', 'electrically', 'insulating', 'void', 'volume', 'layer', 'first', 'second', 'contact', 'surfaces', 'comprises', 'formable', 'material', 'layer', 'fills', 'voids', 'surface', 'void', 'volume', 'layer', 'hermetically', 'assembled', 'position', 'form', 'seal', 'layer'], tags=['8697278-17'])
>>len(tagged_document) == len(df['claim_txt'])


Error Number 1: No Generators Allowed


def read_corpus_tag_sub(df, corp='claim_txt', tags=['claim_no']):
    for i, line in enumerate(df[corp]):
        yield gensim.models.doc2vec.TaggedDocument(
            gensim.utils.simple_preprocess(line), list(df.loc[i, tags].values))

tagged_document = df.map_partitions(read_corpus_tag_sub, meta=TaggedDocument)
tagged_document = tagged_document.compute()

TypeError: Could not serialize object of type generator.
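
As far as I can tell this is a Python limitation rather than a dask one: generator objects cannot be pickled at all, and dask generally needs to serialize whatever map_partitions produces (at least with the distributed scheduler). A minimal standalone illustration, with nothing dask-specific in it:

import pickle

gen = (x for x in range(3))  # any generator object
try:
    pickle.dumps(gen)  # dask must do the equivalent of this when shipping results
except TypeError as err:
    print(err)  # e.g. "cannot pickle 'generator' object"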

I found no way of getting around this while still using a generator. A fix for this would be great, as this works perfectly fine for regular pandas (see the sketch below).
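
For reference, here is a minimal sketch of the "works fine for regular pandas" case, using a hypothetical one-row dataframe mirroring the columns shown above; pandas never needs to serialize the generator, since it is simply consumed locally:

import gensim
import pandas as pd

# hypothetical one-row sample, just to make the sketch self-contained
pdf = pd.DataFrame({'claim_no': ['8697278-17'],
                    'claim_txt': ['battery comprising interior battery active elements']})

def read_corpus_tag_sub(df, corp='claim_txt', tags=['claim_no']):
    for i, line in enumerate(df[corp]):
        yield gensim.models.doc2vec.TaggedDocument(
            gensim.utils.simple_preprocess(line), list(df.loc[i, tags].values))

tagged_document = list(read_corpus_tag_sub(pdf))  # consumed locally, nothing pickled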

def read_corpus_tag_sub(df, corp='claim_txt', tags=['claim_no']):
    for i, line in enumerate(df[corp]):
        return gensim.models.doc2vec.TaggedDocument(
            gensim.utils.simple_preprocess(line), list(df.loc[i, tags].values))

tagged_document = df.map_partitions(read_corpus_tag_sub, meta=TaggedDocument)
tagged_document = tagged_document.compute()

This one is a bit dumb, as the function won't iterate (I know), but it gives the desired format; however, it only returns the first row in each partition.

def read_corpus_tag_sub(df, corp='claim_txt', tags=['claim_no']):
    tagged_list = []
    for i, line in enumerate(df[corp]):
        tagged = gensim.models.doc2vec.TaggedDocument(
            gensim.utils.simple_preprocess(line), list(df.loc[i, tags].values))
        tagged_list.append(tagged)
    return tagged_list

As near as I can tell, when refactoring the return outside the loop, this function hangs: it builds up memory in the dask client, my CPU utilization goes to 100%, but no tasks are being computed. Keep in mind I'm calling the function the same way.

def tag_corp(corp, tag):
    return gensim.models.doc2vec.TaggedDocument(
        gensim.utils.simple_preprocess(corp), [tag])

tagged_document = [tag_corp(x, y) for x, y in zip(df_smple['claim_txt'], df_smple['claim_no'])]

List comprehension: I haven't time-tested this solution.

tagged_document = list(read_corpus_tag_sub(df))

This solution will chug along for hours. However, I don't have enough memory to juggle the result when it's done.

I feel super lost right now. Here is a list of threads I've looked at. I admit to being really new to dask; I've just spent so much time on this, and I feel like I'm on a fool's errand.


  1. Dask Bag from generator
  2. Processing Text With Dask
  3. Speed up Pandas apply using Dask
  4. How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
  5. python dask DataFrame, support for (trivially parallelizable) row apply?
  6. What is map_partitions doing?
  7. simple dask map_partitions example
  8. The Docs


Recommended Answer

I'm not familiar with the Dask APIs/limitations, but generally:


  • if you can iterate over your data as (words, tags) tuples – even ignoring the Doc2Vec/TaggedDocument steps – then the Dask side will have been handled, and converting those tuples to TaggedDocument instances should be trivial

  • in general for large datasets, you don't want to (and may not have enough RAM to) instantiate the full dataset as a list in memory – so your attempts that involve a list() or .append() may be working, up to a point, but exhausting local memory (causing severe swapping) and/or just not reaching the end of your data.

The preferable approach to large datasets is to create an iterable object that, every time it is asked to iterate over the data (because Doc2Vec training will require multiple passes), can offer up each and every item in turn – but never reading the entire dataset into an in-memory object.

A good blogpost on this pattern is: Data streaming in Python: generators, iterators, iterables

Given the code you've shown, I suspect the right approach for you may be like:

from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument

class MyDataframeCorpus(object):
    def __init__(self, source_df, text_col, tag_col):
        self.source_df = source_df
        self.text_col = text_col
        self.tag_col = tag_col

    def __iter__(self):
        # each call to __iter__ starts a fresh pass over the dataframe,
        # so the corpus can be iterated as many times as training needs
        for i, row in self.source_df.iterrows():
            yield TaggedDocument(words=simple_preprocess(row[self.text_col]),
                                 tags=[row[self.tag_col]])

corpus_for_doc2vec = MyDataframeCorpus(df, 'claim_txt', 'claim_no')
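
Then, assuming the corpus object above, training follows the usual gensim pattern of one vocabulary-building pass plus several training passes; the parameter values below are placeholders, not tuned recommendations:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, min_count=2, epochs=10)  # illustrative values only
model.build_vocab(corpus_for_doc2vec)  # first full pass over the corpus
model.train(corpus_for_doc2vec,
            total_examples=model.corpus_count,
            epochs=model.epochs)  # repeated passes; this is why the iterable must be restartable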
