Removing punctuation using spaCy; AttributeError


Question


Currently I'm using the following code to lemmatize and calculate TF-IDF values for some text data using spaCy:

lemma = []

for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844,
                        n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)

df['lemma_col'] = lemma

vect = sklearn.feature_extraction.text.TfidfVectorizer()
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
features = vect.fit_transform(lemmas)

feature_names = vect.get_feature_names()
dense = features.todense()
denselist = dense.tolist()

df = pd.DataFrame(denselist, columns=feature_names)
lemmas = pd.concat([lemmas, df])
df = pd.concat([df, lemmas])


I need to strip out proper nouns, punctuation, and stop words but am having some trouble doing that within my current code. I've read some documentation and other resources, but am now running into an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-e924639f7822> in <module>()
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:

<ipython-input-21-e924639f7822> in <listcomp>(.0)
      7     if doc.is_parsed:
      8         tokens.append([n.text for n in doc])
----> 9         lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct or n.lemma_ != "-PRON-"])
     10         pos.append([n.pos_ for n in doc])
     11     else:

AttributeError: 'str' object has no attribute 'is_punct'


Is there an easier way to strip this stuff out of the text, without having to drastically change my approach?

The full code is available here.

Answer


From what I can see, your main problem here is actually quite simple: n.lemma_ returns a string, not a Token object. So it doesn't have an is_punct attribute. I think what you were looking for here is n.is_punct (whether the token is punctuation).
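A minimal sketch of the corrected filter. It uses a blank English pipeline purely for illustration, since `is_punct` is a lexical attribute that works without a trained model; in your code you would keep your loaded `nlp` and collect `n.lemma_` instead of `n.text`:

```python
import spacy

# A blank pipeline is enough to demonstrate the fix; with a trained
# model (e.g. en_core_web_sm) the same comprehension works on real lemmas.
nlp = spacy.blank("en")
doc = nlp("Hello, world! This is a test.")

# The fix: check is_punct on the Token object `n`, not on the string
# returned by n.lemma_, and combine the conditions with `and` (not `|`).
kept = [n.text for n in doc if not n.is_punct and n.lemma_ != "-PRON-"]
print(kept)  # ['Hello', 'world', 'This', 'is', 'a', 'test']
```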


If you want to do this more elegantly, check out spaCy's new custom processing pipeline components (requires v2.0+). This lets you wrap your logic in a function which is run automatically when you call nlp() on your text. You could even take this one step further, and add a custom attribute to your Doc – for example, doc._.my_stripped_doc or doc._.pd_columns or something. The advantage here is that you can keep using spaCy's performant, built-in data structures like the Doc (and its views Token and Span) as the "single source of truth" of your application. This way, no information is lost and you'll always keep a reference to the original document – which is also very useful for debugging.
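If you go the pipeline-component route, a sketch might look like the following. The component and attribute names here are made up for illustration, and it uses the spaCy v3 `@Language.component` registration; in v2 you would pass the function object directly to `nlp.add_pipe`:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute; any name works.
Doc.set_extension("stripped_lemmas", default=None)

@Language.component("strip_lemmas")
def strip_lemmas(doc):
    # Store the filtered lemmas on the Doc itself, so the original
    # tokens stay available for debugging.
    doc._.stripped_lemmas = [
        t.lemma_ for t in doc
        if not (t.is_punct or t.is_stop or t.pos_ == "PROPN"
                or t.lemma_ == "-PRON-")
    ]
    return doc

nlp = spacy.blank("en")  # a trained pipeline gives real lemmas/POS tags
nlp.add_pipe("strip_lemmas", last=True)

doc = nlp("Hello, world!")
print(doc._.stripped_lemmas)
```

Because the component runs inside `nlp()`, the filtered lemmas are computed in the same pass as tokenization and tagging, and you can feed `doc._.stripped_lemmas` straight into your TF-IDF step.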

