Drop rows based on specific conditions on strings

Problem description

Given this dataframe (which is a subset of mine):

username  user_message
Polop     I love this picture, which is very beautiful
艺术
Artingo   Es un cuadro preciosa, me recuerda a mi infancia.
Zona      I like it
Soi       Yuck, to say I hate it would be a euphemism
伊雨      NaN

What I'm trying to do is drop rows whose message has fewer than 5 words (tokens), as well as rows that are not written in English. I'm not familiar with pandas, so I came up with a not-so-pretty solution:

import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)

This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is a more efficient, faster method. Do you have any idea how to tackle this issue?

Thanks.

EDIT: So I finally got it to work, thanks to the answer of @ansev. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module. Here is the corresponding code:

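from langdetect import detect
# m1: more than 5 space-separated tokens; m2: whitespace-only messages
m1 = df['user_message'].str.split(' ').str.len() > 5
m2 = df['user_message'].fillna('').str.isspace()
# keep rows that are long enough and not whitespace-only
df_filtered = df.loc[m1 & ~m2].reset_index(drop=True)
# detect the language only on the pre-filtered rows
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)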

I had to compute m3 separately, since detect raises an error if it cannot identify the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it checks whether a cell contains only whitespace, returning True if that is the case.
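
Since detect raises when it cannot identify the text, one option is to wrap the call so that failures map to an empty language code, which would allow the language mask to be built without pre-filtering. The following is only a minimal sketch under assumptions: safe_detect and fake_detect are hypothetical names, and fake_detect is a stand-in that keeps the example self-contained. In practice langdetect's detect would be passed as the detector argument.

```python
def safe_detect(text, detector, min_chars=5):
    """Return detector(text), or '' when the text is unusable or detection fails.

    `detector` is any callable mapping a string to a language code,
    e.g. langdetect.detect (assumed here, not imported).
    """
    # Non-strings (such as NaN) and very short or blank strings are skipped.
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return ''
    try:
        return detector(text)
    except Exception:
        # The detector signals failure with an exception; treat it as "unknown".
        return ''


def fake_detect(text):
    """Hypothetical stand-in detector: raises on text without letters."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return 'en'


messages = ["I like this picture a lot", "     ", "12345 67890", float('nan')]
print([safe_detect(m, fake_detect) for m in messages])
```

With a dataframe, the same helper could presumably be used as df['user_message'].apply(safe_detect, detector=detect).eq('en'), since Series.apply forwards extra keyword arguments to the function.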

Recommended answer

Use:

from textblob import TextBlob
m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language() 
                                          if len(x) >= 3 else '').eq('en') 
m2 = df['user_message'].str.split(' ').str.len() > 5
df_filtered = df.loc[m1 | m2]
print(df_filtered)

  username                                       user_message
0    Polop       I love this picture, which is very beautiful
2  Artingo  Es un cuadro preciosa, me recuerda a mi infancia.
3     Zona                                          I like it
4      Soi        Yuck, to say I hate it would be a euphemism
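
Note that m1 | m2 keeps a row when either condition holds, which is why the short English row "I like it" and the long Spanish row both appear in the output. If the goal is the stricter one stated in the question (at least 5 words and written in English), the masks would instead be combined with &. Below is a hedged sketch of that variant. The lang Series is a hard-coded stand-in for real detector output, and the names n_words and df_strict are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'user_message': [
    'I love this picture, which is very beautiful',
    '   ',
    'Es un cuadro preciosa, me recuerda a mi infancia.',
    'I like it',
    'Yuck, to say I hate it would be a euphemism',
    None,
]})

# Assumed detector output per row (hard-coded stand-in for a real detector).
lang = pd.Series(['en', '', 'es', 'en', 'en', ''], index=df.index)

# split() without arguments splits on any run of whitespace, so blank strings
# yield zero tokens; missing values yield NaN, which fillna turns into 0.
n_words = df['user_message'].str.split().str.len().fillna(0)

# Keep rows that have at least 5 words AND are labelled English.
mask = (n_words >= 5) & lang.eq('en')
df_strict = df.loc[mask].reset_index(drop=True)
print(df_strict['user_message'].tolist())
```

Counting tokens with split() rather than split(' ') means whitespace-only messages no longer need a separate isspace() check in this sketch.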

Check the installation:

No module named textblob
