Drop rows based on specific conditions on strings

Views: 62 | Published: 2021/6/13 20:57:48 | Tags: python, pandas

Problem description

Given this dataframe (which is a subset of mine):

username	user_message
波罗普	I love this picture, it's beautiful
艺术	Hmm
Artingo	Es un cuadro preciosa, me recuerda a mi infancia.
区域	Like
Soi	Ugh, saying I hate it would be an understatement
伊雨	NaN

What I'm trying to do is drop the rows whose message contains fewer than 5 words (tokens), as well as the rows that are not written in English. I'm not familiar with pandas, so I came up with a not-so-pretty solution:

import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)

This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is another, more efficient method that is also faster. Do you have any idea how to tackle this issue?

Thanks.
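Note: reading the loop above, a few details seem likely to explain the leftover rows. The `continue` in the NaN branch skips `index = index + 1`, so every later position recorded in `index_list` is off by one. The `count < 5` test counts whitespace characters rather than words, so it requires six words instead of five. The bare `except` silently keeps any row for which detection fails. The following is only a sketch of the same loop with those points addressed, not a definitive fix. `positions_to_drop` and `fake_detector` are hypothetical names introduced for illustration; the detector is passed in as a parameter so that `langdetect.detect` (or any other callable) can be used, and dropping rows on a detection failure is an assumption, not something the original code does.

```python
def positions_to_drop(messages, detector, min_words=5):
    """Return the positions of messages that are missing, too short, or not English."""
    drop = []
    # enumerate keeps the position in sync with the message,
    # even when a branch skips ahead with `continue`
    for position, message in enumerate(messages):
        if not isinstance(message, str):        # NaN / None are not strings
            drop.append(position)
            continue
        if len(message.split()) < min_words:    # also covers empty and whitespace-only text
            drop.append(position)
            continue
        try:
            language = detector(message)
        except Exception:                       # assumption: a failed detection means "drop"
            drop.append(position)
            continue
        if language != "en":
            drop.append(position)
    return drop


def fake_detector(text):
    # Hypothetical stand-in for a real language detector such as langdetect.detect
    return "es" if "cuadro" in text else "en"


messages = [
    "I love this picture, it is beautiful",
    "Hmm",
    "Es un cuadro preciosa, me recuerda a mi infancia.",
    float("nan"),
    "   ",
    "Saying that I hate it is an understatement",
]
print(positions_to_drop(messages, fake_detector))  # [1, 2, 3, 4]
```

With a dataframe whose index is the default 0..n-1 range, the returned list could then be passed to `df.drop(...)` as in the original code.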

EDIT: I finally got it to work, thanks to @ansev's answer. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module instead. Here is the corresponding code:

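from langdetect import detect

# m1: messages with at least 5 whitespace-separated words (NaN compares as False)
m1 = df['user_message'].str.split().str.len() >= 5
# m2: messages made up only of whitespace (NaN is treated as empty, hence False)
m2 = df['user_message'].fillna('').str.isspace()
df_filtered = df.loc[m1 & ~m2].reset_index(drop=True)
# m3: detected language is English; strings shorter than 5 characters are skipped
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else None).eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)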

I had to compute m3 separately because detect raises an error if it cannot identify the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it checks whether a cell contains only whitespace (returning True if that is the case).
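One way to avoid the separate pass might be to wrap the detector so that a failure yields `None` instead of raising, and to run it only on rows that already passed the length check. The sketch below assumes this approach; `safe_detect`, `keep_english`, and `fake_detector` are hypothetical helper names, the sample rows are made up for illustration, and the detector is injected so that `langdetect.detect` could be passed in its place.

```python
import pandas as pd


def safe_detect(text, detector):
    """Return the detected language code, or None if detection fails."""
    try:
        return detector(text)
    except Exception:
        return None


def keep_english(df, detector, column="user_message", min_words=5):
    """Keep rows whose message has at least `min_words` words and is detected as English."""
    text = df[column].fillna("").astype(str)
    long_enough = text.str.split().str.len() >= min_words
    # Only run the detector on rows that passed the length check
    languages = text[long_enough].apply(lambda t: safe_detect(t, detector))
    # Rows that were never checked count as "not English"
    is_english = languages.eq("en").reindex(df.index, fill_value=False)
    return df.loc[long_enough & is_english].reset_index(drop=True)


def fake_detector(text):
    # Hypothetical stand-in for a real detector such as langdetect.detect
    return "es" if "cuadro" in text else "en"


# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "username": ["user1", "user2", "user3", "user4", "user5"],
    "user_message": [
        "I love this picture, it is beautiful",
        "Hmm",
        "Es un cuadro preciosa, me recuerda a mi infancia.",
        "   ",
        None,
    ],
})
result = keep_english(df, fake_detector)
print(result["username"].tolist())  # ['user1']
```

Because whitespace-only and missing messages contain zero words, the length check already removes them, so a separate whitespace mask is not needed in this version.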
