Drop rows based on specific conditions on strings
Problem description
Given this dataframe (which is a subset of mine):
username | user_message |
---|---|
Polop | I love this picture, which is very beautiful |
Art | Hmm |
Artingo | Es un cuadro preciosa, me recuerda a mi infancia. |
Zona | I like it |
Soi | Yuck, to say I hate it would be a euphemism |
Yiyu | NaN |
What I'm trying to do is drop rows in which the message has fewer than 5 words (tokens) and is not written in English. I'm not familiar with pandas, so I came up with a not-so-pretty solution:
import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)
This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is a faster, more efficient method. Do you have an idea of how to tackle this issue?
Thanks.
EDIT: I finally got it to work, thanks to @ansev's answer. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module instead. Here is the corresponding code:
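# True for messages with more than 5 space-separated tokens
m1 = df['user_message'].str.split(' ').str.len() > 5
# True for cells that contain only whitespace
m2 = df['user_message'].str.isspace()
# Keep rows that pass m1 and are not whitespace-only.
# Note: the unparenthesized `m1 | m2 == False` parses as `(m1 | m2) == False`.
df_filtered = df.loc[m1 & (m2 == False)].reset_index(drop=True)
# The conditional expression needs an else branch; short strings map to ''
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)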
I had to compute m3 separately, since detect raises an error if it cannot identify the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it checks whether a cell contains only whitespace (returning True if that is the case).
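The error handling described above could also be folded into a small helper, so that the language check does not need a separate pass. This is only a minimal sketch, not the code from the post: `safe_detect` and its `detector` and `min_chars` parameters are hypothetical names, and `detector` is assumed to be any callable that returns a language code and may raise on text it cannot identify (for example `detect` from langdetect).

```python
def safe_detect(text, detector, min_chars=5):
    """Return the detected language code, or '' when detection is not possible.

    `detector` is an assumed callable such as langdetect's `detect`; it is
    passed in so that this sketch does not depend on a particular library.
    """
    # Missing values (NaN is a float) and other non-strings are never detected.
    if not isinstance(text, str):
        return ''
    # Whitespace-only or very short strings are skipped up front.
    if len(text.strip()) < min_chars:
        return ''
    try:
        return detector(text)
    except Exception:
        # The detector could not identify the text; treat it as unknown.
        return ''
```

With pandas, this might be used as `df['user_message'].apply(lambda x: safe_detect(x, detect)).eq('en')`, assuming `detect` has been imported from langdetect.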
Recommended answer
Use:
from textblob import TextBlob
m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language()
                                          if len(x) >= 3 else '').eq('en')
m2 = df['user_message'].str.split(' ').str.len() > 5
df_filtered = df.loc[m1 | m2]
print(df_filtered)
username user_message
0 Polop I love this picture, which is very beautiful
2 Artingo Es un cuadro preciosa, me recuerda a mi infancia.
3 Zona I like it
4 Soi Yuck, to say I hate it would be a euphemism
Check the installation.
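One detail of the word-count mask may be worth checking: `split(' ')` treats every single space as a separator, so repeated spaces produce empty tokens, and a whitespace-only string can pass a `> 5` test. The pure-Python sketch below illustrates the difference; `count_tokens` is a hypothetical helper and not part of the answer.

```python
def count_tokens(text):
    """Count whitespace-separated words; non-strings (e.g. NaN) count as 0."""
    if not isinstance(text, str):
        return 0
    # split() with no argument collapses runs of whitespace and drops empties.
    return len(text.split())


blank = ' ' * 6
naive = len(blank.split(' '))  # 7: six separators leave seven empty tokens
robust = count_tokens(blank)   # 0: there are no words at all
```

In pandas, the same behaviour is available through `df['user_message'].str.split().str.len()`, since `str.split` without a pattern splits on runs of whitespace. Note also that dropping messages with fewer than 5 words corresponds to keeping those with a count `>= 5`, whereas `> 5` keeps only messages with at least 6 tokens.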