How to search for text across multiple rows in a pandas dataframe?

Problem description

So I'm quite new to Python, and I was just wondering if it is possible for me to use it in order to search for text across multiple rows. Here is a screenshot of my dataframe:

https://i.stack.imgur.com/jeqpv.png

To make it clearer: I would like to search for phrases or expressions containing more than one word, such as "New Jersey". However, each word sits in its own row, so I do not know how to include more than one row in the query. If possible, I would also like to create a new column that labels matching rows with 'M' and non-matching rows with 'N'. All help is appreciated!

Solution

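The idea is to join all rows into a single string, so that a phrase made of several consecutive words can be searched for, and then map each match back to the rows it covers.

For example, we want to find the phrase "she wants to" in the whole dataframe: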

>>> df
   subtitle
0       She  # <- start here (1)
1     wants  #
2        to  # <- end here (1)
3      sing
4       she  # <- start here (2)
5     wants  #
6        to  # <- end here (2)
7       act
8       she  # <- start here (3)
9     wants  # 
10       to  # <- end here (3)
11    dance

import re

import pandas as pd

search = "she wants to"
text = " ".join(df["subtitle"])

# character offsets of each word's start / end in text
# (cumulative word lengths plus one separating space per preceding row)
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1

# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False

# find every occurrence of the search text and flag the rows it spans
for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True

>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False
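
The question asked for 'M' / 'N' labels rather than booleans. A minimal sketch of that last step, reusing the approach above on a shortened dataframe; the column name `label` is made up for illustration and is not part of the original answer:

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"subtitle": ["She", "wants", "to", "sing", "she", "wants", "to", "act"]})

# Same approach as above: join the rows and record each word's offsets in the text.
text = " ".join(df["subtitle"])
end = df["subtitle"].str.len().cumsum() + np.arange(len(df))
start = end.shift(fill_value=-1) + 1
df["start"] = start
df["end"] = end
df["match"] = False

for m in re.finditer("she wants to", text, re.IGNORECASE):
    idx1 = df.index[df["start"] == m.start()][0]
    idx2 = df.index[df["end"] == m.end()][0]
    df.loc[idx1:idx2, "match"] = True

# Hypothetical extra column: "M" for matched rows, "N" for the rest.
df["label"] = np.where(df["match"], "M", "N")
print(df[["subtitle", "label"]])
```

`np.where` is used here only because it maps a boolean column to two labels in one step; `df["match"].map({True: "M", False: "N"})` should work equally well.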

Update: search for multiple terms:

Only the definition of search changes:

# search = "she wants to"
terms = ["she wants to", "if you", "I will"]
# one alternation pattern; re.escape keeps regex metacharacters in a term literal
search = fr"({'|'.join(map(re.escape, terms))})"

# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False
12       If     54   56   True
13      you     57   60   True
14     sing     61   65  False
15        I     66   67   True
16     will     68   72   True
17    smile     73   78  False
18       if     79   81   True
19      you     82   85   True
20    laugh     86   91  False
21        I     92   93   True
22     will     94   98   True
23    smile     99  104  False
24       if    105  107   True
25      you    108  111   True
26     love    112  116  False
27        I    117  118   True
28     will    119  123   True
29    smile    124  129  False
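
One caveat with the lookup `df[df["start"] == match.start()].index[0]`: it raises an IndexError as soon as a match begins or ends in the middle of a word (for example the term "to" inside "tomorrow"). The sketch below is one possible way around that, not part of the original answer. It assumes every row holds a single word without whitespace and that the list of terms is non-empty; the function name `flag_phrases` and the sample data are hypothetical.

```python
import re

import numpy as np
import pandas as pd


def flag_phrases(words, terms):
    """Return a boolean Series marking rows that belong to any phrase in `terms`.

    Assumes each row holds one whitespace-free word.
    """
    words = words.astype(str)
    text = " ".join(words)

    # Character offsets of every word inside `text` (one space between words).
    lengths = words.str.len().to_numpy()
    end = lengths.cumsum() + np.arange(len(words))
    start = end - lengths

    # Escape each term and require whitespace (or the text boundary) on both
    # sides, so a match always begins and ends on a whole word.
    alternation = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)", re.IGNORECASE)

    flags = np.zeros(len(words), dtype=bool)
    for m in pattern.finditer(text):
        first = np.searchsorted(start, m.start())  # row where the match starts
        last = np.searchsorted(end, m.end())       # row where the match ends
        flags[first:last + 1] = True
    return pd.Series(flags, index=words.index)


df = pd.DataFrame({"subtitle": ["She", "wants", "to", "go", "tomorrow", "C++", "rocks"]})
df["match"] = flag_phrases(df["subtitle"], ["she wants to", "to", "c++ rocks"])
print(df)
```

Here "tomorrow" stays unflagged even though it starts with "to", and "C++ rocks" is matched literally because the term is escaped before it goes into the pattern.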

Update 2: read the terms from a text file:

$ cat terms.txt
she wants to
if you
I will

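# one term per line; skip blank lines, which would add an empty alternative to the pattern
with open("terms.txt") as f:
    search = [term.strip() for term in f if term.strip()]
search = fr"({'|'.join(search)})"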
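
A blank line in the terms file deserves some care: it becomes an empty alternative in the pattern, which matches the empty string at almost every position and can trip up the offset lookups. A small sketch of the difference, using a throwaway file so it runs on its own; the file contents and the variable names `naive` and `safe` are made up for illustration:

```python
import re
import tempfile
from pathlib import Path

# Write a throwaway terms file (note the blank line) so the sketch is self-contained.
terms_path = Path(tempfile.mkdtemp()) / "terms.txt"
terms_path.write_text("she wants to\nif you\n\nI will\n", encoding="utf-8")
lines = terms_path.read_text(encoding="utf-8").splitlines()

# Naive pattern: the blank line survives as an empty alternative.
naive = "({})".format("|".join(line.strip() for line in lines))

# Safer pattern: drop blank lines and escape each term.
terms = [line.strip() for line in lines if line.strip()]
safe = "({})".format("|".join(re.escape(term) for term in terms))

sample = "If you sing I will smile"
naive_hits = [m.group(0) for m in re.finditer(naive, sample, re.IGNORECASE)]
safe_hits = [m.group(0) for m in re.finditer(safe, sample, re.IGNORECASE)]

print("" in naive_hits)  # empty matches show up with the naive pattern
print(safe_hits)
```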
