如何在 pandas 数据框中的多行中搜索文本? [英] How to search for text across multiple rows in a pandas dataframe?
问题描述
所以我对 Python 还是很陌生,我只是想知道是否可以使用它来跨多行搜索文本.这是我的数据框的屏幕截图:
https://i.stack.imgur.com/jeqpv.png
为了更清楚,我想做的是搜索包含多个单词的短语或表达,例如New Jersey",但是,每个单词组成单独的一行,所以我不知道如何去关于在查询中包含多于一行.如果可能,我还想创建一个新列,用M"和不带N"的匹配标记任何匹配项.感谢所有帮助,让我更轻松!
想法是将所有行连接起来,以便能够搜索多个连续单词.
例如,我们要查找短语she want to";在整个数据框中:
>>>df字幕0 她# <- 从这里开始 (1)1 想要#2 到 # <- 到此结束 (1)3 唱4 她# <- 从这里开始 (2)5 想要#6 到 # <- 到此结束 (2)7幕8 她# <- 从这里开始 (3)9 想要#10 到 # <- 到此结束 (3)11 舞
import re搜索 =她想要"文字 = "".join(df["字幕"])# 文本中单词开始/结束位置的索引end = df[字幕"].apply(len).cumsum() + pd.RangeIndex(len(df))开始 = end.shift(fill_value=-1) + 1# 创建额外的列df[开始"] = start.tolist()df[end"] = end.tolist()df[匹配"] = False# 查找搜索文本的所有迭代对于 re.finditer(search, text, re.IGNORECASE) 中的匹配:idx1 = df[df[开始"] == match.start()].index[0]idx2 = df[df[end"] == match.end()].index[0]df.loc[idx1:idx2, 匹配"] = 真
<预><代码>>>>df字幕开始结束匹配0 她 0 3 真的1 想要 4 9 正确2 到 10 12 真3 唱 13 17 假4 她 18 21 真的5 想要 22 27 真6 到 28 30 真7 法案 31 34 错误8 她 35 38 真的9 想要 39 44 真10 到 45 47 真11 跳舞 48 53 假
更新:搜索多个词:
仅更改:
# search = 她想要"search = [她想要"、如果你"、我会"]search = fr"({'|'.join(search)})";
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', '她', '想要', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', '笑', '我', '会', '微笑', '如果', '你', '爱', '我', '会', '微笑']})>>>df字幕开始结束匹配0 她 0 3 真的1 想要 4 9 正确2 到 10 12 真3 唱 13 17 假4 她 18 21 真的5 想要 22 27 真6 到 28 30 真7 法案 31 34 错误8 她 35 38 真的9 想要 39 44 真10 到 45 47 真11 跳舞 48 53 假12 如果 54 56 真13 你 57 60 真14 唱 61 65 假15 I 66 67 正确16 将 68 72 真17 微笑 73 78 假18 如果 79 81 真19 你 82 85 真的20 笑 86 91 假21 I 92 93 正确22 将 94 98 真23 微笑 99 104 假24 如果 105 107 真25 你 108 111 真26 爱 112 116 假27 117 118 真28 将 119 123 真29 微笑 124 129 假
更新 2:文本文件中的术语:
$ cat terms.txt她想如果你我会
search = [term.strip() for term in open("terms.txt").readlines()]search = fr"({'|'.join(search)})";
So I'm quite new to Python, and I was just wondering if it is possible for me to use it in order to search for text across multiple rows. Here is a screenshot of my dataframe:
https://i.stack.imgur.com/jeqpv.png
To make it clearer, what I would like to do is search for phrases or expressions containing more than one word, such as 'New Jersey,' however, each word makes up a separate row so I do not know how to go about including more than one row in the query. I would also, if possible, like to create a new column which will label any matches with 'M' and those without 'N.' All help is appreciated to make this easier for me!
The idea is to join all rows to be able to search multiple continuous words.
For example, we want to find the phrase "she wants to" in whole dataframe:
>>> df
subtitle
0 She # <- start here (1)
1 wants #
2 to # <- end here (1)
3 sing
4 she # <- start here (2)
5 wants #
6 to # <- end here (2)
7 act
8 she # <- start here (3)
9 wants #
10 to # <- end here (3)
11 dance
import re
search = "she wants to"
text = " ".join(df["subtitle"])
# index of start / end position of the word in text
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False
# find all iteration of the search text
for match in re.finditer(search, text, re.IGNORECASE):
idx1 = df[df["start"] == match.start()].index[0]
idx2 = df[df["end"] == match.end()].index[0]
df.loc[idx1:idx2, "match"] = True
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
Update: search for multiple terms:
Change only:
# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
12 If 54 56 True
13 you 57 60 True
14 sing 61 65 False
15 I 66 67 True
16 will 68 72 True
17 smile 73 78 False
18 if 79 81 True
19 you 82 85 True
20 laugh 86 91 False
21 I 92 93 True
22 will 94 98 True
23 smile 99 104 False
24 if 105 107 True
25 you 108 111 True
26 love 112 116 False
27 I 117 118 True
28 will 119 123 True
29 smile 124 129 False
Update 2: terms into text file:
$ cat terms.txt
she wants to
if you
I will
search = [term.strip() for term in open("terms.txt").readlines()]
search = fr"({'|'.join(search)})"
这篇关于如何在 pandas 数据框中的多行中搜索文本?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!