How to select sub-strings based on the presence of word pairs? Python

Views: 61 Posted: 2020/10/17 1:33:28 Tags: python dataframe


Problem description

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially dropping every word that appears before the word pair). Both the sentences and the word pairs are stored in a DataFrame:

   Sentence                                    First2
0  If this is a string what does it say?      can I
1  And this is a string, should it say more?  should it
2  This is yet another string.                what does
3  etc. etc.                                  etc. etc

The result I want from the above example would be:

0 what does it say?
1 should it say more?
2

The most obvious solution (at least to me), shown below, does not work. It only checks the first word pair b against every sentence r, and never tries the other b's.

a = df['Sentence']
b = df['First2'] 

# The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''

df['Segments'] = a.apply(func)

It seems that looping over two DataFrame columns simultaneously in this way does not work. Is there a more efficient and effective way to do this?

Recommended answer

I believe there is a bug in your code.

else:
    return ''

This means that if the first comparison is not a match, 'func' returns immediately, so the remaining word pairs are never checked. That is likely why the code does not return any matches.
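To make the early return concrete, here is a minimal sketch that reproduces the behaviour without pandas. The lists `sentences` and `word_pairs` and the name `buggy_func` are hypothetical stand-ins for the 'Sentence' and 'First2' columns and the original function; they are not part of the question's code.

```python
# Hypothetical stand-ins for the 'Sentence' and 'First2' columns.
sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
word_pairs = ["can I", "should it", "what does"]


def buggy_func(sentence, pairs=word_pairs):
    for pair in pairs:
        if pair in sentence:
            return sentence[sentence.index(pair):]
        else:
            # Reached as soon as the first pair fails to match, so the
            # loop never gets to the second or third pair.
            return ''


results = [buggy_func(s) for s in sentences]
print(results)  # ['', '', '']
```

Because "can I" does not occur in any of the sample sentences, every call returns the empty string after a single comparison.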

Sample code with the fix:

# Try every word pair; return '' only after none of them matched:
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    return ''

df['Segments'] = a.apply(func)

Output:

df:   
{   
'First2': ['can I', 'should it', 'what does'],   
'Segments': ['what does it say? ', 'should it say more?', ''],   
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string.  '  ]  
}
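As for the question of a more efficient approach, one possible alternative (not part of the answer above) is to combine all word pairs into a single regular expression and search each sentence once. The sketch below uses only the standard `re` module; the helper names `build_pattern` and `segment_from_first_pair` are hypothetical, and the list of pairs is assumed to be non-empty. Note one behavioural difference: the regex returns the segment starting at the leftmost pair in the sentence, whereas the loop above returns the first pair in list order that matches.

```python
import re


def build_pattern(pairs):
    # Escape each pair so punctuation is matched literally, and put
    # longer pairs first so a shorter pair cannot shadow a longer one.
    ordered = sorted(pairs, key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))


def segment_from_first_pair(sentence, pattern):
    # Slice from the start of the leftmost matching pair, or return ''.
    match = pattern.search(sentence)
    return sentence[match.start():] if match else ''


pairs = ["can I", "should it", "what does"]
pattern = build_pattern(pairs)

sentences = [
    "If this is a string what does it say?",
    "And this is a string, should it say more?",
    "This is yet another string.",
]
segments = [segment_from_first_pair(s, pattern) for s in sentences]
print(segments)  # ['what does it say?', 'should it say more?', '']
```

With the DataFrame from the question, the same helper could presumably be applied as `a.apply(lambda s: segment_from_first_pair(s, pattern))`, with `pattern` built from `b`.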
