How to select sub-strings based on the presence of word pairs? Python

Question

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially removing the words that appear before the word pair). Both the sentences and the word pairs are stored in a DataFrame:

   Sentence                                     First2
0  If this is a string what does it say?        can I
1  And this is a string, should it say more?    should it
2  This is yet another string.                  what does
3  etc. etc.                                    etc. etc

The result I want from the above example would be:

0 what does it say?
1 should it say more?
2

The most obvious solution (at least to me), shown below, does not work. It checks only the first word pair b against every sentence r and never tries the other b's.

a = df['Sentence']
b = df['First2']

# The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''

df['Segments'] = a.apply(func)

It seems that looping over two DataFrame columns simultaneously in this way does not work. Is there a more efficient and effective way to do this?

Recommended answer

I believe there is a bug in your code:

else:
    return ''

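Because this else branch sits inside the loop, 'func' returns as soon as the first comparison fails, so only the first word pair is ever checked. That is likely why the code does not return any matches. Moving the return '' to after the loop lets every word pair be tried before giving up.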
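To make the early return concrete, here is a small stand-alone trace. It is only a sketch: plain Python lists stand in for the DataFrame columns, and the names pairs, sentences, func_with_early_return and checked_pairs are made up for this illustration. It records which word pairs are actually compared before the function returns.

```python
# Illustrative stand-ins for df['First2'] and df['Sentence'] (assumed data).
pairs = ['can I', 'should it', 'what does']
sentences = [
    'If this is a string what does it say?',
    'And this is a string, should it say more?',
    'This is yet another string.',
]


def func_with_early_return(sentence, checked):
    """Same control flow as the question's func; logs each pair it compares."""
    for pair in pairs:
        checked.append(pair)
        if pair in sentence:
            return sentence[sentence.index(pair):]
        else:
            # Reached on the first miss, so the loop never advances.
            return ''


checked_pairs = []
results = [func_with_early_return(s, checked_pairs) for s in sentences]

print(results)        # every entry is '' because no sentence contains 'can I'
print(checked_pairs)  # only 'can I' is ever compared, once per sentence
```

Since none of the sample sentences contains 'can I', every call gives up on its first comparison, which matches the behaviour described in the question.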

Sample code:

# Try every word pair; return '' only after none of them matched.
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            # Keep the sentence from the first occurrence of the pair onward.
            return sentence[sentence.index(first_two):]
    return ''

df['Segments'] = a.apply(func)

Output:

df:   
{   
'First2': ['can I', 'should it', 'what does'],   
'Segments': ['what does it say? ', 'should it say more?', ''],   
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string.  '  ]  
} 
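The question also asks for a more efficient approach. One possible alternative, which is not part of the answer above, is to compile all word pairs into a single regular expression and search each sentence once. The sketch below uses only the standard re module and plain lists; build_pattern and extract_segment are hypothetical helper names. It assumes every word pair starts and ends with a word character, so that the \b word-boundary anchors behave as intended. Note two differences from the loop version: it returns the segment starting at the earliest pair in the sentence instead of the first pair in list order, and it does not match a pair that begins in the middle of a word.

```python
import re


def build_pattern(pairs):
    """Compile one alternation regex that matches any pair at word boundaries."""
    # Longer pairs first, so a longer alternative wins over a shorter prefix.
    escaped = [re.escape(p) for p in sorted(pairs, key=len, reverse=True)]
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')


def extract_segment(sentence, pattern):
    """Return the sentence from the earliest matching pair onward, or ''."""
    match = pattern.search(sentence)
    return sentence[match.start():] if match else ''


# Illustrative stand-ins for df['First2'] and df['Sentence'] (assumed data).
pairs = ['can I', 'should it', 'what does']
sentences = [
    'If this is a string what does it say?',
    'And this is a string, should it say more?',
    'This is yet another string.',
]

pattern = build_pattern(pairs)
segments = [extract_segment(s, pattern) for s in sentences]
print(segments)
```

With the DataFrame from the question, the same helper could presumably be applied as df['Sentence'].apply(lambda s: extract_segment(s, pattern)), with the pattern built from the values in df['First2'].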
