Python: how to sort a list of strings by substring relevance?

Question

I have a list of strings, for example:

["foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT",
"SOME foo TEXT",
"foo bar SOME TEXT baz",     
"SOME TEXT"]

I want it sorted by how exactly each string matches the substring SOME TEXT (case doesn't matter). Something like this order:

["SOME TEXT",
"foo bar SOME TEXT baz",
"SOME foo TEXT",
"foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT"]

The idea is that the best score goes to the string whose words best match the positions of the substring's words. The more "sloppy" words there are between the substring's words, the lower the string ranks.
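One way to make this idea concrete, as a minimal sketch rather than a definitive method: for each string, find the tightest window that contains the substring's words in order, and count the extra words inside it. The helper names `gap_count` and `relevance_key` are hypothetical, and breaking ties by total word count is an assumption chosen to reproduce the order shown above.

```python
def gap_count(text, query):
    """Count the extra words inside the tightest window of `text` that
    contains the words of `query` in order (case-insensitive).
    Returns None if the words do not all occur in that order."""
    words = text.lower().split()
    terms = query.lower().split()
    if not terms:
        return None
    best = None
    for start, word in enumerate(words):
        if word != terms[0]:
            continue
        pos = start
        try:
            # Greedily take the earliest occurrence of each remaining term.
            for term in terms[1:]:
                pos = words.index(term, pos + 1)
        except ValueError:
            continue  # the remaining terms do not follow this start
        gaps = (pos - start + 1) - len(terms)
        if best is None or gaps < best:
            best = gaps
    return best


def relevance_key(text, query):
    """Sort key: matching strings first, then fewer gaps, then fewer words."""
    gaps = gap_count(text, query)
    n_words = len(text.split())
    if gaps is None:
        return (1, 0, n_words)  # assumption: non-matching strings sort last
    return (0, gaps, n_words)


strings = [
    "foo bar SOME baz TEXT bob",
    "SOME foo bar baz bob TEXT",
    "SOME foo TEXT",
    "foo bar SOME TEXT baz",
    "SOME TEXT",
]
ranked = sorted(strings, key=lambda s: relevance_key(s, "SOME TEXT"))
print(ranked)
```

Strings that do not contain all of the substring's words in order are placed at the end; whether they should be dropped instead depends on the use case.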

I have found libraries such as fuzzyset, and techniques such as Levenshtein distance, but I'm not sure they are what I need. I know the exact substring I want to sort by, and as far as I understand, those libraries search for similar words.

I actually need to do this sort after a database query (PostgreSQL) in my Django project. I have already tried full-text search through the ORM, but it didn't give me this relevance order (it doesn't take the distance between the substring's words into account). Next I tried Haystack + Whoosh, but so far I haven't found out how to do this sort there either. So the idea now is to fetch the queryset and then sort it outside the database (yes, I know that might be a bad decision, but for now I just want it to work). If anybody can tell me how to do this within any of the technologies mentioned here, that would also be great. Thank you!

P.S. The substring is expected to be 2-10 words long, in strings of at most 20 words.

Recommended answer

You can use difflib.SequenceMatcher to achieve something very similar to your desired output:

>>> import difflib
>>> l = ["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]
>>> sorted(l, key=lambda z: difflib.SequenceMatcher(None, z, "SOME TEXT").ratio(), reverse=True)
['SOME TEXT', 'SOME foo TEXT', 'foo bar SOME TEXT baz', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']

In case it isn't obvious, the only difference is that the two elements "foo bar SOME TEXT baz" and "SOME foo TEXT" are swapped compared to your desired output.
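The swap follows from how `ratio()` is defined: it returns 2.0*M/T, where M is the number of matching characters and T is the combined length of both strings. For this particular list, every string contains all nine characters of "SOME TEXT" as matching blocks, so shorter strings score higher regardless of where the extra words sit. The sketch below prints the scores to make that visible. Lowercasing both sides is an addition here (the snippet above compares case-sensitively), and the helper name `ratio` is hypothetical.

```python
import difflib


def ratio(text, query):
    # Lowercase both sides so that case does not affect the score.
    return difflib.SequenceMatcher(None, text.lower(), query.lower()).ratio()


strings = [
    "foo bar SOME baz TEXT bob",
    "SOME foo bar baz bob TEXT",
    "SOME foo TEXT",
    "foo bar SOME TEXT baz",
    "SOME TEXT",
]
query = "SOME TEXT"

# Highest similarity first; ties keep their original relative order.
ranked = sorted(strings, key=lambda s: ratio(s, query), reverse=True)
for s in ranked:
    print(f"{ratio(s, query):.3f}  {s}")
```

Because the score depends on total length rather than on the gap between the words, this approach may not fit cases where the distance between the substring's words has to dominate the ranking.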
