Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?


Problem description

I have an interesting problem that I need help with. I am currently working on a feature of my program and ran into the following issue.

  1. I have a huge list of street names in Indonesia (> 100k rows) stored in a database. Each street name may consist of more than one word. For example, "Sudirman", "Gatot Subroto", and "Jalan Asia Afrika" are all legitimate street names.

  2. I have a bunch of texts (> 1 million rows) in the database, which I split into sentences. The feature (a function, to be exact) that I need to build tests whether a sentence contains a street name, so it is just a true/false test.

I have tried to solve it with these steps:

a. Put the street names into a key/value hash

b. Split each sentence into words

c. Test whether each word is in the hash

This is fast, but it does not work for street names with multiple words.
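To make the limitation concrete, here is a minimal sketch of steps a to c. It assumes a Python set in place of the key/value hash and plain whitespace splitting; the function name `contains_single_word_street` and the sample sentences are made up for illustration.

```python
def contains_single_word_street(sentence, street_names):
    """Return True if any whitespace-separated word is itself a street name."""
    return any(word in street_names for word in sentence.split())


# Hypothetical sample data, using the street names from the question.
streets = {"Sudirman", "Gatot Subroto", "Jalan Asia Afrika"}

# A one-word street name is found by the per-word lookup.
hit_single = contains_single_word_street("The office is on Sudirman", streets)

# A two-word street name is missed: neither "Gatot" nor "Subroto"
# is a key on its own, so the per-word lookup never matches.
hit_multi = contains_single_word_street("Traffic jam on Gatot Subroto", streets)
```

Here `hit_single` is true and `hit_multi` is false, which is exactly the multi-word problem described above.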

Another alternative I thought of is:

a. Split each sentence into words

b. Query the database with a LIKE statement (i.e. SELECT #### FROM street_table WHERE name LIKE '%word%')

c. If the query returns a row, the sentence contains a street name

This solution, however, is going to be very IO intensive.

So my question is: what is the most efficient way to do this test, regardless of programming language? I mainly work in Python, but any language will do as long as I can grasp the concepts.

============ EDIT 1 ============

Will this be periodic?

Yes, I will call this function at 1-minute intervals. Each call will take at least 100 rows of text and test them against the street name database.

Recommended answer

A simple solution would be to create a dictionary/multimap that maps the first word of each street name to the full street name(s) starting with that word. As you iterate over the words in a sentence, you look up each word to get the candidate street names, then check whether any candidate matches by comparing it against the following words.

This algorithm should be fairly easy to implement and should perform pretty well too.
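The idea could be sketched in Python roughly as follows. This is one possible reading of the answer, not a definitive implementation: the function names, the case-insensitive matching, and the `\w+` tokenization (which drops punctuation) are all assumptions added for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def build_first_word_index(street_names):
    """Map the first word of each street name to the token lists of all
    street names that start with that word."""
    index = defaultdict(list)
    for name in street_names:
        words = tokenize(name)
        if words:
            index[words[0]].append(words)
    return index


def contains_street_name(sentence, index):
    """Return True if any indexed street name occurs as a run of
    consecutive words in the sentence."""
    words = tokenize(sentence)
    for i, word in enumerate(words):
        # Only street names starting with this word are candidates.
        for candidate in index.get(word, ()):
            if words[i:i + len(candidate)] == candidate:
                return True
    return False


# Build the index once, then reuse it for every sentence.
street_index = build_first_word_index(
    ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]
)
found = contains_street_name("Traffic jam on Gatot Subroto this morning.", street_index)
not_found = contains_street_name("Gatot went to Asia", street_index)
```

Since the street list is only read, the index can be built once at startup and kept in memory, so each periodic batch of sentences needs one dictionary lookup per word plus a short comparison per candidate, with no database query per word.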
