Compare strings of text between two tables in a database or locally


Question


Edit: SQL doesn't work for this. I just found out about Solr/Sphinx and it seems like the right tool for this problem, so if you know Solr or Sphinx I'm eager to hear from you.

Basically, I have a .tsv with patent info and a .csv with product names. I need to match each row of the patents column against the product names and extract the occurrences in a new .csv column.

You can scroll down and see the example at the end.

Original question:

SQL newbie here so bear with me :). I can't figure out how to do this:

My database:

mysql> SHOW TABLES;
+-----------------------+
| Tables_in_prodpatdb   |
+-----------------------+
| assignee              |
| patents               |
| patent_info           |
| products              |
+-----------------------+
mysql> DESCRIBE patents;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| ...         |             |      |     |         |       |
| patent_id   | varchar(20) | YES  |     | NULL    |       |
| text        | text        | YES  |     | NULL    |       |
| ...         |             |      |     |         |       |
+-------------+-------------+------+-----+---------+-------+
mysql> DESCRIBE products;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| name        | text        | YES  |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

I have to work with the columns name and text, which look like this:

name
product1
product2
product3
...
~10M rows


text
long text description 1
long text description 2
long text description 3
...
~88M rows

I need to check patents.text row 1 and match it against the products.name column to find every product name in that row, then store those product names in a new table. Then check row 2 and repeat.

If a patents.text row contains a product name several times, copy it to the new table only once. If a row has no product names, just skip it. The output should be something like this:

Operation  Product
1          prod5, prod6
2          prod7
...

An example:

name
valve
a/c fan
farmed salmon
...


text
This patent deals with a new approach to air-conditioned fan. With some new valve the a/c fan is so much better. The new valve is great.
This patent has no product names in it.
This patent talks about farmed salmon.
...


Desired output:

Operation   Product
1           valve, a/c fan
2           farmed salmon
...

Solution

The only way I can see to do this with reasonable performance is a full-text search. I've seldom done these myself (maybe 3 times in 20+ years now), so I'll defer to someone else with more experience.

Use https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html as a starting point.

Provided the full text index has been created, it may be something as simple as:

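-- Caveat: MySQL requires the AGAINST() argument to be a constant string,
-- so passing the column p.name is rejected; each name would have to be searched separately.
SELECT pat.patent_id, GROUP_CONCAT(p.name)
FROM patents pat
CROSS JOIN products p
WHERE MATCH (pat.text)
        AGAINST (p.name IN NATURAL LANGUAGE MODE)
GROUP BY pat.patent_id;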

Since every product has to be checked against every patent, we have to cross join, so we now have roughly 880 trillion row combinations (10M x 88M)... That alone is a lot. The more reading I do on this, however, the more I realize we're dealing with unstructured data in an RDBMS. By its nature that's not an ideal fit, and there may be much more optimized methods to handle this outside of an RDBMS. Or we have to spend the time to structure the data in the RDBMS so it can work more effectively with the indexes (such as splitting the text into its own rows per word for indexing).
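
The per-word idea in the last sentence can be sketched outside the database. What follows is a minimal Python illustration, not the answerer's method: it builds an inverted index (word to patent ids), uses it to narrow each product name down to candidate patents, and then confirms that the words of the name appear consecutively. The function names, the tokenizer regex, and the in-memory dictionaries are all assumptions made for illustration; at 88M rows the index would have to live on disk or in a search engine such as Solr or Sphinx.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase alphanumeric runs, keeping forms like "a/c" whole.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:/[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_word_index(patents):
    """Map each word to the set of patent ids whose text contains it."""
    index = defaultdict(set)
    for patent_id, text in patents.items():
        for word in tokenize(text):
            index[word].add(patent_id)
    return index


def contains_phrase(tokens, phrase):
    """Return True if the token list contains the phrase as consecutive tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def match_products(patents, products):
    """Return {patent_id: [product names found in that patent's text]}."""
    index = build_word_index(patents)
    token_cache = {pid: tokenize(text) for pid, text in patents.items()}
    matches = defaultdict(list)
    for name in products:
        phrase = tokenize(name)
        if not phrase:
            continue
        # Candidates are the patents that contain every word of the name.
        candidates = set.intersection(*(index.get(w, set()) for w in phrase))
        for pid in sorted(candidates):
            # Confirm the words are adjacent and in order before recording a match.
            if contains_phrase(token_cache[pid], phrase):
                matches[pid].append(name)
    return dict(matches)


# Sample data taken from the example in the question.
patents = {
    1: "This patent deals with a new approach to air-conditioned fan. "
       "With some new valve the a/c fan is so much better. The new valve is great.",
    2: "This patent has no product names in it.",
    3: "This patent talks about farmed salmon.",
}
products = ["valve", "a/c fan", "farmed salmon"]

result = match_products(patents, products)
print(result)  # {1: ['valve', 'a/c fan'], 3: ['farmed salmon']}
```

Repeated mentions (such as "valve" appearing twice in the first text) yield a single entry, and patents without any match are absent from the result. The keys here are the patent row numbers, not the renumbered "Operation" column shown in the question.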

Lastly, do we really need to look for ALL products? The sheer size of the data involved on both sides means this is going to take time in a database that doesn't handle unstructured data well.

Scratch the below, as it will not be able to handle the load effectively. I'm keeping it here for posterity.

I think concat() and group_concat() may do the trick.

We join where patents.text is like the product name, generating multiple rows. The group_concat then combines these rows into one record. I'm not sure where "Operation" comes from in your result.

SELECT pat.text, GROUP_CONCAT(p.name) AS Product
FROM patents pat
INNER JOIN products p
  ON pat.text LIKE CONCAT('%', p.name, '%')
GROUP BY pat.text;

However, don't expect this to be fast: we're doing a wildcard search with a % on both ends, so no index can be used.
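
To make the cost concrete, here is a rough Python equivalent of what that LIKE join asks the server to do: one substring test for every (patent, product) pair, with nothing to narrow the search. This is only an illustration. The function name and sample data are hypothetical, the result is keyed by patent row number where the query groups by the text itself, and lowercasing both sides assumes a case-insensitive collation.

```python
def like_join(patents, products):
    """Nested-loop stand-in for: JOIN ... ON text LIKE CONCAT('%', name, '%')."""
    rows = {}
    for patent_id, text in patents.items():
        haystack = text.lower()
        # One substring test per (patent, product) pair; no index can help here.
        found = [name for name in products if name.lower() in haystack]
        if found:
            # Joining with commas plays the role of GROUP_CONCAT.
            rows[patent_id] = ",".join(found)
    return rows


# Sample data taken from the example in the question.
sample_patents = {
    1: "This patent deals with a new approach to air-conditioned fan. "
       "With some new valve the a/c fan is so much better. The new valve is great.",
    2: "This patent has no product names in it.",
    3: "This patent talks about farmed salmon.",
}
sample_products = ["valve", "a/c fan", "farmed salmon"]

like_rows = like_join(sample_patents, sample_products)
print(like_rows)  # {1: 'valve,a/c fan', 3: 'farmed salmon'}
```

Because this is pure substring matching, a short name such as "fan" would also match inside a word like "fantastic"; the same is true of the LIKE query.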
