PostgreSQL - Replace HTML Entities
Question
I have just set about the task of stripping HTML entities out of our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like this:
UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the HTML entity with a regex like /&#x(..);/, then passing just the captured %1 part to the ASCII decoder, and reconstructing the string... or something...
Shall I just press on with the queries? There will probably only be 40 or so of them.
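The regex idea described above can be sketched in plain PL/pgSQL. This is only a minimal sketch under some assumptions: the function name decode_hex_entities is hypothetical, the database encoding is assumed to be UTF8 (so that chr() accepts Unicode code points), and only hexadecimal numeric entities of the form &#x...; are handled, not named entities such as &amp;.

```sql
-- Hypothetical helper: decodes hexadecimal numeric entities like &#xe0;
-- Assumes a UTF8 database, where chr(n) returns the character for code point n.
CREATE OR REPLACE FUNCTION decode_hex_entities(input text) RETURNS text AS $$
DECLARE
    result text := input;
    m text[];
BEGIN
    -- Collect each distinct hex entity found in the input (1 to 6 hex digits).
    FOR m IN
        SELECT DISTINCT regexp_matches(input, '&#x([0-9a-fA-F]{1,6});', 'g')
    LOOP
        -- Convert the captured hex digits to an integer, then to a character,
        -- and replace every occurrence of that entity.
        result := replace(
            result,
            '&#x' || m[1] || ';',
            chr(('x' || lpad(m[1], 8, '0'))::bit(32)::int)
        );
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Usage: in a UTF8 database this should return 'café'.
SELECT decode_hex_entities('caf&#xe9;');
```

Note that chr() raises an error for code point 0 and for values that are not valid Unicode code points, so malformed entities in the data would need to be filtered out or handled separately.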
Recommended answer
Write a function using PL/Perl (plperlu) and use this module: https://metacpan.org/pod/HTML::Entities
Of course, you need to have Perl installed and PL/Perl available.
1) First of all, create the procedural language plperlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&.... asasdasdasd …');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)
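To apply this to the table from the question, something like the following might work. It is a sketch rather than a tested migration: it assumes the nodes table and name column from the question, and the regex filter used to find candidate rows is an illustrative guess at what an entity looks like, so it is worth previewing the result before committing.

```sql
BEGIN;

-- Preview a sample of the rows that appear to contain an entity,
-- alongside their decoded form.
SELECT name, decode_html_entities(name) AS decoded
FROM nodes
WHERE name ~ '&#?[A-Za-z0-9]+;'
LIMIT 20;

-- Rewrite only the rows that appear to contain an entity.
UPDATE nodes
SET name = decode_html_entities(name)
WHERE name ~ '&#?[A-Za-z0-9]+;';

-- Use ROLLBACK instead if the preview looked wrong.
COMMIT;
```

A single UPDATE like this would replace the 40 or so per-entity queries, since decode_entities handles named and numeric entities in one pass.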