How do I go about creating an efficient content filter for certain posts?

Question

I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on StackOverflow rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.

The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.

Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.

What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.

We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy and at last count our species profiles alone numbered more than 1.7 million words of information.

What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.

In addition, in my WordPress theme's functions.php, I have the following:

# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a 
# pop-up.

    include "filter.php";

# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.

    add_action( 'save_post', 'my_set_content_filter' );
    function my_set_content_filter( $post_id ) {
        if ( !wp_is_post_revision( $post_id ) ) {

            $post_type = get_post_type( $post_id );

            if ( $post_type == "species" || ( $post_type == "post" && in_category( "articles", $post_id ) ) || ( $post_type == "post" && in_category( "blogs", $post_id ) ) ) {
                // Get the previous value; default to an empty array if the option does not exist yet.
                $ids = get_option( 'my_updated_posts', array() );

                // Add the new value if necessary.
                if ( !in_array( $post_id, $ids ) ) {
                    $ids[] = $post_id;
                    update_option( 'my_updated_posts', $ids );
                }
            }
        }
    }

# ==============================================================================================
# Add the filter to WP_Cron.

    add_action( 'my_filter_posts_content', 'my_filter_content' );
    if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
        wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
    }

# ==============================================================================================
# Run the filter.

    function my_filter_content() {
        //check to see if posts need to be parsed
        if ( !get_option( 'my_updated_posts' ) )
            return false;

        //parse posts
        $ids = get_option( 'my_updated_posts' );

        update_option( 'error_check', $ids );

        foreach( $ids as $v ) {
            if ( get_post_status( $v ) == 'publish' )
                run_filter( $v );

            update_option( 'error_check', "filter has run at least once" );
        }

        //make sure no values have been added while loop was running
        $id_recheck = get_option( 'my_updated_posts' );
        my_close_out_filter( $ids, $id_recheck );

        //once all options, including any added during the running of what could be a long cronjob are done, remove the value and close out
        delete_option( 'my_updated_posts' );
        update_option( 'error_check', 'working m8' );
        return true;
    }

# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.

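    function my_close_out_filter( $beginning_array, $end_array ) {
        // get_option() returns false when the option is missing, so treat that as "no IDs".
        if ( !is_array( $end_array ) ) {
            return;
        }
        // IDs present at the end but not at the start were added while the filter was running.
        $diff = array_diff( $end_array, $beginning_array );
        // Nothing new: stop here. Without this exit the function recurses forever.
        if ( empty( $diff ) ) {
            return;
        }
        foreach ( $diff as $v ) {
            run_filter( $v );
        }
        // Check again in case more posts were saved during this pass.
        my_close_out_filter( $end_array, get_option( 'my_updated_posts' ) );
    }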

The way this works, as (hopefully) described by the code's comments, is that each hour WordPress operates a cron job (which is like a false cron - works upon user hits, but that doesn't really matter as the timing isn't important) which runs the filter found above.

The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.

The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe that the problem lies with the filter itself, but with one of the functions that enables the filter - i.e. the cron job, or the function that chooses which posts are filtered, or the function which prepares the wordlists etc. for the filter.

Unfortunately, diagnosing the problem is quite difficult (that I can see), thanks to it running in the background and only on an hourly basis. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.

We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.

What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.

Is a Cron Job the answer? I can set up a .php file which runs every day, that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server at the time it ran?

Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?

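Considerations

  • The size of the database/information being affected/read/written
  • Which posts are to be filtered
  • The impact of the filter on the server, especially considering I don't seem to be able to increase the WordPress memory limit past 32MB
  • Is the actual filter itself efficient, effective and reliable?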

This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.

Thanks in advance,

Recommended answer

Do it when the profile is created.

Try reversing the whole process. Rather than checking the content for each of your words, check the content's words against your word lists.

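  1. Break the content of the post into words (on spaces).
  2. Eliminate duplicates, words shorter than the shortest word in the database, words longer than the longest, and words in a "common words" list that you keep.
  3. Check the remaining words against each table. If some of your tables include phrases with spaces, do a %word% (LIKE) search; otherwise do a straight match (much faster), or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow; no sense reinventing the wheel.)
  4. Create your links from the now dramatically smaller list.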
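
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: `find_candidate_terms()` is a made-up name, the term list is a plain PHP array of lowercase terms mapped to URLs (in practice it would be loaded from the glossary and species tables and cached), and words are split on ASCII characters only.

```php
<?php
// Sketch only: find_candidate_terms() and the sample data are hypothetical.
function find_candidate_terms(string $content, array $terms, array $stopWords): array
{
    // Index terms by their first word; track the shortest and longest first word.
    $byFirstWord = [];
    $minLen = PHP_INT_MAX;
    $maxLen = 0;
    foreach ($terms as $term => $url) {
        $first = explode(' ', (string) $term, 2)[0];
        $byFirstWord[$first][$term] = $url;
        $minLen = min($minLen, strlen($first));
        $maxLen = max($maxLen, strlen($first));
    }

    // Step 1: break the content into lowercase words (ASCII-only split, an assumption).
    $text  = strtolower(strip_tags($content));
    $words = preg_split("/[^a-z0-9']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    // Step 2: drop duplicates, out-of-range lengths, and common words.
    $unique = [];
    foreach ($words as $word) {
        $len = strlen($word);
        if ($len >= $minLen && $len <= $maxLen && !isset($stopWords[$word])) {
            $unique[$word] = true;
        }
    }

    // Step 3: one hash lookup per word; phrases get a substring check against the text.
    $found = [];
    foreach (array_keys($unique) as $word) {
        $word = (string) $word;
        foreach ($byFirstWord[$word] ?? [] as $term => $url) {
            if ((string) $term === $word || strpos($text, (string) $term) !== false) {
                $found[$term] = $url;
            }
        }
    }

    return $found; // Step 4 builds links from this much smaller list.
}

// Made-up sample data.
$terms = [
    'clown loach' => '/species/clown-loach',
    'substrate'   => '/glossary/substrate',
    'tetra'       => '/glossary/tetra',
];
$stopWords = array_flip(['the', 'a', 'and', 'in', 'on']);
$content   = '<p>The Clown Loach likes to dig in a soft substrate.</p>';

$found = find_candidate_terms($content, $terms, $stopWords);
```

Indexing phrases by their first word keeps the per-word check a plain array lookup, so only the handful of phrases that share a first word with the content ever need the slower substring test. One limitation of this sketch: a phrase whose first word is in the common-words list would never be found.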

You should be able to easily keep this under 1 second even as you move out to even 100,000 words you are checking against. I've done exactly this, without caching the word lists, for a Bayesian Filter before.

With the smaller list, even if the matching is greedy and gathers words that only partially match ("clown" will catch "clown loach"), the result should be only a few to a few dozen words with links. A find and replace over a chunk of text with a list that small takes no time at all.
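
One possible way to do that find and replace is a single regular-expression pass, sketched below. `link_terms()` is a hypothetical helper, and it assumes the keys of `$found` are lowercase terms and the values are URLs. Sorting the terms longest first lets "clown loach" win over a bare "clown", and the first two alternatives in the pattern skip over existing links and HTML tags so nothing gets linked twice.

```php
<?php
// Sketch only: link_terms() is a hypothetical helper.
// $found maps lowercase terms to URLs.
function link_terms(string $html, array $found): string
{
    if (!$found) {
        return $html;
    }

    // Longest terms first, so a phrase is preferred over a shorter term it contains.
    $terms = array_map('strval', array_keys($found));
    usort($terms, fn($a, $b) => strlen($b) <=> strlen($a));

    $alternation = implode('|', array_map(fn($t) => preg_quote($t, '/'), $terms));

    // Alternatives 1 and 2 consume existing links and tags; only alternative 3 captures a term.
    $pattern = '/<a\b[^>]*>.*?<\/a>|<[^>]+>|\b(' . $alternation . ')\b/is';

    $result = preg_replace_callback($pattern, function (array $m) use ($found) {
        if (!isset($m[1]) || $m[1] === '') {
            return $m[0]; // An existing link or tag: leave it untouched.
        }
        $url = $found[strtolower($m[1])];
        // Keep the original casing of the matched text inside the new link.
        return '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">' . $m[1] . '</a>';
    }, $html);

    return $result ?? $html; // preg_replace_callback() returns null on a regex error.
}

// Made-up sample data.
$found = [
    'clown loach' => '/species/clown-loach',
    'clown'       => '/glossary/clown',
    'substrate'   => '/glossary/substrate',
];
$html   = '<p>The Clown Loach and the clown <a href="/x">substrate</a> in substrate.</p>';
$linked = link_terms($html, $found);
```

Because everything happens in one pass, the anchors inserted by the callback are never rescanned, which avoids links nested inside links.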

The above doesn't really address your concern over the older profiles. You don't say exactly how many there are, just that there is a lot of text and that it is spread over somewhere between 1400 and 3100 items (profiles and glossary entries put together). You could process this older content in order of popularity, if you have that information, or by date entered, newest first. Regardless, the best way to do this is to write a script that suspends PHP's time limit and just batch-runs a load/process/save on all the posts. If each one takes about 1 second (probably much less, but worst case), you are talking 3100 seconds, which is a little less than an hour.
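
A minimal sketch of such a one-off script is shown below. `run_backlog()` and the three callables are hypothetical stand-ins for the site's own load, filter and save code (in WordPress these would presumably wrap something like `get_post()` and `wp_update_post()`); the in-memory `$store` array only exists to make the example runnable on its own.

```php
<?php
// Sketch only: run_backlog() and its callables are hypothetical placeholders.
function run_backlog(array $ids, callable $load, callable $process, callable $save): int
{
    // Lift PHP's execution time limit for this run (some hosts disable the function).
    if (function_exists('set_time_limit')) {
        set_time_limit(0);
    }

    $changed = 0;
    foreach ($ids as $id) {
        $original = $load($id);
        $filtered = $process($original);

        // Only write back when the filter actually changed something.
        if ($filtered !== $original) {
            $save($id, $filtered);
            $changed++;
        }
    }

    return $changed;
}

// Stand-in "database" so the sketch runs without WordPress.
$store = [
    1 => 'A clown loach.',
    2 => 'No terms here.',
    3 => 'Another clown loach.',
];

$changed = run_backlog(
    array_keys($store),
    function ($id) use (&$store) { return $store[$id]; },
    function ($text) { return str_replace('clown loach', '[clown loach]', $text); },
    function ($id, $text) use (&$store) { $store[$id] = $text; }
);
```

Loading one post at a time rather than all of them up front keeps memory use flat, which matters with a 32MB limit. Skipping the save when nothing changed also makes the script safe to run again if it is interrupted partway through.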
