How to handle new files to process in a cron job


Question

How can I check which files my script has already processed so it doesn't process them again? And/or, what is wrong with the way I am doing this now?

Hello, I am running tshark with the ring buffer option to dump to a new file after 5 MB or 1 hour. I wrote a Python script to read these files as XML and dump them into a database, and this works fine.
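
For reference, a ring-buffer capture like the one described is typically started with something along these lines (the interface, output path, and exact thresholds are assumptions based on the description; tshark's filesize criterion is given in kilobytes):

tshark -i eth0 -b filesize:5000 -b duration:3600 -w /var/ss01/SS01.cap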

My issue is that this is really processor-intensive: one of those 5 MB captures can turn into a 200 MB file when converted to XML, so I do not want to do any unnecessary processing.

The script runs every 10 minutes and processes ~5 files per run. Since it scans the folder where the files are created for new entries, I dump a hash of each file into the database and, on the next run, check the hash; if it isn't in the database, I scan the file. The problem is that this does not appear to work every time: the script ends up processing files that it has already done. When I check the hash of a file that it keeps trying to process, it doesn't show up anywhere in the database, hence it tries to process the file over and over.
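
The script itself isn't reproduced below, but the approach described (hash each capture, skip it if the hash is already recorded) can be sketched roughly as follows; the database path, table name, and use of MD5 are assumptions for illustration:

import hashlib
import os
import sqlite3

conn = sqlite3.connect('/var/lib/capture/processed.db')  # hypothetical path
conn.execute('CREATE TABLE IF NOT EXISTS processed (hash TEXT PRIMARY KEY)')

def file_hash(path):
    # Hash in fixed-size chunks so a large capture does not fill memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()

for folder in ('/var/ss01', '/var/ss02', '/var/ss03', '/var/ss04'):
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        h = file_hash(path)
        if conn.execute('SELECT 1 FROM processed WHERE hash = ?', (h,)).fetchone():
            continue  # this content has already been processed
        print('using file', path, 'with hash:', h)
        # ... convert with tshark, parse the XML, load into the database ...
        conn.execute('INSERT INTO processed (hash) VALUES (?)', (h,))
        conn.commit()

One caveat with any hash-based scheme: if a hash is computed while tshark is still writing to a file, the stored value will not match the finished file, which would reproduce exactly the repeated-processing symptom described here.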

I am printing out the filename + hash in the output of the script:

using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59
using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9
using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc
using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce 
using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c 
using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220
using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56
using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc 
using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853
using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8 
using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e 
using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465 
using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133 
using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46 
using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a 
using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494 
using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243 
using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25 
using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e 
using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08

The only files it should be processing are one per folder, so only 4 files. This causes unnecessary processing, and I have to deal with overlapping cron jobs plus other services being affected.
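
The overlapping-run problem, at least, can be handled separately from the duplicate-file problem. A common pattern is to take a non-blocking lock on a lock file at the top of the script and exit if a previous run still holds it; a minimal sketch (the lock path is hypothetical):

import fcntl
import sys

lock_file = open('/var/run/capture-import.lock', 'w')  # hypothetical path
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(0)  # a previous run is still going; let it finish
# ... the rest of the processing script runs here, holding the lock ...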

What I am hoping to get from this post is a better way to do this, or hopefully someone can tell me why this is happening. I know the latter might be hard, since there can be a bunch of reasons.

Here is the code (I am not a coder but a sysadmin, so be kind :P); lines 30-32 handle the hash comparisons. Thanks in advance.

Answer

A good way to handle/process files that are created at random times is to use incron rather than cron. (Note: since incron uses the Linux kernel's inotify syscalls, this solution only works with Linux.)

Whereas cron runs a job based on dates and times, incron runs a job based on changes in a monitored directory. For example, you can configure incron to run a job every time a new file is created or modified.

On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.

Once you install the incron package, read

man 5 incrontab 

for information on how to set up the incrontab config file. Your incron_config file might look something like this:

/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#

Then to register this config with the incrond daemon, you'd run

incrontab /path/to/incron_config

That's all there is to it. Now whenever a file is created in /var/ss01, /var/ss02, /var/ss03 or /var/ss04, the command

/path/to/processing/script.py $#

is run, with $# replaced by the name of the newly created file.
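
On the receiving side, script.py sees that name as a normal command-line argument; a minimal skeleton might look like this (note that $# expands to the bare file name, so use $@/$# in the incrontab instead if the script needs the full path):

#!/usr/bin/env python
import sys

def main():
    filename = sys.argv[1]  # incron substitutes $# with the new file's name
    print('processing', filename)
    # ... convert the capture to XML and load it into the database ...

if __name__ == '__main__':
    main()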

This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.

Just make sure your processing script does not write into the top level of the monitored directories. If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.

incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor /var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
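
With that layout, the incrontab entries would point at the subdirectories instead, along these lines:

/var/ss01/tobeprocessed IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/tobeprocessed IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/tobeprocessed IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/tobeprocessed IN_CLOSE_WRITE /path/to/processing/script.py $#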

PS. There is also a python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.
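
For completeness, a rough pyinotify equivalent of the incron setup above, watching the four directories without recursion, could look like this:

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # event.pathname is the full path of the file that was just closed
        print('processing', event.pathname)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, Handler())
for folder in ('/var/ss01', '/var/ss02', '/var/ss03', '/var/ss04'):
    wm.add_watch(folder, pyinotify.IN_CLOSE_WRITE, rec=False)
notifier.loop()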
