Scalable way of logging page request data from a PHP application?

Question

A web application I am developing (in PHP) requires the ability to log each page request.

Just like a normal access_log, it will store details such as the URL requested, source IP address, and date/time, but I also need it to store the user ID of the logged-in user (which is stored in a PHP session variable).

This data will then be queried to create site-wide or per-user analytics reports as required at a later date: things such as total visits/unique visits, page views in a given time period, geolocating the IP addresses to look at visitor locations, most active times of day, most active members, etc.

The obvious thing to do would be to have a MySQL insert statement on each page, but if the application is receiving thousands of requests per second this would become a huge bottleneck on the database, so I am looking at alternative, scalable ways of doing this without big infrastructure requirements.

Some of my ideas are:

1) Work out a way for Nginx to log the user_id from the session/application in the normal web server access_log, which could then be parsed and loaded into a database periodically (nightly). This feels like a bit of a hack and would need doing on each web server as the system scales out.

2) Log each page request into Redis, which has high write speeds - the problem with this is the lack of ability to query the data at a later date.

3) Log each page request into Memcache/Redis acting as a cache (or a message queue), and from there regularly extract the entries, insert them into MySQL, and remove them.

4) Would something like MongoDB, which has more query capability, be suitable?

I'm interested in how you would approach this, and whether anyone has experience with a similar application (or has come across anything online).

I'm also interested in thoughts on how the data could be suitably structured for storage in Memcache/Redis.

Thanks

Answer

It is certainly doable in a variety of ways. I'll address each listed option as well as add some additional commentary.

1) If Nginx can do it, let it. I do this with Apache as well as JBoss and Tomcat, then use syslog-ng to collect the logs centrally and process them from there. For this route I'd suggest a delimited log message format, such as tab-separated, since it is easier to parse and read. I don't know about Nginx logging PHP variables, but it can certainly log header and cookie information. If you are going to use Nginx logging at all, I'd recommend this route if possible - why log twice? (A sketch of the nightly parse-and-import step follows this list.)

2) There is no "lack of ability to query the data at a later date" - more on that below.

3) This is an option, but whether it is useful depends on how long you want to keep the data and how much cleanup code you want to write. More below.

4) MongoDB could certainly work. You will have to write the queries yourself, and they are not simple SQL commands.

Now, on to storing the data in Redis. I currently log things with syslog-ng as noted, and use a program destination to parse the data and stuff it into Redis. In my case I have several grouping criteria, such as by vhost and by cluster, so my structures may be a bit different. The question you need to address first is: "what data do I want out of this data?" Some of it will be counters, such as traffic rates. Some of it will be aggregates, and still more will be things like "order my pages by popularity".

I'll demonstrate some of the techniques for easily getting this into Redis (and thus back out).

First, let us consider the traffic-over-time stats. Start by deciding on the granularity: do you want per-minute stats, or will per-hour stats suffice? Here is one way to track a given URL's traffic:

Store the data in a sorted set using the key "traffic-by-url:URL:YYYY-MM-DD"; in this sorted set you'll use the zincrby command and supply the member "HH:MM". For example, in Python, where r is your Redis connection (shown with the redis-py 3.x argument order, zincrby(name, amount, member)):

r.zincrby("traffic-by-url:/foo.html:2011-05-18", 1, "01:04")

This example increments the counter for the URL "/foo.html" on the 18th of May at 1:04 in the morning.
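
Wrapped up as a small self-contained sketch (the record_hit helper is hypothetical; assumes the redis-py client):

from datetime import datetime, timezone
import redis

r = redis.Redis(decode_responses=True)

def record_hit(url, when=None):
    # Increment the per-minute counter in the URL's daily sorted set.
    when = when or datetime.now(timezone.utc)
    key = f"traffic-by-url:{url}:{when:%Y-%m-%d}"
    r.zincrby(key, 1, f"{when:%H:%M}")

record_hit("/foo.html")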

To retrieve data for a specific day, you can call zrange on the key ("traffic-by-url:URL:YYYY-MM-DD") to get the sorted set ordered from least popular to most popular. To get the top 10, for example, you'd use zrevrange and give it the range. zrevrange returns a reverse sort, with the most-hit members at the top. Several more sorted-set commands are available that let you do nice queries such as pagination, getting a range of results by minimum score, etc.
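
For example, a hedged sketch of that top-10 query (key name follows the scheme above; assumes redis-py):

import redis

r = redis.Redis(decode_responses=True)
key = "traffic-by-url:/foo.html:2011-05-18"
# Ten busiest minute buckets of the day, highest hit count first.
top_minutes = r.zrevrange(key, 0, 9, withscores=True)
# e.g. [('01:04', 37.0), ('01:05', 12.0), ...]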

You can simply alter or extend your key names to handle different temporal windows. By combining this with zunionstore you can automatically roll up to less granular time periods. For example, you could take the union of all of a URL's keys for a week or a month and store the result in a new key such as "traffic-by-url:monthly:URL:YYYY-MM". By doing the above across all URLs in a given day you can get daily stats. Of course, you could also keep a daily total-traffic key and increment that. It mostly depends on when you want the data to be input - offline via logfile import, or as part of the user experience.
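
A sketch of the monthly roll-up (key names follow the scheme above; the naive day loop is just for illustration):

import redis

r = redis.Redis(decode_responses=True)
url = "/foo.html"
daily_keys = [f"traffic-by-url:{url}:2011-05-{day:02d}" for day in range(1, 32)]
# Sum the per-minute scores across the whole month into one key;
# missing daily keys are treated as empty sets.
r.zunionstore(f"traffic-by-url:monthly:{url}:2011-05", daily_keys)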

I'd recommend against doing much of this work during the actual user request, as it extends the time it takes your users to see the page (and adds server load). Ultimately that will be a call based on traffic levels and resources.

As you can imagine, the above storage scheme can be applied to any counter-based stat you want or can think of. For example, change URL to userID and you have per-user tracking.

You could also store the raw logs in Redis. I do this for some logs, storing them as JSON strings (I have them as key-value pairs). Then I have a second process that pulls them out and does things with the data.
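
The exact structure isn't specified above, so as one hedged possibility using a Redis list as the hand-off point (the key name and JSON fields are made up):

import json
import redis

r = redis.Redis(decode_responses=True)

# Producer: append each raw hit as a JSON string.
hit = {"ts": 1305680640, "ip": "203.0.113.9", "user_id": 42, "url": "/foo.html"}
r.rpush("rawlog:hits", json.dumps(hit))

# Second process: drain the list and do things with the data.
while (raw := r.lpop("rawlog:hits")) is not None:
    entry = json.loads(raw)
    # ... insert into MySQL, update counters, etc.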

For storing raw hits you could also use a sorted set with the epoch time as the score, so you can easily grab a temporal window using the zrange/zrevrange commands (or zrangebyscore for an explicit time range). Or store them under a key based on the user ID. Sets would work for this, as would sorted sets.
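
A sketch of the epoch-scored variant (the per-user key name and one-hour window are assumptions):

import json
import time
import redis

r = redis.Redis(decode_responses=True)

now = time.time()
# Score each raw hit by its epoch timestamp.
r.zadd("rawhits:user:42", {json.dumps({"url": "/foo.html", "ts": now}): now})

# Grab the last hour's hits by score, i.e. by time.
recent = r.zrangebyscore("rawhits:user:42", now - 3600, "+inf")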

Another option I've not discussed, but which may be useful for some of your data, is storing it as a hash. This could be useful for storing detailed information about a given session, for example.
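
For instance, a hypothetical per-session hash (the key and field names are illustrative only):

import redis

r = redis.Redis(decode_responses=True)

r.hset("session:abc123", mapping={
    "user_id": 42,
    "ip": "203.0.113.9",
    "last_url": "/foo.html",
})
details = r.hgetall("session:abc123")  # {'user_id': '42', ...}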

If you really want the data in a database, try using Redis' Pub/Sub feature: have a subscriber that parses the messages into a delimited format and dumps them to a file, then have an import process that uses the COPY command (or your database's equivalent) to import in bulk. Your DB will thank you.
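
A minimal sketch of such a subscriber (the channel name, message fields, and output file are assumptions), writing tab-delimited rows ready for a bulk COPY / LOAD DATA import:

import csv
import json
import redis

r = redis.Redis(decode_responses=True)
p = r.pubsub()
p.subscribe("hits")  # hypothetical channel the app publishes to

with open("hits.tsv", "a", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for message in p.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        hit = json.loads(message["data"])
        writer.writerow([hit["ts"], hit["ip"], hit["user_id"], hit["url"]])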

A final bit of advice here (I've probably taken enough of your mental time already) is to make judicious and liberal use of the expire command. With Redis 2.2 or newer you can set expirations even on counter keys. The big advantage here is automatic data cleanup. Imagine you follow a scheme like the one I've outlined above: by using the expire command you can automatically purge old data. Perhaps you want hourly stats for up to 3 months, then only daily stats, and daily stats for 6 months, then only monthly stats. Simply expire your hourly keys after three months (86400*90 seconds) and your daily keys after six (86400*180), and you won't need to do any cleanup.
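
For example, letting Redis purge a daily key after six months (key name follows the scheme above):

import redis

r = redis.Redis(decode_responses=True)

key = "traffic-by-url:/foo.html:2011-05-18"
r.zincrby(key, 1, "01:04")
# No cleanup job needed: Redis drops the key itself after ~6 months.
r.expire(key, 86400 * 180)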

For geotagging I do offline processing of the IPs. Imagine a sorted set with the key structure "traffic-by-ip:YYYY-MM-DD", using the IP as the member; with the zincrby command noted above you get per-IP traffic data. Now, in your report, you can fetch the sorted set and look up each IP. To save repeated lookups when doing the reports, you can set up a hash in Redis that maps each IP to the location you want - for example "geo:country" as the key, the IP as the hash field, and the country code as the stored value.
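
A sketch of both halves (the example IP and country code are made up; the actual geo lookup happens in your offline process):

import redis

r = redis.Redis(decode_responses=True)

# Count today's traffic per IP.
r.zincrby("traffic-by-ip:2011-05-18", 1, "203.0.113.9")

# Offline process caches each IP's resolved location in a hash.
r.hset("geo:country", "203.0.113.9", "GB")

# Report: busiest IPs with their cached country codes.
for ip, hits in r.zrevrange("traffic-by-ip:2011-05-18", 0, 9, withscores=True):
    print(ip, int(hits), r.hget("geo:country", ip) or "unknown")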

A big caveat I would add is that if your traffic level is very high, you may want to run two instances of Redis (or more, depending on traffic). The first would be the write instance; it would not have the bgsave option enabled, because with high enough traffic you would otherwise always be in the middle of a bgsave. That is what I recommend the second instance for: it is a slave of the first and does the saves to disk. You can also run your queries against the slave to distribute load.
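
From the application side that split might look like this (hostnames are placeholders; the replication and bgsave settings themselves live in redis.conf, not in this code):

import redis

# Writes go to the non-persisting master; reports read from the slave.
master = redis.Redis(host="redis-master.example.com", decode_responses=True)
replica = redis.Redis(host="redis-slave.example.com", decode_responses=True)

master.zincrby("traffic-by-url:/foo.html:2011-05-18", 1, "01:04")
top = replica.zrevrange("traffic-by-url:/foo.html:2011-05-18", 0, 9)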

I hope that gives you some ideas and things to try out. Play around with the different options to see what works best for your specific needs. I am tracking a lot of stats on a high-traffic website (and also MTA log stats) in Redis, and it performs beautifully - combined with Django and Google's Visualization API, I get very nice-looking graphs.
