Pulling an XML feed and detecting changes/deletions in PHP

Question

I want to set up an XML feed polling system that downloads an XML feed from a given URL every hour and detects whether the feed has changed. If it has, it needs to do a few things.

How can I efficiently accomplish this? The feed I would be pulling would have thousands of items inside and every item may have quite a bit of data in it.

I want to be able to detect any new data/item and save it to a database.
I want to be able to detect any modified data/item and update the database accordingly.
I want to be able to detect any deleted data/item and update the database accordingly.

The order of items doesn't matter to me, so if the order changes but nothing else does, then we can say the feeds are identical.

I've seen a few people mention hashing the items and the whole feed to compare against the previously downloaded one. If there are many items, this could take a long time.

Would there be an easy way to do a diff on the last downloaded feed and new one to then somehow remove all identical items? And maybe then go through the items that are left and do the comparison?

I'm not sure what the right approach would be. Any suggestions would be greatly appreciated.

An example of a feed similar to the one I would be pulling:

<properties>
    <property>
        <location>
            <unit-number>301</unit-number>
            <street-address>123 Main St</street-address>
            <city-name>San Francisco</city-name>
            <zipcode>94123</zipcode>
            <county>San Francisco</county>
            <state-code>California</state-code>
            <street-intersection>Broadway</street-intersection>
            <parcel-id>359-02-4158</parcel-id>
            <building-name>The Avalon</building-name>
            <subdivision></subdivision>
            <neighborhood-name>Marina</neighborhood-name>
            <neighborhood-description>The Marina is a neighborhood on the Northern part of San Francisco</neighborhood-description>
            <elevation>10</elevation>
            <longitude>-70.1200</longitude>
            <latitude>30.0000</latitude>
            <geocode-type>exact</geocode-type>
            <display-address>yes</display-address>
            <directions>Take 101 North to Lombard St. Make a left on Lombard and 3rd right onto Main. 123 is at the end of the block on the right.</directions>
        </location>
        <details>
            <listing-title>A great deal in the Marina</listing-title>
            <price>725000</price>
            <year-built>1928</year-built>
            <num-bedrooms>3</num-bedrooms>
            <num-full-bathrooms>2</num-full-bathrooms>
            <num-half-bathrooms>1</num-half-bathrooms>
            <num-bathrooms></num-bathrooms>
            <lot-size>0.25</lot-size>
            <living-area-square-feet>1720</living-area-square-feet>
            <date-listed>2010-06-20</date-listed>
            <date-available></date-available>
            <date-sold></date-sold>
            <sale-price></sale-price>
            <property-type>condo</property-type>
            <description>Newly remodeled condo in great location.</description>
            <mlsId>582649</mlsId>
            <mlsName>SFAR</mlsName>
            <provider-listingid>258136842</provider-listingid>
        </details>
        <landing-page>
            <lp-url>http://www.BrokerRealty.com/listing?id=123456&amp;source=Trulia</lp-url>
        </landing-page>
        <listing-type>resale</listing-type>
        <status>for sale</status>
        <foreclosure-status></foreclosure-status>
        <site>
            <site-url>http://www.BrokerRealty.com</site-url>
            <site-name>Broker Realty</site-name>
        </site>

etc.

Recommended answer

"Would there be an easy way to do a diff on the last downloaded feed and new one to then somehow remove all identical items?"

Sure, in fact it should be pretty easy. It looks like these are real estate listings, right? If so, the name of the MLS provider and the identifier that they issue for the listing together form a unique key:

<details>
    <!-- ... -->
    <mlsId>582649</mlsId>
    <mlsName>SFAR</mlsName>
    <provider-listingid>258136842</provider-listingid>
</details>
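
As a rough illustration of pulling that key out of each listing, here is a minimal sketch. It is written in Python for brevity, although the question is about PHP, where SimpleXML could fill the same role. The `listing_key` helper and the trimmed-down `SAMPLE` document are hypothetical, and treating `mlsName` plus `mlsId` as the key is the assumption made above, not something the feed format guarantees.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape of the feed from the question.
SAMPLE = """<properties>
  <property>
    <details>
      <mlsId>582649</mlsId>
      <mlsName>SFAR</mlsName>
      <provider-listingid>258136842</provider-listingid>
    </details>
  </property>
</properties>"""


def listing_key(prop):
    """Return the (mlsName, mlsId) pair for one <property> element."""
    details = prop.find("details")
    if details is None:
        raise ValueError("property has no <details> element")
    name = (details.findtext("mlsName") or "").strip()
    mls_id = (details.findtext("mlsId") or "").strip()
    if not name or not mls_id:
        # Without both parts the listing cannot be identified reliably.
        raise ValueError("listing is missing mlsName or mlsId")
    return (name, mls_id)


root = ET.fromstring(SAMPLE)
keys = [listing_key(p) for p in root.iter("property")]
print(keys)  # [('SFAR', '582649')]
```

Rejecting listings that lack either part of the key is a design choice in this sketch; a feed that sometimes omits `mlsId` might need `provider-listingid` as a fallback instead.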

Now that you can uniquely identify each listing, it should be pretty trivial to decide how you will detect changes. I'd personally mangle the XML into a multidimensional associative array, sort every level by key name, then serialize it and run it through a hash routine (say, md5), for that oh-so-attractive sloppy-but-it-works effect. In fact, you already had that idea, kind of:

"I've seen a few people mention hashing the items and the whole feed to compare to the previous downloaded one. If there are many items, this could potentially take long."
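
The "array, sort by key, serialize, hash" recipe could look roughly like the sketch below. It is in Python rather than PHP, and it swaps in JSON with sorted keys where PHP would use `ksort` and `serialize`; `to_nested` and `listing_hash` are hypothetical helper names. It assumes attributes carry no meaning in this feed (they are ignored) and that whitespace differences inside text should not count as changes.

```python
import hashlib
import json
import xml.etree.ElementTree as ET


def to_nested(elem):
    """Convert an element into nested dicts; leaf elements become their text."""
    children = list(elem)
    if not children:
        # Collapse runs of whitespace so re-wrapped text hashes the same.
        return " ".join((elem.text or "").split())
    nested = {}
    for child in children:
        # Repeated tags are collected into a list so nothing is overwritten.
        nested.setdefault(child.tag, []).append(to_nested(child))
    return nested


def listing_hash(elem):
    """Hash a canonical serialization that ignores the order of distinct tags."""
    canonical = json.dumps(to_nested(elem), sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


a = ET.fromstring("<property><price>725000</price><status>for sale</status></property>")
b = ET.fromstring("<property><status>for sale</status>\n  <price>725000</price></property>")
c = ET.fromstring("<property><price>699000</price><status>for sale</status></property>")

print(listing_hash(a) == listing_hash(b))  # True: same data, different element order
print(listing_hash(a) == listing_hash(c))  # False: the price changed
```

Note that repeated sibling tags keep their document order in this sketch, so reordering several elements with the same name would still change the hash.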

By hashing each unique entry in the document, you avoid having to reimport the entire thing when a single entry changes. Stick the per-entry hash in with the rest of the data in your database, with the information that makes up the unique key. When the hash changes, the XML has changed, and it's worth re-importing.
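
One way to keep the per-entry hash next to the key is sketched here with an in-memory SQLite database, again in Python. The `listings` table, its column names, and the `store_if_changed` helper are all made up for illustration; a real table would also hold the listing's actual fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    " mls_name TEXT NOT NULL,"
    " mls_id TEXT NOT NULL,"
    " content_hash TEXT NOT NULL,"
    " PRIMARY KEY (mls_name, mls_id))"
)


def store_if_changed(conn, key, content_hash):
    """Return True when the listing is new or its stored hash differs."""
    row = conn.execute(
        "SELECT content_hash FROM listings WHERE mls_name = ? AND mls_id = ?",
        key,
    ).fetchone()
    if row is not None and row[0] == content_hash:
        return False  # unchanged: skip the expensive re-import
    # The real import of the listing's fields would happen here.
    conn.execute(
        "INSERT OR REPLACE INTO listings (mls_name, mls_id, content_hash)"
        " VALUES (?, ?, ?)",
        (key[0], key[1], content_hash),
    )
    return True


key = ("SFAR", "582649")
first = store_if_changed(conn, key, "aaa")   # new listing
second = store_if_changed(conn, key, "aaa")  # same hash as before
third = store_if_changed(conn, key, "bbb")   # hash changed
print(first, second, third)  # True False True
```

With thousands of items, loading all stored hashes in one query and comparing in memory would likely be cheaper than one lookup per listing.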

And again, once you have that unique key, it's amazingly easy to detect new listings. No matching key in the database? Import.

Likewise, it's amazingly easy to detect deleted listings. Key's in the database but isn't in the XML? Maybe it should be nuked.
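
Put together, new, changed, and deleted listings fall out of a few set operations on the keys. The sketch below (Python, with a hypothetical `classify` helper and made-up keys and hashes) assumes both sides have already been reduced to a mapping from unique key to per-entry hash.

```python
def classify(stored, incoming):
    """Compare two key -> hash mappings; return (new, changed, deleted) key sets."""
    stored_keys = set(stored)
    incoming_keys = set(incoming)
    new = incoming_keys - stored_keys        # in the feed, not in the database
    deleted = stored_keys - incoming_keys    # in the database, not in the feed
    changed = {
        key
        for key in stored_keys & incoming_keys
        if stored[key] != incoming[key]      # same key, different hash
    }
    return new, changed, deleted


stored = {("SFAR", "1"): "h1", ("SFAR", "2"): "h2", ("SFAR", "3"): "h3"}
incoming = {("SFAR", "2"): "h2", ("SFAR", "3"): "h3-edited", ("SFAR", "4"): "h4"}

new, changed, deleted = classify(stored, incoming)
print(sorted(new))      # [('SFAR', '4')]
print(sorted(changed))  # [('SFAR', '3')]
print(sorted(deleted))  # [('SFAR', '1')]
```

Because a truncated or failed download would make every missing listing look deleted, it may be worth checking that the feed parsed completely before acting on the `deleted` set.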
