Grabbing tagged Instagram photos in real time
Question
I'm trying to download photos posted with a specific tag in real time. I found the real-time API pretty useless, so I'm using a long-polling strategy. Below is pseudocode, with comments marking the subtle bugs in it.
newMediaCount = getMediaCount();
delta = newMediaCount - mediaCount;
if (delta > 0) {
    // Bug 1: if the media count has changed again by now, realDelta > delta, so
    // realDelta - delta photos won't be grabbed; on the next poll, if the count
    // didn't change again, realDelta - delta photos would be duplicated, else ...
    // Bug 2: if a photo is posted from a private account, the last photo is
    // duplicated, because the counter changes but nothing is added to recent media.
    recentMedia = getRecentMedia(delta);
    // persist recentMedia
    mediaCount = newMediaCount;
}
The second issue can be addressed with a Set of some sort, I guess. But the first one really bothers me. I've moved the two calls to the Instagram API as close together as possible, but is that enough?
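The Set idea for the second issue could look roughly like the sketch below. `SeenMediaFilter` and its `capacity` parameter are hypothetical names introduced here for illustration, and media are represented by plain string ids. Note that this only suppresses duplicates; it does nothing about photos that are skipped.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: remembers recently persisted media ids so repeats are skipped. */
class SeenMediaFilter {
    // LinkedHashSet keeps insertion order, so the oldest ids can be evicted first.
    private final Set<String> seenIds = new LinkedHashSet<>();
    private final int capacity;

    SeenMediaFilter(int capacity) {
        this.capacity = capacity;
    }

    /** Returns only the ids that were not seen before, in their original order. */
    List<String> filterNew(List<String> mediaIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : mediaIds) {
            if (seenIds.add(id)) { // add() returns false for an id already in the set
                fresh.add(id);
            }
        }
        // Evict the oldest ids so the set does not grow without bound.
        Iterator<String> oldest = seenIds.iterator();
        while (seenIds.size() > capacity && oldest.hasNext()) {
            oldest.next();
            oldest.remove();
        }
        return fresh;
    }
}
```

Each batch returned by a poll would be passed through `filterNew` before persisting, so a photo that shows up in two consecutive polls is stored only once.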
Edit
As Amir suggested, I've rewritten the code to use min/max_tag_id values, but it still skips photos. I couldn't find a better way to test this than saving images to disk for a while and comparing the result with instagram.com/explore/tags/.
public class LousyInstagramApiTest {

    @Test
    public void testFeedContinuity() throws Exception {
        Instagram instagram = new Instagram(Settings.getClientId());
        final String TAG_NAME = "portrait";
        String id = instagram.getRecentMediaTags(TAG_NAME).getPagination().getMinTagId();
        HashtagEndpoint endpoint = new HashtagEndpoint(instagram, TAG_NAME, id);
        for (int i = 0; i < 10; i++) {
            Thread.sleep(3000);
            endpoint.recentFeed().forEach(d -> {
                try {
                    URL url = new URL(d.getImages().getLowResolution().getImageUrl());
                    BufferedImage img = ImageIO.read(url);
                    ImageIO.write(img, "png", new File("D:\\tmp\\" + d.getId() + ".png"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }
}
class HashtagEndpoint {
    private final Instagram instagram;
    private final String hashtag;
    private String minTagId;

    public HashtagEndpoint(Instagram instagram, String hashtag, String minTagId) {
        this.instagram = instagram;
        this.hashtag = hashtag;
        this.minTagId = minTagId;
    }

    public List<MediaFeedData> recentFeed() throws InstagramException {
        TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null);
        List<MediaFeedData> dataList = feed.getData();
        if (dataList.size() == 0) return Collections.emptyList();
        String maxTagId = feed.getPagination().getNextMaxTagId();
        if (maxTagId != null && maxTagId.compareTo(minTagId) > 0) dataList.addAll(paginateFeed(maxTagId));
        Collections.reverse(dataList);
        // dataList.removeIf(d -> d.getId().compareTo(minTagId) < 0);
        minTagId = feed.getPagination().getMinTagId();
        return dataList;
    }

    private Collection<? extends MediaFeedData> paginateFeed(String maxTagId) throws InstagramException {
        System.out.println("pagination required");
        List<MediaFeedData> dataList = new ArrayList<>();
        do {
            TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, null, maxTagId);
            maxTagId = feed.getPagination().getNextMaxTagId();
            dataList.addAll(feed.getData());
        } while (maxTagId.compareTo(minTagId) > 0);
        return dataList;
    }
}
Recommended answer
When you use the tag endpoints to get the recent media with a desired tag, the response returns a min_tag_id in its pagination info, which is tied to the most recently tagged media at the time of your call. Because the API also accepts a min_tag_id parameter, you can pass the value from your last query to receive only the media tagged after that query.
So, whatever polling mechanism you have, you just call the API with the last received min_tag_id to get the new recent media, if any.
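A minimal sketch of that cursor idea, run against an in-memory stand-in rather than the real API. `FakeTagFeed`, `CursorPoller`, and the numeric tag ids are all assumptions made for illustration. The point it shows is that the cursor is advanced from the same response that delivered the data, so there is no second call that could race with it, unlike the count-based pseudocode in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the tag endpoint: each tagged item gets an increasing id. */
class FakeTagFeed {
    private final List<Long> tagIds = new ArrayList<>();
    private long nextId = 1;

    /** Simulates someone posting a photo with the tag. */
    void post() {
        tagIds.add(nextId++);
    }

    /** Returns every tag id strictly greater than minTagId, oldest first. */
    List<Long> recentSince(long minTagId) {
        List<Long> result = new ArrayList<>();
        for (long id : tagIds) {
            if (id > minTagId) {
                result.add(id);
            }
        }
        return result;
    }
}

/** Polls the feed and remembers the newest tag id it has received so far. */
class CursorPoller {
    private final FakeTagFeed feed;
    private long minTagId;

    CursorPoller(FakeTagFeed feed, long startAfter) {
        this.feed = feed;
        this.minTagId = startAfter;
    }

    /** Returns items tagged since the previous poll and advances the cursor. */
    List<Long> poll() {
        List<Long> fresh = feed.recentSince(minTagId);
        if (!fresh.isEmpty()) {
            // The cursor comes from the data just received, not from a separate counter call.
            minTagId = fresh.get(fresh.size() - 1);
        }
        return fresh;
    }
}
```

With this shape, a poll that finds nothing leaves the cursor untouched, and items posted between two polls are all picked up by the later one.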
You will also need to pass a large count parameter and follow the pagination of the response, so that you receive all the data without losing anything when tagging is faster than your polling.
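Following the pagination can be sketched as a loop that keeps requesting older pages until the server reports no further page. `Page`, `PagedTagFeed`, and `PageDrainer` are hypothetical, and the fake feed's semantics (newest-first pages, exclusive bounds, numeric ids) are assumptions for illustration rather than documented behaviour of the Instagram API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One page of results, newest first, plus a cursor for the next (older) page. */
class Page {
    final List<Long> ids;
    final Long nextMaxTagId; // null when nothing older remains in the requested range

    Page(List<Long> ids, Long nextMaxTagId) {
        this.ids = ids;
        this.nextMaxTagId = nextMaxTagId;
    }
}

/** Hypothetical feed that never returns more than pageSize items per call. */
class PagedTagFeed {
    private final List<Long> all = new ArrayList<>(); // ascending tag ids
    private final int pageSize;

    PagedTagFeed(int pageSize, long total) {
        this.pageSize = pageSize;
        for (long id = 1; id <= total; id++) {
            all.add(id);
        }
    }

    /** Up to pageSize ids with minTagId < id < maxTagId, newest first. */
    Page fetch(long minTagId, long maxTagId) {
        List<Long> ids = new ArrayList<>();
        for (int i = all.size() - 1; i >= 0 && ids.size() < pageSize; i--) {
            long id = all.get(i);
            if (id > minTagId && id < maxTagId) {
                ids.add(id);
            }
        }
        Long next = null;
        if (!ids.isEmpty()) {
            long oldest = ids.get(ids.size() - 1);
            for (long id : all) {
                if (id > minTagId && id < oldest) { // something older is still in range
                    next = oldest;
                    break;
                }
            }
        }
        return new Page(ids, next);
    }
}

class PageDrainer {
    /** Collects every item newer than minTagId, oldest first, across all pages. */
    static List<Long> fetchAllSince(PagedTagFeed feed, long minTagId) {
        List<Long> collected = new ArrayList<>();
        Long maxTagId = Long.MAX_VALUE;
        while (maxTagId != null) {
            Page page = feed.fetch(minTagId, maxTagId);
            collected.addAll(page.ids);   // keep this page before asking for another
            maxTagId = page.nextMaxTagId; // null ends the loop after the last page
        }
        Collections.reverse(collected);
        return collected;
    }
}
```

The loop adds each page's data before it checks for a further page, so the final page is not dropped.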
Update:
Based on your updated code:
public List<MediaFeedData> recentFeed() throws InstagramException {
    TagMediaFeed feed = instagram.getRecentMediaTags(hashtag, minTagId, null, 100000);
    List<MediaFeedData> dataList = feed.getData();
    if (dataList.size() == 0) return Collections.emptyList();

    // follow the pagination
    MediaFeed recentMediaNextPage = instagram.getRecentMediaNextPage(feed.getPagination());
    while (recentMediaNextPage.getPagination() != null) {
        dataList.addAll(recentMediaNextPage.getData());
        recentMediaNextPage = instagram.getRecentMediaNextPage(recentMediaNextPage.getPagination());
    }

    Collections.reverse(dataList);
    minTagId = feed.getPagination().getMinTagId();
    return dataList;
}