Getting IMDb movie titles in a specific language

Question

I am writing a crawler in Java that examines an IMDb movie page and extracts some info such as the name, year, etc. The user writes (or copy/pastes) the link to the title and my program should do the rest.

After examining the HTML sources of several IMDb pages and reading up on how crawlers work, I managed to write some code.

The info I get (for example, the title) is in my mother tongue. If there is no info in my mother tongue, I get the original title. What I want is to get the title in a specific language of my choosing.

I'm fairly new to this, so correct me if I'm wrong, but I think I get the results in my mother tongue because IMDb "sees" that I'm from Serbia and then customizes the results for me. So basically I need to tell it somehow that I prefer results in English? Is that possible (I imagine it is), and how do I do it?

Edit: The program crawls like this: it takes the URL path as a String, converts it to a URL, reads the whole page source with a BufferedReader and inspects what it gets. I'm not sure if that is the right way to do it, but it works (minus the language problem). Code:

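import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Info is the asker's own result class (not shown in the question).
public static Info crawlUrl(String urlPath) throws IOException {
    Info info = new Info();

    // Open a connection to the page given by the user.
    URL url = new URL(urlPath);
    URLConnection uc = url.openConnection();

    // Read the HTML line by line; try-with-resources closes the reader.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            uc.getInputStream(), StandardCharsets.UTF_8))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Print any line that contains the page's <title> tag.
            if (inputLine.contains("<title>")) {
                System.out.println(inputLine);
            }
        }
    }
    return info;
}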

This code goes through a page and prints the main title to the console.

Recommended answer

Take a look at the request headers your crawler sends. My browser sends Accept-Language: fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4, so I get the title in French.

I checked with the ModifyHeaders add-on in Google Chrome: setting the value to en-US gets me the English title for the movie =)
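
For a Java crawler built on URLConnection, the same idea could look roughly like the sketch below. It only illustrates setting the header with URLConnection.setRequestProperty before the request is sent; the class name LanguageCrawler, the helpers openWithLanguage and printTitleLines, and the header value "en-US,en;q=0.8" are made up for this example and are not from the question. Whether IMDb honours the header on every page is an assumption based on the browser test above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class LanguageCrawler {

    // Hypothetical helper: creates the connection and sets Accept-Language.
    // openConnection() does not contact the server yet, so the header can
    // still be set here; it is sent when the input stream is requested.
    static URLConnection openWithLanguage(String urlPath, String acceptLanguage)
            throws IOException {
        URLConnection uc = new URL(urlPath).openConnection();
        uc.setRequestProperty("Accept-Language", acceptLanguage);
        return uc;
    }

    // Same loop as in the question: prints every line containing <title>.
    static void printTitleLines(String urlPath, String acceptLanguage)
            throws IOException {
        URLConnection uc = openWithLanguage(urlPath, acceptLanguage);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                uc.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                if (inputLine.contains("<title>")) {
                    System.out.println(inputLine);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java LanguageCrawler <title-page-url>");
            return;
        }
        // Assumed preference: US English first, then any English.
        printTitleLines(args[0], "en-US,en;q=0.8");
    }
}
```

To try it, pass the URL of a title page as the only argument. If the title still comes back localized, the site may be relying on something other than this header (cookies or location, for instance), which this sketch does not address.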
