Canonicalize NFL team names

Question

This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)

The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.

I should also add that in case anyone knows of a source of data containing both moneyline Vegas odds as well as actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:

  • http://www.footballlocks.com/nfl_odds.shtml
  • http://www.repole.com/sun4cast/freepick.shtml

Ideas for better, more parsable, sources of data are very welcome!

Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the smallest Levenshtein distance to the input?
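For what it's worth, the Levenshtein idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `levenshtein` is the textbook dynamic-programming edit distance, and `TEAMS` and `nearest_team` are hypothetical names with a four-team sample list standing in for the full set of 32.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical sample; a real table would list all 32 teams.
TEAMS = ["San Francisco 49ers", "Seattle Seahawks",
         "San Diego Chargers", "New York Giants"]


def nearest_team(s: str, teams=TEAMS) -> str:
    """Return the team name with the smallest case-insensitive edit distance to s."""
    s = s.lower()
    return min(teams, key=lambda t: levenshtein(s, t.lower()))
```

Whole-string edit distance copes well with typos in a full name but penalizes short inputs such as a bare city or nickname, so it is probably best used as a tie-breaker alongside substring matching rather than on its own.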

Recommended answer

Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.

Scan[(fullname[First@#] = #[[2]])&, {
  {"ari", "Arizona Cardinals"},                 {"atl", "Atlanta Falcons"}, 
  {"bal", "Baltimore Ravens"},                  {"buf", "Buffalo Bills"}, 
  {"car", "Carolina Panthers"},                 {"chi", "Chicago Bears"}, 
  {"cin", "Cincinnati Bengals"},                {"clv", "Cleveland Browns"}, 
  {"dal", "Dallas Cowboys"},                    {"den", "Denver Broncos"}, 
  {"det", "Detroit Lions"},                     {"gbp", "Green Bay Packers"}, 
  {"hou", "Houston Texans"},                    {"ind", "Indianapolis Colts"}, 
  {"jac", "Jacksonville Jaguars"},              {"kan", "Kansas City Chiefs"}, 
  {"mia", "Miami Dolphins"},                    {"min", "Minnesota Vikings"}, 
  {"nep", "New England Patriots"},              {"nos", "New Orleans Saints"}, 
  {"nyg", "New York Giants NYG"},               {"nyj", "New York Jets NYJ"}, 
  {"oak", "Oakland Raiders"},                   {"phl", "Philadelphia Eagles"}, 
  {"pit", "Pittsburgh Steelers"},               {"sdc", "San Diego Chargers"}, 
  {"sff", "San Francisco 49ers forty-niners"},  {"sea", "Seattle Seahawks"}, 
  {"stl", "St Louis Rams"},                     {"tam", "Tampa Bay Buccaneers"}, 
  {"ten", "Tennessee Titans"},                  {"wsh", "Washington Redskins"}}]

Then, for any given string, find the longest common subsequence for each of the full names of the teams. To give preference to strings matching at the beginning or the end (eg, "car" should match "carolina panthers" rather than "arizona cardinals") sandwich both the input string and the full names between spaces. Whichever team's full name has the [sic:] longest longest-common-subsequence with the input string is the team we return. Here's a Mathematica implementation of the algorithm:

(* Team codes: the arguments for which fullname was defined above. *)
teams = DownValues[fullname][[All, 1, 1, 1]];

(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]

canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ", 
                                 " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
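For readers without Mathematica, the same approach can be sketched in Python. This is a rough port under stated assumptions, not the answer's own code. As far as I know, Mathematica's `LongestCommonSubsequence` matches contiguous runs, so the helper below computes a longest common substring. The names `FULLNAME` and `longest_common_substring_len` are introduced here for illustration, and lower-casing both strings stands in for `IgnoreCase->True`.

```python
# Canonical 3-letter code -> spelled-out name plus aliases, copied from the table above.
FULLNAME = {
    "ari": "Arizona Cardinals", "atl": "Atlanta Falcons",
    "bal": "Baltimore Ravens", "buf": "Buffalo Bills",
    "car": "Carolina Panthers", "chi": "Chicago Bears",
    "cin": "Cincinnati Bengals", "clv": "Cleveland Browns",
    "dal": "Dallas Cowboys", "den": "Denver Broncos",
    "det": "Detroit Lions", "gbp": "Green Bay Packers",
    "hou": "Houston Texans", "ind": "Indianapolis Colts",
    "jac": "Jacksonville Jaguars", "kan": "Kansas City Chiefs",
    "mia": "Miami Dolphins", "min": "Minnesota Vikings",
    "nep": "New England Patriots", "nos": "New Orleans Saints",
    "nyg": "New York Giants NYG", "nyj": "New York Jets NYJ",
    "oak": "Oakland Raiders", "phl": "Philadelphia Eagles",
    "pit": "Pittsburgh Steelers", "sdc": "San Diego Chargers",
    "sff": "San Francisco 49ers forty-niners", "sea": "Seattle Seahawks",
    "stl": "St Louis Rams", "tam": "Tampa Bay Buccaneers",
    "ten": "Tennessee Titans", "wsh": "Washington Redskins",
}


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at the previous char of a and b[j-1]
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def canonicalize(s: str) -> str:
    """Return the code whose padded full name shares the longest run with padded s."""
    padded = f" {s.lower()} "
    # max() keeps the first maximal key, so ties go to the earlier table entry.
    return max(
        FULLNAME,
        key=lambda code: longest_common_substring_len(
            padded, f" {FULLNAME[code].lower()} "),
    )
```

With this table, inputs such as "San Francisco", "49ers" and "SF forty-niners" all come back as "sff". Ties are broken by table order, so an ambiguous input still returns some team; a caller who cares could also check the match length and reject weak matches.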
