Let’s begin with the obvious: no visualization or analysis will generate the perfect bracket. There are just too many variables involved, and I haven’t even attempted it. What this article does present is several visualizations that can be used as tools toward a better bracket prediction. They are not a substitute for expert opinion. But if you’re less than an expert yourself, they might help you fine-tune your bracket this year and for years to come. While creating these visualizations, I learned a few lessons along the way.
Lesson 1, Some results just ask for more data
If you know little about NCAA basketball, looking at seeding can give you a good start toward the right bracket, but it will only take you so far. In charting the data, I expected something close to a straight line at a 45-degree angle. The actual results are quite different.
I started this process with five years of data. That’s 8 first-round games in each of 4 regions per year, 160 games total. Since every game involves two seeds, this works out to 20 games for each seed (1-16). This area chart shows the total number of games won and lost by seed:
What we get is not an angled line, but a descending wave. The green portion shows total wins, while the blue portion shows total losses. As an example, Seed 1 has 20 wins and 0 losses, telling us that a number 1 seeded team has not lost in the first round in at least 5 years. This is reflected at the other end of the chart by Seed 16, showing all losses. In fact, because of how first-round games are set up (seed 1 plays seed 16, seed 2 plays seed 15, etc.), the right half of the chart will always be a perfect inverse reflection of the left half.
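The counting behind a chart like this can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the list of winners is a made-up placeholder rather than real tournament results. It shows why each seed gets 20 games over five years, and why the loss column is forced to mirror the win column.

```python
from collections import Counter

def games_per_seed(years, regions=4):
    """Each seed plays exactly one first-round game per region per year."""
    return years * regions

def opponent_seed(seed):
    """First-round pairings always sum to 17: 1 v 16, 2 v 15, ... 8 v 9."""
    return 17 - seed

def tally_by_seed(winning_seeds):
    """Count first-round wins and losses per seed from the winners' seeds."""
    wins, losses = Counter(), Counter()
    for seed in winning_seeds:
        wins[seed] += 1
        # Every win for one seed is a loss for the seed it was paired with.
        losses[opponent_seed(seed)] += 1
    return wins, losses

# Hypothetical placeholder: winners of the 8 games in one region (not real results).
sample_winners = [1, 2, 3, 4, 12, 6, 10, 9]
wins, losses = tally_by_seed(sample_winners)

print(games_per_seed(5), games_per_seed(10))  # 20 and 40 games per seed
for seed in range(1, 17):
    print(f"seed {seed:2d}: {wins[seed]} wins, {losses[seed]} losses")
```

Because a loss is only ever recorded against the opposing seed, the wins for seed s always equal the losses for seed 17 - s, which is the mirror effect in the chart.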
Lesson 2, More data doesn’t always clear things up
Getting back to the visualization, I wanted to know if the wave pattern was a fluke of the last 5 years or something deeper. This led me to double the data set to 10 years of history, or 40 games per seed:
Over 10 years, the wave pattern looks very similar to the five-year pattern. At first glance, it appears that the first ‘hill’ (seeds 7 and 8) is an anomaly. Looking deeper, the real anomaly is the valley created by seeds 5/6 (which is mirrored at 11/12). To get a clearer look, I chose to visualize the same data in a different format.
Lesson 3, Waves are visually pleasing, but bars can be more precise
Same data, slightly different presentation. On the chart below, I have switched to bars, one for each seed. I have also moved from a count of wins and losses to a percentage of each. At 50% wins and 50% losses, a coin toss would be just as effective, so I marked off the area in the middle (between 40% and 60%). If a seed has bars that meet in this area, it is an ineffectual predictor.
It is now clear that in games involving seeds 1, 2, 3 and 4, these teams win more than 80% of the time. The anomalous areas (circled in red), the games involving seeds 5 and 6, are now much easier to quantify. These teams win just over 50% of the time, which makes seed an ineffectual predictor. Seed 7 wins about 65% of the time, which isn’t quite enough to call it a predictor. In summary, about half the games, those involving seeds 1 through 4, can be fairly well predicted based on seed alone.
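As a rough sketch of the percentage view, the snippet below converts win/loss counts to a win percentage and checks whether it lands in the 40% to 60% coin-toss band. The function names are my own, and the seed 5 record is a hypothetical placeholder, not real results; only seed 1's 20-0 record comes from the five-year data described earlier.

```python
def win_pct(wins, losses):
    """Share of games won, as a percentage."""
    return 100.0 * wins / (wins + losses)

def is_toss_up(pct, low=40.0, high=60.0):
    """True when the win percentage falls inside the marked coin-toss band."""
    return low <= pct <= high

# (wins, losses) per seed. Seed 1's 20-0 is the five-year figure from the text;
# the seed 5 record is a hypothetical placeholder, not real results.
records = {1: (20, 0), 5: (11, 9)}

# Mirror each record onto the opposing seed (17 - seed), swapping wins and losses.
records.update({17 - s: (l, w) for s, (w, l) in list(records.items())})

for seed in sorted(records):
    pct = win_pct(*records[seed])
    label = "toss-up" if is_toss_up(pct) else "usable predictor"
    print(f"seed {seed:2d}: {pct:.1f}% wins ({label})")
```

The band edges are parameters, so a stricter cutoff (one that also excludes a seed winning about 65% of the time, for instance) is a one-line change.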
Lesson 4, A secondary measure can help with decisions
Now we know that seed alone isn’t enough to make a prediction for half of the first-round games. I have created a simple line chart using team season win/loss ratios as a secondary predictor. I am only looking at games for seeds 5-12. Note: seeds 5, 6, 7, and 8 play seeds 12, 11, 10 and 9 respectively, which is why the chart only shows four games. For each game, the aggregate win/loss ratio of the winning teams and of the losing teams is charted. On average, teams with the better season record are winning these games. It is not a strong predictor on its own, but it does add a layer to our ability to make a decision.
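One way to compute a secondary measure like this is sketched below. It assumes "aggregate" means pooling season records (total wins divided by total losses) across all winners, and separately across all losers, of a given matchup; that reading, the function names, and the sample records are all assumptions for illustration, not real data.

```python
from collections import defaultdict

def aggregate_ratio(records):
    """Pool (wins, losses) season records and return total wins / total losses."""
    total_wins = sum(w for w, _ in records)
    total_losses = sum(l for _, l in records)
    return total_wins / total_losses

def ratios_by_matchup(games):
    """Return {matchup: (winners' aggregate ratio, losers' aggregate ratio)}."""
    winners, losers = defaultdict(list), defaultdict(list)
    for matchup, winner_record, loser_record in games:
        winners[matchup].append(winner_record)
        losers[matchup].append(loser_record)
    return {m: (aggregate_ratio(winners[m]), aggregate_ratio(losers[m]))
            for m in winners}

# Hypothetical placeholder games:
# (matchup, winner's season record, loser's season record), records as (wins, losses).
sample_games = [
    ("5 v 12", (24, 8), (22, 10)),
    ("5 v 12", (26, 8), (23, 10)),
    ("8 v 9", (21, 11), (20, 12)),
]

ratios = ratios_by_matchup(sample_games)
for matchup, (won, lost) in ratios.items():
    print(f"{matchup}: winners {won:.2f}, losers {lost:.2f}")
```

Pooling the records before dividing keeps one team with a lopsided record from dominating the figure, which an average of per-team ratios would allow.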
Lesson 5, Heat Maps visualize complex relationships
To this point, my analysis has been based on the two strong measures available: seed and season record. I still have other team attributes available, such as conference, team, and coach. Pulling data from these attributes is a more complex process. I started with conference, using a visualization similar to the first one in this article:
What does this tell us? Some conferences clearly win more than they lose. A couple of conferences, America East and Big Sky, haven’t had a first-round win in the ten years of data. But there isn’t enough information in this chart to help with making a first-round pick.
Games aren’t played by a single team, so I switched to a comparative approach using a heat map:
This is a chart of every time one conference is pitted against another. The colors are coded from the perspective of the conference on the Y axis. If we are looking at America East, the chart tells us they have historically lost against everyone, as noted above. In comparison, Ohio Valley loses to the ACC, Big 12 and Pac-12, wins against Mountain West and the SEC, and breaks even (a push) against the Big East.
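The table behind a heat map like this can be sketched as a nested lookup of head-to-head records, read from the perspective of the row conference. The function names and the "Conf A/B/C" games below are hypothetical placeholders, not real results.

```python
from collections import defaultdict

def head_to_head(games):
    """Build table[row][col] = [wins, losses] for the row conference against col."""
    table = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for winner_conf, loser_conf in games:
        table[winner_conf][loser_conf][0] += 1  # a win from the winner's row
        table[loser_conf][winner_conf][1] += 1  # a loss from the loser's row
    return table

def cell_label(wins, losses):
    """Classify one cell the way the heat map colors it."""
    if wins > losses:
        return "wins"
    if wins < losses:
        return "loses"
    return "push"

# Hypothetical placeholder games as (winning conference, losing conference).
sample_games = [
    ("Conf A", "Conf B"),
    ("Conf A", "Conf B"),
    ("Conf B", "Conf A"),
    ("Conf C", "Conf A"),
    ("Conf A", "Conf C"),
]

table = head_to_head(sample_games)
for row in sorted(table):
    for col in sorted(table[row]):
        wins, losses = table[row][col]
        print(f"{row} vs {col}: {wins}-{losses} ({cell_label(wins, losses)})")
```

Each game is recorded twice, once in each conference's row, so reading across a row gives that conference's view of every opponent it has faced.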
I’ve given you a few tools to help in making your round 1 picks. There will always be games that go against the grain and others where you just need to use your gut. Hopefully this can help cast light on some of the gray areas. In articles to come, I will look at later rounds and examine Cinderella teams and those that I call Step-sisters.