Contents

Instructions

Throughout this website, green text (like above) jumps to the named section of this website. Blue text links to outside websites. Hover over orange text to see relevant pop-up image and/or text.

Motivation

The MLB First-Year Player Draft (the Rule 4 Draft) involves some of the most important organizational decisions an MLB franchise makes. In an era where revenue and payroll disparities among teams are increasing, identifying and signing undervalued talent is essential for on-field competitiveness. Because the collective bargaining agreement between MLB and the MLBPA financially rewards veterans eligible for free agency, organizations want to sign and develop young players internally -- namely, through the draft.

All 30 MLB teams draft players in reverse order of their previous year's record for 50 rounds, with supplemental picks between the first and second rounds and between the second and third rounds (so 1,500+ players are drafted every year). High school, junior college, and four-year college players (who have either completed their junior or senior year or are at least 21 years old) from the U.S., Canada, or Puerto Rico are eligible. The MLB Labor Relations department issues "slot" recommendations for the money each overall pick number should sign for, but because these are not enforced, top players can demand an expensive contract. Teams unwilling to pay may pass on them, so they are picked lower than their talent suggests.

Given these circumstances, the book Moneyball and other research have analyzed draft strategy, specifically whether college or high school hitters or pitchers are better investments. No matter which type of player you select, the draft is very much an inexact science; the failure rate for top picks in the MLB draft may be higher than for the NFL, NBA, and NHL because players spend more years in the minors before making an impact in the pros. This website continues analysis in this predictive direction, but within the subset of NCAA D1 hitters and pitchers. Scouting is a combination of quantitative observations and qualitative explanations, and high school and lower-level college statistics are not reliable enough (inconsistent records, small sample sizes, and varying competition levels) for meaningful statistical analysis. Can NCAA D1 statistics predict professional performance? The pages here attempt to provide an answer.

Process

This website's statistical analysis follows a process known as exploratory data analysis (EDA). Running regressions "blind" based on biases using un-investigated data can lead to dangerous(ly inaccurate) conclusions. EDA relies on "seeing" the data through visual displays such as histograms, boxplots, and scatterplots. By identifying patterns as they naturally exist from the data, instead of attempting to validate a hypothesis against the data, a more complete picture can be understood.
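As a rough illustration of "seeing" a single variable before modeling it, the sketch below bins a handful of batting averages and prints a text histogram. The values are invented for illustration and the function name `bin_counts` is hypothetical; the site itself uses Data Desk, not this code.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count values per bin; bin k covers start + k*width up to start + (k+1)*width - 1."""
    return Counter((v - start) // width for v in values)

# Hypothetical batting averages in points (312 means .312).
# Integers are used so bin edges are not affected by floating-point rounding.
avg_points = [231, 248, 265, 271, 279, 284, 290, 296, 301, 305, 318, 327, 344, 402]
counts = bin_counts(avg_points, start=200, width=50)

# Print one row per bin, including empty bins, so gaps stay visible.
for k in range(min(counts), max(counts) + 1):
    low = 200 + k * 50
    print(f".{low}-.{low + 49}  {'#' * counts[k]}")
```

Even in this toy sample, the empty .350-.399 bin and the lone value above .400 stand out in the display, which is the kind of pattern a regression run "blind" would not reveal.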

The following pages support such a philosophy, as extensive single-variable (variables page) and double-variable (relationships page) analysis informs the multiple regression models that are built on the last page. While the prediction tool on the home page gives a precise answer to out-of-sample data (projections), the other content here provides a crucial background perspective.

EDA is meant to be an interactive and user-driven process. While the statistical software used (Data Desk) does not easily translate to web interactivity, forms and hovers are meant to recreate the experience as much as possible.

Data

All MLB draft and performance data are from Baseball Reference, while all NCAA D1 player, statistics, and team information is from Boyd's World. The data was partly joined using this data standardizer.

There are 44,280 D1 seasons in the resulting database. Playing time cut-offs of 50 AB for hitters and 15 IP for pitchers ensure a standard minimum sample size and reduce the pool to 27,649 seasons (16,278 hitters and 11,371 pitchers). Of these seasons, 2,120 (8.3%) led to that player being drafted, and only 350 (1.3%) made it all the way to the major leagues. See these percentages broken down for hitters by AVG, and for pitchers by ERA.
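The playing time cut-offs can be sketched as a simple filter. The record layout (`role`, `ab`, `ip`), the sample rows, and the choice to treat the cut-offs as inclusive are all assumptions made for illustration, not the database's actual schema.

```python
MIN_AB = 50    # minimum at-bats for a hitter season
MIN_IP = 15.0  # minimum innings pitched for a pitcher season

def meets_cutoff(season):
    """Return True if a season has enough playing time (cut-offs assumed inclusive)."""
    if season["role"] == "hitter":
        return season["ab"] >= MIN_AB
    if season["role"] == "pitcher":
        return season["ip"] >= MIN_IP
    return False

# Hypothetical season records, not real players.
seasons = [
    {"player": "A", "role": "hitter", "ab": 212, "ip": 0.0},
    {"player": "B", "role": "hitter", "ab": 31, "ip": 0.0},
    {"player": "C", "role": "pitcher", "ab": 0, "ip": 64.0},
    {"player": "D", "role": "pitcher", "ab": 0, "ip": 9.0},
]

qualified = [s for s in seasons if meets_cutoff(s)]
print([s["player"] for s in qualified])
```

Here players B and D fall below the cut-offs and drop out of the pool, just as low-playing-time seasons are dropped from the full database.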

Why years 2002-06? Over half of all D1 teams did not have 2001 and prior data on Boyd's World. Additionally, changes in the metal bat requirements prior to 2002 led to much higher offensive statistics than the relatively stable period that followed. The 2006 cut-off is somewhat subjective, as players from the 2007 draft class and subsequent years may not have played long enough entering the 2011 season to allow an accurate assessment of which draft picks proved valuable.

Sources of Uncertainty

Drafted Platform Year - Ideally, the database would average a draft-eligible player's career statistics against only those other draft-eligible players. Because the data didn't contain year or age fields, this analysis is unfortunately not possible. MLB scouting directors evaluate the total body of a player's career performance (often including summer leagues and high school career), not simply his hitting or pitching statistics the year in which he was drafted.

Signing Data - Data on drafted players' contracts and signing bonuses are not included. As previously discussed, players who sign for well above their slot recommendation tend to have more talent than their overall draft pick number would suggest. Additionally, the drop-off in talent between the 10th and 30th picks is much greater than between the 910th and 930th picks, a difference that signing bonuses reflect (2009 overall #1 pick Stephen Strasburg received a $7.5 million signing bonus, but bonuses after the 7th round tend to level out at less than $100,000). Re-expressing overall pick number helps to control for this, but not as well as actual bonus information would.
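To make the idea of re-expression concrete, the sketch below uses a natural log of overall pick number. The log is an assumption chosen for illustration; the text here does not say which re-expression the models actually use, and `reexpress_pick` is a hypothetical name.

```python
import math

def reexpress_pick(pick):
    """Natural log of overall pick number (an assumed re-expression)."""
    if pick < 1:
        raise ValueError("overall pick number must be 1 or greater")
    return math.log(pick)

# The same 20-pick gap, early and late in the draft.
early_gap = reexpress_pick(30) - reexpress_pick(10)    # ln(3), about 1.099
late_gap = reexpress_pick(930) - reexpress_pick(910)   # ln(930/910), about 0.022

print(f"picks 10 to 30:   {early_gap:.3f}")
print(f"picks 910 to 930: {late_gap:.3f}")
```

On the raw scale both gaps are 20 picks. On the log scale the early gap is roughly fifty times the late one, which is closer to how talent (and bonus money) actually falls off.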

Out-of-Sample Predictions - Because all models are based on 2002-06 data, predictions of MLB draft standing and performance from statistics recorded before or after these years are less reliable. Teams may change what college statistics they value for the draft, and NCAA D1 statistics vary over time according to rule changes, such as bat regulations. Extremely high or low individual inputs relative to this sample may also result in unsound predictions.
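One simple guard against unsound predictions is to flag any input that falls outside the range seen in the sample. The sketch below is a hypothetical check, not part of the site's prediction tool; the variable names and sample values are invented.

```python
def flag_extrapolation(inputs, sample):
    """Return the names of inputs that fall outside the range observed in the sample."""
    flagged = []
    for name, value in inputs.items():
        low, high = min(sample[name]), max(sample[name])
        if value < low or value > high:
            flagged.append(name)
    return flagged

# Hypothetical sample values for two hitter statistics.
sample = {
    "avg": [0.210, 0.275, 0.301, 0.344, 0.412],
    "hr": [0, 3, 7, 12, 21],
}

# A .480 average lies beyond anything in the sample; 10 home runs does not.
flagged = flag_extrapolation({"avg": 0.480, "hr": 10}, sample)
print(flagged)
```

A range check like this only catches extreme inputs. It says nothing about the other risk described above, that the relationship itself may have shifted since 2002-06.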

Outlier Effect - Is the purpose of the draft to acquire solid major league-ready talent, or to sign a high-risk, high-potential prospect in the hopes he develops into a superstar? Both types of picks are obviously important for any organization, but teams may vary in their draft strategy. However, from a statistics perspective, superstar outliers are just that -- outliers -- and are meant to be excluded from the data, not predicted. There are nearly 30,000 college seasons in the data, of which only 17 were turned in by players who would later be selected to a major league All-Star team. This boxplot of position and WAR% helps to identify the scale of these outliers among those who already made the majors. Such a large discrepancy between "failures" and "successes" means that attempting to predict quantity of major leaguers should be more reliable than predicting quality.
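Boxplots conventionally mark a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that common rule to invented performance values. Whether Data Desk computes quartiles exactly as Python's `statistics.quantiles` does is an assumption, and `boxplot_outliers` is a hypothetical name.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values beyond k * IQR from the quartiles (the usual boxplot fences)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical performance values for players who reached the majors.
war_values = [0.1, 0.3, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 2.0, 9.5, 14.0]

outliers = boxplot_outliers(war_values)
print(outliers)
```

In this toy sample the two largest values sit far above the upper fence while everyone else clusters near zero, mirroring the gap between "failures" and "successes" described above.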