How To Use Data Science & Machine Learning to Create Pre-Draft Rankings for Fantasy Baseball Leagues

04/2025
✍️ Author: Nick Brennan

🏁 The Goal Ain’t Just Ranking: It’s Predictive Advantage

Anyone can build a top 300 player list, slap names in tiers, and pretend they’re ready for draft day. But that’s just coloring inside the lines. The true edge in fantasy baseball comes from predictive rankings grounded in data science and machine learning. This isn’t a list—this is an algorithmic war machine tailored to your league’s scoring, your draft style, and your competition’s behavioral patterns.

Traditional rankings fail because they lean too heavily on last year’s stats or regurgitate consensus ADP from the fantasy hive mind. Instead, we aim to predict forward, using player trends, situational context, and probabilistic modeling to forecast who will actually win you categories or rack up points.

This process involves:

  • Engineering custom data pipelines
  • Building feature-rich historical datasets
  • Training machine learning models to project performance
  • Calibrating output to your league’s scoring system
  • Simulating entire draft flows to find inefficiencies

It’s not about being right every time. It’s about being more accurate than your opponents, more often. That 5% edge turns into league titles.


🔧 Step 1: Build Your Dataset – Clean, Feature-Rich, and Ruthless

Before anything else, you need data. And not just any data—you need a layered dataset with a mix of surface stats, advanced analytics, and contextual features. Your baseline should include at least:

Raw Data Sources:

  • FanGraphs (Statcast data, projections, splits)
  • Baseball Savant (xwOBA, EV, LA, Sprint Speed)
  • Baseball Reference (age, games played, team context)
  • ADP sources (NFBC, Fantrax, CBS, Yahoo)
  • Fantasy scoring logs (custom to your league)

Feature Engineering:

Here are some custom engineered metrics you can build:

import numpy as np
import pandas as pd

# Rolling average of xwOBA over the last 60 games
# (assumes a per-player game log sorted by date)
player_df['xwOBA_rolling60'] = player_df['xwOBA'].rolling(window=60).mean()

# Home park run factor adjustment
player_df['ParkAdj_HR'] = player_df['HR'] * player_df['HomeRunParkFactor']

# Injury Index (custom scale based on IL days and stint count)
player_df['InjuryIndex'] = np.where(player_df['IL_days'] > 60, 0.4, 0.1) + (player_df['IL_stints'] * 0.05)

# Age-adjusted speed index
player_df['SpeedAgeIndex'] = player_df['SprintSpeed'] / (player_df['Age'] ** 0.33)

Stat Normalization: Make sure your features are scaled appropriately. Normalize EV, Barrel%, K%, BB% with Min-Max or Z-score normalization. This prevents dominant stats from skewing ML model weightings.
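A minimal sketch with scikit-learn (column names are illustrative; use Min-Max when you want a bounded 0–1 range, z-scores when comparing across distributions):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

rate_cols = ['EV', 'Barrel%', 'K%', 'BB%']  # illustrative column names

# Z-score normalization: mean 0, standard deviation 1
player_df[rate_cols] = StandardScaler().fit_transform(player_df[rate_cols])

# Or Min-Max to a 0-1 range:
# player_df[rate_cols] = MinMaxScaler().fit_transform(player_df[rate_cols])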

Also, inject volatility signals:

  • Standard deviation in hard-hit rate
  • Launch angle consistency
  • K-BB fluctuation over time

Players with stable skill metrics are safer; volatile ones may need upside inflation.
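One hedged way to quantify those signals, assuming a per-player game log with HardHit% and K-BB% columns (the 30-game window is a tunable assumption):

# Rolling 30-game standard deviation of hard-hit rate, per player
player_df['HardHitVolatility'] = (
    player_df.groupby('PlayerID')['HardHit%']
    .transform(lambda s: s.rolling(window=30).std())
)

# Same idea for K-BB% fluctuation over time
player_df['K_BB_Volatility'] = (
    player_df.groupby('PlayerID')['K-BB%']
    .transform(lambda s: s.rolling(window=30).std())
)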


🧠 Step 2: Train Models to Predict Future Performance

Now that you’ve got your high-fidelity dataset, it’s time to build the crystal ball. We train machine learning models to project forward-facing fantasy stats. But first, let’s define your goals:

Prediction Targets (Dependent Variables):

  • Hitting: HR, SB, xOBP, Barrel%, TB-HR
  • Pitching: IP, K, SIERA, CSW%, SVH3, wOBAA
  • Fantasy Value: Total roto points or auction dollar value

You can train separate models for each category or build a stacked ensemble.
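If you take the ensemble route, here is a minimal sketch with scikit-learn's StackingRegressor (base estimators are illustrative; X_train and y_train are assumed to be your feature matrix and target):

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Two diverse base learners, blended by a simple linear meta-model
stack = StackingRegressor(
    estimators=[
        ('gbm', GradientBoostingRegressor()),
        ('rf', RandomForestRegressor(n_estimators=300)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)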

Model Selection:

from xgboost import XGBRegressor

# 300 boosted trees at depth 5: a sensible starting point before tuning
model = XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=300)

XGBoost is highly efficient and great at handling non-linear interactions. You can use GridSearchCV to optimize hyperparameters.
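A minimal tuning sketch, assuming X_train and y_train hold your features and target (grid values are illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
    'n_estimators': [200, 300, 500],
}
search = GridSearchCV(
    XGBRegressor(),
    param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
)
search.fit(X_train, y_train)
model = search.best_estimator_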

Custom Fantasy Value Formula:

Let’s say we want to project a fantasy hitter’s roto value in a 7×7 league:

player_df['FantasyValue'] = (
    (player_df['HR'] * 2.5) +
    (player_df['SB'] * 2.0) +
    (player_df['xOBP'] * 100) +
    (player_df['Barrel%'] * 1.5) +
    ((player_df['TB'] - player_df['HR']) * 0.75) +
    ((player_df['Runs'] + player_df['RBI']) * 0.75) +
    (player_df['OPS'] * 50)
)

Weightings are adjusted per league scoring tendencies and positional scarcity.

Train/Test Split: Always validate on held-out seasons the model has never seen. For example:

  • Train on 2021-2023
  • Validate on 2024 outcomes

Use R² score, MAE, RMSE to gauge performance.
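A minimal validation sketch, assuming a Season column and a feature_cols list (both names are hypothetical):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

train = player_df[player_df['Season'].between(2021, 2023)]
test = player_df[player_df['Season'] == 2024]

model.fit(train[feature_cols], train['FantasyValue'])
preds = model.predict(test[feature_cols])

print('R2:  ', r2_score(test['FantasyValue'], preds))
print('MAE: ', mean_absolute_error(test['FantasyValue'], preds))
print('RMSE:', np.sqrt(mean_squared_error(test['FantasyValue'], preds)))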


🔄 Step 3: Incorporate Human Noise – League ADP & Behavioral Clustering

Every league has that guy who drafts closers in round 6 or hoards prospects for “value.” These tendencies matter. You’re not just modeling players—you’re modeling the market.

Integrating ADP:

# Deviation Score = Market Rank (ADP) - Your Model Rank
player_df['DeviationScore'] = player_df['ADP'] - player_df['ModelRank']

Positive deviation = the market drafts him later than your model ranks him: underrated. Negative = overhyped.

Cluster your players by volatility and role security. Use:

from sklearn.cluster import KMeans

# Cluster on risk-related features (scale them first so no single feature dominates)
features = player_df[['DeviationScore', 'InjuryIndex', 'InconsistencyMetric']]
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
player_df['RiskCluster'] = kmeans.fit_predict(features)

Now label them (KMeans cluster IDs are arbitrary, so inspect each cluster's centroids before naming):

  • Cluster 0: Boring but reliable
  • Cluster 1: High ceiling, high risk
  • Cluster 2: Aging vets with volume
  • Cluster 3: Injury rebounders
  • Cluster 4: Young wild cards
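Once you know which ID is which archetype, map them onto the frame (the assignments below are hypothetical; yours will differ):

cluster_labels = {
    0: 'Boring but reliable',
    1: 'High ceiling, high risk',
    2: 'Aging vets with volume',
    3: 'Injury rebounders',
    4: 'Young wild cards',
}
player_df['RiskProfile'] = player_df['RiskCluster'].map(cluster_labels)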

You can choose players based on roster context, not just raw value.


🎯 Step 4: Custom League Fit – Reverse Engineer the Point System

A fantasy league’s scoring system is the terrain. If you’re not mapping your model to it, you’re blind.

Let’s say your league uses this 7×7 format:

Hitting:

  • OPS, HR, Net SB, Barrel%, xOBP, TB-HR, Run Impact

Pitching:

  • SIERA, WHIP, CSW%, IP, SVH3, QA4, wOBAA

You must now re-weight all projection outputs.

Custom League Scoring Adjustments:

# Re-weighted fantasy points per hitting category (weights are illustrative)
player_df['LeagueAdjustedHittingValue'] = (
    (player_df['HR'] * 3.0) +
    (player_df['NetSB'] * 2.0) +
    (player_df['Barrel%'] * 2.5) +
    ((player_df['TB'] - player_df['HR']) * 1.2) +
    (player_df['xOBP'] * 100) +
    (player_df['OPS'] * 50) +
    (player_df['RunImpact'] * 1.0)  # Run Impact category; column name assumed
)

Do the same for pitchers. Simulate full seasons with Monte Carlo methods using 1,000 runs per player based on their stat variance. Then rank by mean outcome.
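A minimal Monte Carlo sketch, assuming you've estimated each player's projection spread in a ValueStd column (a fuller version would simulate each stat separately):

import numpy as np

rng = np.random.default_rng(seed=42)
N_SIMS = 1000

# Draw 1,000 season outcomes per player around the projection
sims = rng.normal(
    loc=player_df['LeagueAdjustedHittingValue'].to_numpy()[:, None],
    scale=player_df['ValueStd'].to_numpy()[:, None],
    size=(len(player_df), N_SIMS),
)
player_df['SimMean'] = sims.mean(axis=1)
player_df['SimFloor'] = np.percentile(sims, 10, axis=1)
player_df['SimCeiling'] = np.percentile(sims, 90, axis=1)

# Rank by mean simulated outcome
player_df = player_df.sort_values('SimMean', ascending=False)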

Account for positional scarcity by calculating z-scores per position:

# Z-score HR within each position group
pos_hr = player_df.groupby('Position')['HR']
player_df['z_HR'] = (player_df['HR'] - pos_hr.transform('mean')) / pos_hr.transform('std')

Position-level z-scores show category dominance relative to positional peers, not the league as a whole.


🥇 Step 5: The ML Draft Board – Real-Time Adjustments + Tier Mapping

This is the part where your spreadsheet becomes sentient.

Create live tiers from your projections (quantile cuts are the simplest version; confidence-interval overlap can refine the boundaries):

# Quintile tiers on projected value (A = top quintile, E = bottom)
player_df['Tier'] = pd.qcut(player_df['FantasyValue'], q=5, labels=["E", "D", "C", "B", "A"])

Draft Room AI

  • Predict opponents’ picks based on roster construction
  • Flag best available at each position with positional drop-off alerts
  • Re-weight values if you’re punting or stacking a category

Integrate it with Streamlit or Gradio so you can draft with a real-time war table.
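A minimal Streamlit sketch (file and column names are assumptions; run it with streamlit run draft_board.py):

import pandas as pd
import streamlit as st

st.title('Draft War Table')

board = pd.read_csv('draft_board.csv')  # hypothetical export of your rankings

position = st.selectbox('Position', ['ALL'] + sorted(board['Position'].unique()))
view = board if position == 'ALL' else board[board['Position'] == position]

st.dataframe(view.sort_values('FantasyValue', ascending=False))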

Bonus Custom Formula: DVOR

Draft Value Over Replacement:

DVOR = PlayerProjectedValue - PositionalReplacementLevel

Apply DVOR as your final draft ranking, broken down by:

  • Value
  • Scarcity
  • Risk tier
  • ADP differential
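A sketch of the replacement-level term, assuming a hypothetical 12-team league (the starter counts below are illustrative):

# Starters drafted per position in a 12-team league (illustrative)
ROSTER_SLOTS = {'C': 12, '1B': 12, '2B': 12, 'SS': 12, '3B': 12, 'OF': 60, 'SP': 84, 'RP': 24}

def replacement_level(df, pos, slots):
    """Projected value of the best player still available once the starters are gone."""
    pool = df[df['Position'] == pos].nlargest(slots + 1, 'FantasyValue')
    return pool['FantasyValue'].iloc[-1]

repl = {pos: replacement_level(player_df, pos, n) for pos, n in ROSTER_SLOTS.items()}
player_df['DVOR'] = player_df['FantasyValue'] - player_df['Position'].map(repl)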

Once built, export to Google Sheets with gspread and sync live updates.
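A minimal gspread sketch (assumes a service-account credentials file and an existing sheet named 'Draft Board'):

import gspread

gc = gspread.service_account(filename='credentials.json')
ws = gc.open('Draft Board').sheet1

# Push the full board: header row plus one row per player
ws.update([player_df.columns.tolist()] + player_df.astype(str).values.tolist())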


🏆 The Strik3 Ethos

What separates the top 1% from the rest? Obsession. Obsession with numbers, with trends, with signal hidden under a thousand decibels of noise. We don’t draft by feel. We don’t trust the experts. We build the model.

You can create the most accurate pre-draft rankings in your league. You can dominate the auction table. You can win.

You just have to out-code the competition.


🚨 Join the FBCS Fantasy League Now 🚨

Think you’re elite?
The Fantasy Baseball Championship Series (FBCS) is calling out the boldest GMs in the game.
$250 entry. $26,250 championship pool. No standard rankings. No shortcuts.
Every team must draft from their own custom board.

🏆 Sign up now and claim your spot before the leagues fill!


