Build Your Own Custom Dataset
Create a custom training dataset with EODHD¶
Learn how to create custom crypto training datasets (features and targets) from scratch. This tutorial demonstrates the full pipeline from data download to ML-ready features and modeling.
What you'll learn:
- Download historical crypto data with numerblox and eodhd
- Engineer cross-sectional features with centimators
- Create ranking targets for prediction
- Validate data quality
- Structure datasets for ML
Why build custom datasets? While CrowdCent provides training data, building your own allows you to:
- Experiment with different features and time windows
- Include additional data sources
- Test hypotheses on historical data
- Develop a unique edge in predictions
Get EODHD API Access¶
Special offer for CrowdCent users: Get 10% off all EODHD plans 👉 https://eodhd.com/pricing-special-10?via=crowdcent
Recommended plan: EOD Historical Data - All World ($17.99/mo), which provides 100k API calls/day, enough for most use cases
!pip install crowdcent-challenge numerblox eod centimators polars altair vegafusion vl-convert-python
import os
import polars as pl
from datetime import datetime
import requests
# Set your EODHD API key
EOD_API_KEY = "YOUR_EODHD_API_KEY_HERE" # Get from https://eodhd.com/cp/dashboard
if EOD_API_KEY == "YOUR_EODHD_API_KEY_HERE":
    print("⚠️ Please set your EODHD API key above")
    print("Get one free at: https://eodhd.com/pricing-special-10?via=crowdcent")
Step 1: Fetch Live Hyperliquid Universe¶
We'll get the current cryptocurrency universe directly from the Hyperliquid API. This includes both active and delisted perpetuals, giving us maximum historical data coverage.
def get_hyperliquid_perpetuals(include_delisted: bool = False):
    """Get perpetual IDs as a polars DataFrame.

    Args:
        include_delisted: If True, include delisted perpetuals in the output
    """
    url = "https://api.hyperliquid.xyz/info"
    payload = {"type": "meta"}
    special_map_dict = {
"POPCAT-USD.CC": "POPCAT28782-USD.CC",
"VVV-USD.CC": "VVV.CC",
"BRETT-USD.CC": "BRETT29743-USD.CC",
"UNIBOT-USD.CC": "UNIBOT27009-USD.CC",
"ZRO-USD.CC": "ZRO26997-USD.CC",
"MOVE-USD.CC": "MOVE32452-USD.CC",
"STG-USD.CC": "STG18934-USD.CC",
"GOAT-USD.CC": "GOAT33440-USD.CC",
"PEPE-USD.CC": "PEPE24478-USD.CC",
"PROMPT-USD.CC": "PROMPT-USD.CC",
"NIL-USD.CC": "NIL35702-USD.CC",
"MNT-USD.CC": "MNT27075-USD.CC",
"ACE-USD.CC": "ACE28674-USD.CC",
"HYPE-USD.CC": "HYPE32196-USD.CC",
"IMX-USD.CC": "IMX10603-USD.CC",
"INIT-USD.CC": "INIT-USD.CC",
"PURR-USD.CC": "PURR34332-USD.CC",
"MOODENG-USD.CC": "MOODENG33093-USD.CC",
"CHILLGUY-USD.CC": "CHILLGUY-USD.CC",
"FARTCOIN-USD.CC": "FARTCOIN-USD.CC",
"GRASS-USD.CC": "GRASS32956-USD.CC",
"GRIFFAIN-USD.CC": "GRIFFAIN-USD.CC",
"MELANIA-USD.CC": "MELANIA35347-USD.CC",
"KAITO-USD.CC": "KAITO-USD.CC",
"SUI-USD.CC": "SUI20947-USD.CC",
"BERA-USD.CC": "BERA-USD.CC",
"MEW-USD.CC": "MEW30126-USD.CC",
"ANIME-USD.CC": "ANIME35319-USD.CC",
"NEIRO-USD.CC": "NEIRO32521-USD.CC",
"DOGS-USD.CC": "DOGS32698-USD.CC",
"STX-USD.CC": "STX4847-USD.CC",
"S-USD.CC": "S32684-USD.CC",
"COMP-USD.CC": "COMP5692-USD.CC",
"TRUMP-USD.CC": "TRUMP-OFFICIAL-USD.CC",
"BLAST-USD.CC": "BLAST28480-USD.CC",
"TAO-USD.CC": "TAO22974-USD.CC",
"SAGA-USD.CC": "SAGA30372-USD.CC",
"TON-USD.CC": "TON11419-USD.CC",
"BIO-USD.CC": "BIO.CC",
"GMX-USD.CC": "GMX11857-USD.CC",
"NTRN-USD.CC": "NTRN26680-USD.CC",
"SUPER-USD.CC": "SUPER8290-USD.CC",
"SCR-USD.CC": "SCR26998-USD.CC",
"BANANA-USD.CC": "BANANA28066-USD.CC",
"ME-USD.CC": "ME32197-USD.CC",
"GMT-USD.CC": "GMT18069-USD.CC",
"IO-USD.CC": "IO29835-USD.CC",
"ZK-USD.CC": "ZKSYNC.CC",
"ALT-USD.CC": "ALT29073-USD.CC",
"POL-USD.CC": "POL28321-USD.CC",
"WCT-USD.CC": "WCT33152-USD.CC",
"XAI-USD.CC": "XAI28933-USD.CC",
"JUP-USD.CC": "JUP29210-USD.CC",
"APE-USD.CC": "APE3-USD.CC",
"SPX-USD.CC": "SPX28081-USD.CC",
"HYPER-USD.CC": "HYPER36281-USD.CC",
"IP-USD.CC": "IP-USD.CC",
"ZORA-USD.CC": "ZORA35931-USD.CC",
"PEOPLE-USD.CC": "PEOPLE-USD.CC",
"BABY-USD.CC": "BABY32198-USD.CC",
"ARB-USD.CC": "ARB11841-USD.CC",
"UNI-USD.CC": "UNI7083-USD.CC",
"OMNI-USD.CC": "OMNI30315-USD.CC",
"SOPH-USD.CC": "SOPHON-USD.CC",
"NEIROETH-USD.CC": "NEIRO-USD.CC",
"APT-USD.CC": "APT21794-USD.CC",
"STRK-USD.CC": "STRK22691-USD.CC",
"RESOLV-USD.CC": "RESOLV-USD.CC",
"TST-USD.CC": "TST35647-USD.CC",
"PUMP-USD.CC": "PUMP29601-USD.CC",
"WLFI-USD.CC": "WLFI33251-USD.CC",
"ASTER-USD.CC": "ASTER36341-USD.CC",
"SKY-USD.CC": "SKY33038-USD.CC",
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    data = response.json()
    universe = data.get("universe", [])

    # Extract names of perpetuals based on delisted status
    perpetual_ids = [
        perp["name"]
        for perp in universe
        if include_delisted or not perp.get("isDelisted", False)
    ]

    # Create DataFrame with just id column, then derive the EODHD symbol
    df = pl.DataFrame({"id": perpetual_ids})
    df = df.with_columns(
        (pl.col("id") + "-USD.CC")
        .str.replace("k", "")  # strip Hyperliquid's lowercase "k" prefix (e.g. kPEPE) before mapping
        .replace(special_map_dict)
        .alias("eodhd_id")
    )
    return df
# Get the Hyperliquid universe
perpetuals_df = get_hyperliquid_perpetuals(include_delisted=True)
print(perpetuals_df)
# Extract ticker lists
eodhd_tickers = perpetuals_df["eodhd_id"].to_list()
id_mapping = dict(perpetuals_df.select("eodhd_id", "id").iter_rows())
print(f"\n📊 Dataset will include {len(eodhd_tickers)} cryptocurrencies")
print(f"Sample tickers: {eodhd_tickers[:10]}")
print(f"Sample IDs: {list(id_mapping.values())[:10]}")
shape: (216, 2)
┌───────┬──────────────┐
│ id    ┆ eodhd_id     │
│ ---   ┆ ---          │
│ str   ┆ str          │
╞═══════╪══════════════╡
│ BTC   ┆ BTC-USD.CC   │
│ ETH   ┆ ETH-USD.CC   │
│ ATOM  ┆ ATOM-USD.CC  │
│ MATIC ┆ MATIC-USD.CC │
│ DYDX  ┆ DYDX-USD.CC  │
│ …     ┆ …            │
│ HEMI  ┆ HEMI-USD.CC  │
│ APEX  ┆ APEX-USD.CC  │
│ 2Z    ┆ 2Z-USD.CC    │
│ ZEC   ┆ ZEC-USD.CC   │
│ MON   ┆ MON-USD.CC   │
└───────┴──────────────┘

📊 Dataset will include 216 cryptocurrencies
Sample tickers: ['BTC-USD.CC', 'ETH-USD.CC', 'ATOM-USD.CC', 'MATIC-USD.CC', 'DYDX-USD.CC', 'SOL-USD.CC', 'AVAX-USD.CC', 'BNB-USD.CC', 'APE3-USD.CC', 'OP-USD.CC']
Sample IDs: ['BTC', 'ETH', 'ATOM', 'MATIC', 'DYDX', 'SOL', 'AVAX', 'BNB', 'APE', 'OP']
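Before downloading, it's worth sanity-checking the symbol mapping: if two Hyperliquid ids ever resolved to the same EODHD symbol, their price histories would silently merge. A quick check, as a sketch:

```python
# Sketch: make sure no two Hyperliquid ids map to the same EODHD symbol.
dupes = (
    perpetuals_df.group_by("eodhd_id")
    .agg(pl.col("id"))
    .filter(pl.col("id").list.len() > 1)
)
print("✅ No duplicate EODHD symbols" if dupes.is_empty() else dupes)
```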
Step 2: Download Historical Data with numerblox¶
from numerblox.download import EODDownloader
# Set date range
start_date = "20200101"
end_date = datetime.now()
# Initialize EOD downloader
eod = EODDownloader(
directory_path="data",
key=EOD_API_KEY,
tickers=eodhd_tickers
)
eod.end_date = end_date
print("Downloading historical data...")
print("This may take a few minutes depending on number of tickers")
# Download data
eod.download_training_data(start=start_date)
# Load the downloaded data
filename = f"data/eod_{start_date}_{end_date.strftime('%Y%m%d')}.parquet"
eod_df = pl.read_parquet(filename)
eod_df = eod_df.with_columns(pl.col("date").str.to_datetime())
# Add clean ID column using Hyperliquid naming
eod_df = eod_df.with_columns(pl.col("ticker").replace(id_mapping).alias("id"))
# Check coverage
requested_tickers = len(eodhd_tickers)
downloaded_tickers = eod_df["ticker"].n_unique()
coverage_pct = (downloaded_tickers / requested_tickers) * 100
print(f"✅ Downloaded {len(eod_df)} rows for {downloaded_tickers} tickers")
print(f"📊 Coverage: {downloaded_tickers}/{requested_tickers} tickers ({coverage_pct:.1f}%)")
print(f"📅 Date range: {eod_df['date'].min()} to {eod_df['date'].max()}")
eod_df.head()
Downloading historical data...
This may take a few minutes depending on number of tickers

EOD price data extraction: 0%| | 0/216 [00:00<?, ?it/s]

WARNING: Date pull failed on ticker: 'HPOS-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'FRIEND-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'OX-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'SHIA-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'CANTO-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'NFTI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'ALT29073-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'PANDORA-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'AI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'TST35647-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'APEX-USD.CC'. Exception: "None of ['date'] are in the columns"
WARNING: Date pull failed on ticker: 'HEMI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: '2Z-USD.CC'. Exception: 404 Client Error: Not Found

✅ Downloaded 219879 rows for 203 tickers
📊 Coverage: 203/216 tickers (94.0%)
📅 Date range: 2020-01-01 00:00:00 to 2025-10-08 00:00:00
open | high | low | close | adjusted_close | volume | ticker | date | id |
---|---|---|---|---|---|---|---|---|
f64 | f64 | f64 | f64 | f64 | i64 | str | datetime[μs] | str |
1.796991 | 1.796991 | 1.104477 | 1.331082 | 1.331082 | 2314949442 | "ARB11841-USD.CC" | 2023-03-23 00:00:00 | "ARB" |
1.325396 | 1.555872 | 1.18606 | 1.272492 | 1.272492 | 2537709581 | "ARB11841-USD.CC" | 2023-03-24 00:00:00 | "ARB" |
1.272393 | 1.307232 | 1.19297 | 1.224705 | 1.224705 | 1294894243 | "ARB11841-USD.CC" | 2023-03-25 00:00:00 | "ARB" |
1.224117 | 1.341853 | 1.208092 | 1.283315 | 1.283315 | 1059587959 | "ARB11841-USD.CC" | 2023-03-26 00:00:00 | "ARB" |
1.282521 | 1.320275 | 1.124495 | 1.162705 | 1.162705 | 1014240603 | "ARB11841-USD.CC" | 2023-03-27 00:00:00 | "ARB" |
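The 404 warnings above are expected for very new or renamed symbols. To see exactly which requested symbols never came back (candidates for a new special_map_dict entry), a quick sketch:

```python
# Sketch: list the requested EODHD symbols that are missing from the download.
downloaded = set(eod_df["ticker"].unique().to_list())
missing = sorted(set(eodhd_tickers) - downloaded)
print(f"Missing {len(missing)} of {len(eodhd_tickers)} requested tickers:")
print(missing)
```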
Visualize Raw Price Data¶
Let's look at what the raw downloaded data looks like - this is the "Input" to our feature engineering pipeline.
import altair as alt
alt.data_transformers.enable('vegafusion')
# Select top tickers by data coverage for visualization
top_tickers = (
eod_df.group_by("id")
.agg(pl.col("date").count().alias("count"))
.sort("count", descending=True)
.head(20)["id"]
.to_list()
)
# Create visualization of raw prices
viz_df = eod_df.filter(pl.col("id").is_in(top_tickers)).to_pandas()
chart = (
alt.Chart(viz_df)
.mark_line(opacity=0.6)
.encode(
x=alt.X("date:T", title="Date"),
y=alt.Y("close:Q", title="Close Price (USD)", scale=alt.Scale(type="log")),
color=alt.Color("id:N", title="Ticker", legend=alt.Legend(columns=2)),
tooltip=["id:N", "date:T", "close:Q"]
)
.properties(
width=700,
height=300,
title="Input: Raw Stock Prices Over Time (Log Scale)"
)
)
chart
Step 3: Create Ranking Targets with CrowdCent Library¶
We'll use the create_ranking_targets function from crowdcent-challenge to create standard 10-day and 30-day ranking targets. This matches exactly what CrowdCent uses in the hyperliquid-ranking challenge.
from crowdcent_challenge.scoring import create_ranking_targets
# Create standard 10d and 30d ranking targets using CrowdCent's official function
print("🎯 Creating ranking targets using CrowdCent's methodology...")
df = create_ranking_targets(
eod_df,
horizons=[10, 30], # Standard CrowdCent horizons
price_col="close",
date_col="date",
ticker_col="ticker",
return_raw_returns=False, # We only need the normalized targets
drop_incomplete=True # Drop rows without complete targets
)
print(f"✅ Created targets: target_10d, target_30d")
print(f"📊 Rows with complete targets: {len(df):,}")
# Show sample targets
print("\nSample targets:")
df.select(["date", "ticker", "id", "target_10d", "target_30d"]).tail(10)
🎯 Creating ranking targets using CrowdCent's methodology...
✅ Created targets: target_10d, target_30d
📊 Rows with complete targets: 213,841

Sample targets:
date | ticker | id | target_10d | target_30d |
---|---|---|---|---|
datetime[μs] | str | str | f64 | f64 |
2025-08-30 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.974093 | 0.963731 |
2025-08-31 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.958549 | 0.994819 |
2025-09-01 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.92228 | 0.989637 |
2025-09-02 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.782383 | 1.0 |
2025-09-03 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.849741 | 1.0 |
2025-09-04 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.92228 | 1.0 |
2025-09-05 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.906736 | 1.0 |
2025-09-06 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.953368 | 1.0 |
2025-09-07 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.611399 | 1.0 |
2025-09-08 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.378238 | 1.0 |
Visualize Target Distributions¶
The ranking targets are normalized to [0, 1] where 0 means worst performer and 1 means best performer on that day. This cross-sectional ranking approach means our model predicts relative performance, not absolute returns.
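To make the target construction concrete, here is an illustrative sketch of cross-sectional rank normalization for a single date. It mirrors the idea behind create_ranking_targets (rank forward returns across assets, scale to [0, 1]) but is not necessarily its exact implementation:

```python
# Illustrative sketch: cross-sectional rank target for one date.
one_day = pl.DataFrame({
    "id": ["BTC", "ETH", "SOL", "DOGE"],
    "fwd_return_10d": [0.05, -0.02, 0.12, 0.01],  # hypothetical 10-day forward returns
})
one_day = one_day.with_columns(
    ((pl.col("fwd_return_10d").rank() - 1) / (pl.len() - 1)).alias("target_10d_sketch")
)
print(one_day)  # SOL (best return) -> 1.0, ETH (worst) -> 0.0
```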
# Sample recent data for visualization
sample_dates = df["date"].unique().sort().tail(30).to_list()
target_viz_df = df.filter(pl.col("date").is_in(sample_dates)).to_pandas()
# Create histograms for both targets
hist_10d = (
alt.Chart(target_viz_df)
.mark_bar(opacity=0.7)
.encode(
x=alt.X("target_10d:Q", bin=alt.Bin(maxbins=30), title="Target 10d Value"),
y=alt.Y("count()", title="Count"),
tooltip=["count()"]
)
.properties(
width=350,
height=200,
title="10-Day Target Distribution (Last 30 Days)"
)
)
hist_30d = (
alt.Chart(target_viz_df)
.mark_bar(opacity=0.7, color="orange")
.encode(
x=alt.X("target_30d:Q", bin=alt.Bin(maxbins=30), title="Target 30d Value"),
y=alt.Y("count()", title="Count"),
tooltip=["count()"]
)
.properties(
width=350,
height=200,
title="30-Day Target Distribution (Last 30 Days)"
)
)
alt.hconcat(hist_10d, hist_30d)
Step 4: Engineer features with centimators¶
This is the step where you can and should put most of your energy. Try different feature engineering strategies, different moving average windows, lag windows, and anything else you can think of.
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import (
LogReturnTransformer,
RankTransformer,
LagTransformer,
MovingAverageTransformer,
)
# Enable metadata routing for sklearn
set_config(enable_metadata_routing=True)
print("🔧 Building feature engineering pipeline...")
# Define transformers with custom parameters
log_return_transformer = LogReturnTransformer().set_transform_request(ticker_series=True)
ranker = RankTransformer().set_transform_request(date_series=True)
ma_transformer = MovingAverageTransformer(
windows=[2, 10] # Custom moving average windows
).set_transform_request(ticker_series=True)
lagger = LagTransformer(windows=[0, 5, 10, 15, 20]).set_transform_request(ticker_series=True) # Custom lag windows
# Create feature pipeline
feature_pipeline = make_pipeline(
log_return_transformer,
ranker,
ma_transformer,
lagger,
verbose=True
)
feature_pipeline
🔧 Building feature engineering pipeline...
Pipeline(steps=[('logreturntransformer', LogReturnTransformer()),
                ('ranktransformer', RankTransformer()),
                ('movingaveragetransformer', MovingAverageTransformer(windows=[2, 10])),
                ('lagtransformer', LagTransformer(windows=[20, 15, 10, 5, 0]))],
         verbose=True)
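The windows above are just one configuration. The same transformers accept other window sets, so a natural experiment is to build variants and compare them; for example (the values here are arbitrary):

```python
# Sketch: a variant pipeline with longer moving averages and weekly-spaced lags.
alt_pipeline = make_pipeline(
    LogReturnTransformer().set_transform_request(ticker_series=True),
    RankTransformer().set_transform_request(date_series=True),
    MovingAverageTransformer(windows=[5, 20, 60]).set_transform_request(ticker_series=True),
    LagTransformer(windows=[0, 7, 14, 21]).set_transform_request(ticker_series=True),
)
# Fit/transform it exactly like feature_pipeline in the next cell.
```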
# Apply feature engineering
input_features = ["close", "open", "high", "low", "volume"]
print(f"Transforming {input_features} into features...")
feature_df = feature_pipeline.fit_transform(
df[input_features],
date_series=df["date"],
ticker_series=df["ticker"]
)
# Get feature names and add to dataframe
feature_names = feature_pipeline.get_feature_names_out()
df = df.with_columns(feature_df).drop_nulls(subset=feature_names)
print(f"\n✅ Created {len(feature_names)} features")
print(f"Final dataset shape: {df.shape}")
print(f"\nFeature names:")
for i, name in enumerate(feature_names[:10]):
    print(f" {name}")
if len(feature_names) > 10:
    print(f" ... and {len(feature_names) - 10} more")
df.head()
Transforming ['close', 'open', 'high', 'low', 'volume'] into features...
[Pipeline] (step 1 of 4) Processing logreturntransformer, total= 0.0s
[Pipeline] ... (step 2 of 4) Processing ranktransformer, total= 0.0s
[Pipeline] (step 3 of 4) Processing movingaveragetransformer, total= 0.0s
[Pipeline] .... (step 4 of 4) Processing lagtransformer, total= 0.0s

✅ Created 50 features
Final dataset shape: (207972, 61)

Feature names:
 close_logreturn_rank_ma2_lag20
 close_logreturn_rank_ma10_lag20
 open_logreturn_rank_ma2_lag20
 open_logreturn_rank_ma10_lag20
 high_logreturn_rank_ma2_lag20
 high_logreturn_rank_ma10_lag20
 low_logreturn_rank_ma2_lag20
 low_logreturn_rank_ma10_lag20
 volume_logreturn_rank_ma2_lag20
 volume_logreturn_rank_ma10_lag20
 ... and 40 more
df.head() preview (5 rows × 61 columns): the original OHLCV columns plus ticker, date, id, target_10d, target_30d, and the 50 engineered feature columns (close_logreturn_rank_ma2_lag20 through volume_logreturn_rank_ma10_lag0), each an f64 in [0, 1]. Full-width table omitted.
Visualize Pipeline Output: Engineered Features¶
After feature engineering, all features are normalized and cross-sectionally ranked. This makes them comparable across different tickers and time periods. Let's visualize a few features for our top tickers.
# Select a few representative features to visualize
sample_features = [
"close_logreturn_rank_ma10_lag0", # Most recent smoothed close feature
"volume_logreturn_rank_ma10_lag0", # Most recent smoothed volume feature
]
# Filter to top tickers and recent dates for cleaner visualization
recent_dates = df["date"].unique().sort().tail(100).to_list()
feature_viz_df = (
df.filter(pl.col("id").is_in(top_tickers[:10]))
.filter(pl.col("date").is_in(recent_dates))
.select(["id", "date"] + sample_features)
.to_pandas()
)
# Melt for plotting multiple features
feature_viz_melted = feature_viz_df.melt(
id_vars=["id", "date"],
value_vars=sample_features,
var_name="feature",
value_name="value"
)
# Create feature visualization
feature_chart = (
alt.Chart(feature_viz_melted)
.mark_line(opacity=0.6)
.encode(
x=alt.X("date:T", title="Date"),
y=alt.Y("value:Q", title="Normalized Feature Value [0, 1]", scale=alt.Scale(domain=[0, 1])),
color=alt.Color("id:N", title="Ticker"),
strokeDash=alt.StrokeDash("feature:N", title="Feature Type"),
tooltip=["id:N", "date:T", "feature:N", "value:Q"]
)
.properties(
width=700,
height=300,
title="Pipeline Output: Normalized/Smoothed Features (Recent 100 Days)"
)
)
feature_chart
What changed after the pipeline:
- All features now range from 0 to 1 (normalized through ranking)
- Different price levels no longer matter - we're comparing relative performance
- Features are smoothed (moving averages reduce noise)
- The data is now ready for machine learning models to find patterns
This transformation is crucial for cross-sectional ranking models where we predict which assets will outperform others, not their absolute price movements.
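Because every feature is rank-based, a cheap data-quality check is to confirm the engineered columns really do stay inside [0, 1] and contain no nulls after the drop_nulls step; a minimal sketch:

```python
# Sketch: data-quality check on the engineered features.
feat = df.select(list(feature_names))
print("global min:", feat.min().to_numpy().min())
print("global max:", feat.max().to_numpy().max())
print("total nulls:", feat.null_count().to_numpy().sum())
```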
Step 5: Structure Dataset for ML¶
# Create final dataset structure
final_df = (
df.rename({"ticker": "eodhd_id"})
.select([
"id",
"eodhd_id",
"date"
] + list(feature_names) + [
"target_10d",
"target_30d"
])
)
print(f"📊 Final Dataset Summary:")
print(f"Shape: {final_df.shape}")
print(f"Date range: {final_df['date'].min()} to {final_df['date'].max()}")
print(f"Cryptocurrencies: {final_df['id'].n_unique()}")
print(f"Features: {len(feature_names)}")
print(f"Targets: 2 (target_10d, target_30d)")
# Show sample with first few features
sample_cols = ["id", "date"] + list(feature_names)[:3] + ["target_10d", "target_30d"]
print(f"\nSample data:")
final_df.select(sample_cols).head()
📊 Final Dataset Summary:
Shape: (207972, 55)
Date range: 2020-02-01 00:00:00 to 2025-09-09 00:00:00
Cryptocurrencies: 195
Features: 50
Targets: 2 (target_10d, target_30d)

Sample data:
id | date | close_logreturn_rank_ma2_lag20 | close_logreturn_rank_ma10_lag20 | open_logreturn_rank_ma2_lag20 | target_10d | target_30d |
---|---|---|---|---|---|---|
str | datetime[μs] | f64 | f64 | f64 | f64 | f64 |
"ARB" | 2023-04-23 00:00:00 | 0.063158 | 0.351579 | 0.026316 | 0.14433 | 0.185567 |
"ARB" | 2023-04-24 00:00:00 | 0.468421 | 0.416842 | 0.057895 | 0.608247 | 0.474227 |
"ARB" | 2023-04-25 00:00:00 | 0.668421 | 0.38 | 0.468421 | 0.587629 | 0.360825 |
"ARB" | 2023-04-26 00:00:00 | 0.373684 | 0.402105 | 0.668421 | 0.546392 | 0.350515 |
"ARB" | 2023-04-27 00:00:00 | 0.347368 | 0.362105 | 0.373684 | 0.216495 | 0.42268 |
Step 6: Save Dataset¶
# Save to parquet
output_path = "custom_crypto_dataset.parquet"
final_df.write_parquet(output_path)
# Get file size, why not?
file_size_mb = os.path.getsize(output_path) / 1024 / 1024
print(f"✅ Dataset saved to: {output_path}")
print(f"File size: {file_size_mb:.2f} MB")
print(f"Rows: {len(final_df):,}")
print(f"\nYou can now use this dataset for ML experiments!")
✅ Dataset saved to: custom_crypto_dataset.parquet
File size: 49.20 MB
Rows: 207,972

You can now use this dataset for ML experiments!
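As a quick round-trip check, you can reload the file and confirm that shape and columns survived the write; a minimal sketch:

```python
# Sketch: verify the parquet round-trip.
reloaded = pl.read_parquet(output_path)
assert reloaded.shape == final_df.shape, "row/column count changed on reload"
assert reloaded.columns == final_df.columns, "column names/order changed on reload"
print("✅ Round-trip OK:", reloaded.shape)
```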
Step 7: Quick Model Validation¶
Let's verify our dataset works with a simple train-test split and xgboost model to ensure data quality.
# Prepare features
feature_cols = list(feature_names)
print(f"Feature columns: {len(feature_cols)} features")
# Time-based split with embargo period
embargo_days = 30
sorted_dates = final_df["date"].unique().sort()
split_idx = int(len(sorted_dates) * 0.8)
split_date = sorted_dates[split_idx]
embargo_end = split_date + pl.duration(days=embargo_days)
print(f"Split date: {split_date}")
print(f"Embargo: {embargo_days} days")
train_df = final_df.filter(pl.col("date") < split_date)
test_df = final_df.filter(pl.col("date") > embargo_end)
print(f"Train period: {train_df['date'].min()} to {train_df['date'].max()} ({len(train_df)} rows)")
print(f"Test period: {test_df['date'].min()} to {test_df['date'].max()} ({len(test_df)} rows)")
Feature columns: 50 features
Split date: 2024-07-27 00:00:00
Embargo: 30 days
Train period: 2020-02-01 00:00:00 to 2024-07-26 00:00:00 (138679 rows)
Test period: 2024-08-27 00:00:00 to 2025-09-09 00:00:00 (64829 rows)
from xgboost import XGBRegressor
from crowdcent_challenge.scoring import evaluate_hyperliquid_submission
# Train simple, untuned multi-output XGBoost model for both 10d and 30d targets
model = XGBRegressor(n_estimators=500, random_state=42, verbosity=0, device="cuda")
X_train = train_df[feature_cols].to_pandas()
y_train = train_df[["target_10d", "target_30d"]].to_pandas()
print(f"Training on {X_train.shape} features, {y_train.shape} targets")
print("This may take a few minutes depending on number of features, model parameters, GPU, etc...")
model.fit(X_train, y_train)
# Make predictions for both horizons
X_test = test_df[feature_cols].to_pandas()
test_preds = model.predict(X_test)
# Extract predictions for both horizons and create predictions dataframe
test_results = test_df.select(["date", "id", "target_10d", "target_30d"]).with_columns([
pl.Series("pred_10d", test_preds[:, 0]),
pl.Series("pred_30d", test_preds[:, 1])
])
# Evaluate per date (cross-sectionally) then average
daily_scores = (
test_results
.group_by("date")
.agg([
pl.col("target_10d"),
pl.col("pred_10d"),
pl.col("target_30d"),
pl.col("pred_30d")
])
.with_columns(
pl.struct(["pred_10d", "target_10d", "pred_30d", "target_30d"])
.map_elements(
lambda x: evaluate_hyperliquid_submission(
y_true_10d=x["target_10d"],
y_pred_10d=x["pred_10d"],
y_true_30d=x["target_30d"],
y_pred_30d=x["pred_30d"],
),
return_dtype=pl.Struct,
)
.alias("metrics")
)
.unnest("metrics")
.sort("date")
)
# Average scores across all test dates
metrics = ["spearman_10d", "spearman_30d", "ndcg@40_10d", "ndcg@40_30d"]
avg_scores = daily_scores.select([pl.col(metric).mean() for metric in metrics]).to_dicts()[0]
print(f"\n📈 Model Validation Results (CrowdCent Official Metrics):")
print(f"Spearman Correlation 10d: {avg_scores['spearman_10d']:.4f}")
print(f"Spearman Correlation 30d: {avg_scores['spearman_30d']:.4f}")
print(f"NDCG@40 10d: {avg_scores['ndcg@40_10d']:.4f}")
print(f"NDCG@40 30d: {avg_scores['ndcg@40_30d']:.4f}")
print(f"\n🎯 Your custom dataset is ready for advanced modeling!")
Training on (138679, 50) features, (138679, 2) targets
This may take a few minutes depending on number of features, model parameters, GPU, etc...

📈 Model Validation Results (CrowdCent Official Metrics):
Spearman Correlation 10d: 0.1805
Spearman Correlation 30d: 0.1027
NDCG@40 10d: 0.6406
NDCG@40 30d: 0.6017

🎯 Your custom dataset is ready for advanced modeling!
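Averages can hide unstable stretches, so it's worth plotting the per-date scores in daily_scores over the test period; a sketch using the same Altair setup as earlier:

```python
# Sketch: daily Spearman correlation over the test period.
stability_df = daily_scores.select(["date", "spearman_10d", "spearman_30d"]).to_pandas()
stability_chart = (
    alt.Chart(stability_df.melt(id_vars="date", var_name="metric", value_name="score"))
    .mark_line(opacity=0.7)
    .encode(
        x=alt.X("date:T", title="Date"),
        y=alt.Y("score:Q", title="Daily Spearman"),
        color=alt.Color("metric:N", title="Horizon"),
        tooltip=["date:T", "metric:N", "score:Q"],
    )
    .properties(width=700, height=250, title="Daily Spearman (Test Period)")
)
stability_chart
```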
Next Steps¶
🎉 Congratulations! You've built a complete crypto prediction dataset. Here's what to try next, although the list is endless:
Ready to submit predictions with your new dataset? Check out the Hyperliquid End-to-End Tutorial, where you'll learn how to download inference data, make predictions, and submit to CrowdCent competitions using your custom model. You'll need to replace the data download steps with your own production pipeline built on this custom dataset!
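As a rough bridge to that tutorial, scoring the most recent rows of your custom dataset looks something like the sketch below. Note that drop_incomplete=True removed the newest dates (they have no complete forward targets yet), so a real production pipeline must recompute features without targets for the live date; the pred_10d/pred_30d column names simply mirror the validation code above, not necessarily the official submission schema.

```python
# Rough sketch: score the latest date present in the custom dataset.
latest_date = final_df["date"].max()
latest = final_df.filter(pl.col("date") == latest_date)
latest_preds = model.predict(latest[feature_cols].to_pandas())
submission_sketch = latest.select("id").with_columns([
    pl.Series("pred_10d", latest_preds[:, 0]),
    pl.Series("pred_30d", latest_preds[:, 1]),
])
print(submission_sketch.head())  # see the end-to-end tutorial for the exact submission format
```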
Immediate experiments and next steps:
- Try different models: Random Forest, Neural Networks, and other model estimators from Centimators like LSTMRegressor
- Adjust time horizons: Create 3d, 14d, or 50d targets (see the sketch after this list)
- Additional data cleaning: Check the raw data for anomalies and remove or adjust samples
- Feature engineering: Add lags, differences, and other feature combinations. Centimators provides additional feature transformers like GroupStatsTransformer and will keep adding more!
- Cross-validation: Use walk-forward, time-series CV for robust validation like in this advanced tutorial notebook
- More data sources: Add social sentiment, onchain data, or macro indicators
- Ensemble methods: Combine multiple models and time horizons
- Real-time pipeline: Set up automated data updates and retraining
- Automate submissions: Use techniques from the submission automation guide to set it and forget it
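For example, the "adjust time horizons" item above can reuse create_ranking_targets from Step 3 directly; a minimal sketch (the target_3d/target_14d/target_50d column names are assumed to follow the target_{h}d pattern seen earlier):

```python
# Sketch: alternative-horizon targets with the same function used in Step 3.
df_alt = create_ranking_targets(
    eod_df,
    horizons=[3, 14, 50],
    price_col="close",
    date_col="date",
    ticker_col="ticker",
    return_raw_returns=False,
    drop_incomplete=True,
)
print(df_alt.tail())  # expect target_3d, target_14d, target_50d columns
```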
Other Resources¶
- EODHD API: https://eodhd.com/pricing-special-10?via=crowdcent
- Centimators GitHub: https://github.com/crowdcent/centimators
- CrowdCent Docs: https://docs.crowdcent.com
- Join our Discord: Join here to get help and share your results with the community
Happy modeling! 🧪