Build Your Own Custom Dataset
Create a custom training dataset with EODHD¶
Learn how to create custom crypto training datasets (features and targets) from scratch. This tutorial demonstrates the full pipeline from data download to ML-ready features and modeling.
What you'll learn:
- Download historical crypto data with numerblox and eodhd
- Engineer cross-sectional features with centimators
- Create ranking targets for prediction
- Validate data quality
- Structure datasets for ML
Why build custom datasets? While CrowdCent provides training data, building your own allows you to:
- Experiment with different features and time windows
- Include additional data sources
- Test hypotheses on historical data
- Develop a unique edge in predictions
Get EODHD API Access¶
Special offer for CrowdCent users: Get 10% off all EODHD plans 👉 https://eodhd.com/pricing-special-10?via=crowdcent
Recommended plan: EOD Historical Data - All World ($17.99/mo), which provides 100k API calls/day, enough for most use cases
!pip install crowdcent-challenge numerblox eod centimators polars altair vegafusion vl-convert-python
import os
import polars as pl
from datetime import datetime
import requests
# Set your EODHD API key
EOD_API_KEY = "YOUR_EODHD_API_KEY_HERE" # Get from https://eodhd.com/cp/dashboard
if EOD_API_KEY == "YOUR_EODHD_API_KEY_HERE":
    print("⚠️ Please set your EODHD API key above")
    print("Get one free at: https://eodhd.com/pricing-special-10?via=crowdcent")
Step 1: Fetch Live Hyperliquid Universe¶
We'll get the current cryptocurrency universe directly from the Hyperliquid API. This includes both active and delisted perpetuals, giving us maximum historical data coverage.
def get_hyperliquid_perpetuals(include_delisted: bool = False):
    """Get perpetual IDs as a polars DataFrame.

    Args:
        include_delisted: If True, include delisted perpetuals in the output
    """
    url = "https://api.hyperliquid.xyz/info"
    payload = {"type": "meta"}
    special_map_dict = {
"POPCAT-USD.CC": "POPCAT28782-USD.CC",
"VVV-USD.CC": "VVV.CC",
"BRETT-USD.CC": "BRETT29743-USD.CC",
"UNIBOT-USD.CC": "UNIBOT27009-USD.CC",
"ZRO-USD.CC": "ZRO26997-USD.CC",
"MOVE-USD.CC": "MOVE32452-USD.CC",
"STG-USD.CC": "STG18934-USD.CC",
"GOAT-USD.CC": "GOAT33440-USD.CC",
"PEPE-USD.CC": "PEPE24478-USD.CC",
"PROMPT-USD.CC": "PROMPT-USD.CC",
"NIL-USD.CC": "NIL35702-USD.CC",
"MNT-USD.CC": "MNT27075-USD.CC",
"ACE-USD.CC": "ACE28674-USD.CC",
"HYPE-USD.CC": "HYPE32196-USD.CC",
"IMX-USD.CC": "IMX10603-USD.CC",
"INIT-USD.CC": "INIT-USD.CC",
"PURR-USD.CC": "PURR34332-USD.CC",
"MOODENG-USD.CC": "MOODENG33093-USD.CC",
"CHILLGUY-USD.CC": "CHILLGUY-USD.CC",
"FARTCOIN-USD.CC": "FARTCOIN-USD.CC",
"GRASS-USD.CC": "GRASS32956-USD.CC",
"GRIFFAIN-USD.CC": "GRIFFAIN-USD.CC",
"MELANIA-USD.CC": "MELANIA35347-USD.CC",
"KAITO-USD.CC": "KAITO-USD.CC",
"SUI-USD.CC": "SUI20947-USD.CC",
"BERA-USD.CC": "BERA-USD.CC",
"MEW-USD.CC": "MEW30126-USD.CC",
"ANIME-USD.CC": "ANIME35319-USD.CC",
"NEIRO-USD.CC": "NEIRO32521-USD.CC",
"DOGS-USD.CC": "DOGS32698-USD.CC",
"STX-USD.CC": "STX4847-USD.CC",
"S-USD.CC": "S32684-USD.CC",
"COMP-USD.CC": "COMP5692-USD.CC",
"TRUMP-USD.CC": "TRUMP-OFFICIAL-USD.CC",
"BLAST-USD.CC": "BLAST28480-USD.CC",
"TAO-USD.CC": "TAO22974-USD.CC",
"SAGA-USD.CC": "SAGA30372-USD.CC",
"TON-USD.CC": "TON11419-USD.CC",
"BIO-USD.CC": "BIO.CC",
"GMX-USD.CC": "GMX11857-USD.CC",
"NTRN-USD.CC": "NTRN26680-USD.CC",
"SUPER-USD.CC": "SUPER8290-USD.CC",
"SCR-USD.CC": "SCR26998-USD.CC",
"BANANA-USD.CC": "BANANA28066-USD.CC",
"ME-USD.CC": "ME32197-USD.CC",
"GMT-USD.CC": "GMT18069-USD.CC",
"IO-USD.CC": "IO29835-USD.CC",
"ZK-USD.CC": "ZKSYNC.CC",
"ALT-USD.CC": "ALT29073-USD.CC",
"POL-USD.CC": "POL28321-USD.CC",
"WCT-USD.CC": "WCT33152-USD.CC",
"XAI-USD.CC": "XAI28933-USD.CC",
"JUP-USD.CC": "JUP29210-USD.CC",
"APE-USD.CC": "APE3-USD.CC",
"SPX-USD.CC": "SPX28081-USD.CC",
"HYPER-USD.CC": "HYPER36281-USD.CC",
"IP-USD.CC": "IP-USD.CC",
"ZORA-USD.CC": "ZORA35931-USD.CC",
"PEOPLE-USD.CC": "PEOPLE-USD.CC",
"BABY-USD.CC": "BABY32198-USD.CC",
"ARB-USD.CC": "ARB11841-USD.CC",
"UNI-USD.CC": "UNI7083-USD.CC",
"OMNI-USD.CC": "OMNI30315-USD.CC",
"SOPH-USD.CC": "SOPHON-USD.CC",
"NEIROETH-USD.CC": "NEIRO-USD.CC",
"APT-USD.CC": "APT21794-USD.CC",
"STRK-USD.CC": "STRK22691-USD.CC",
"RESOLV-USD.CC": "RESOLV-USD.CC",
"TST-USD.CC": "TST35647-USD.CC",
"PUMP-USD.CC": "PUMP29601-USD.CC",
"WLFI-USD.CC": "WLFI33251-USD.CC",
"ASTER-USD.CC": "ASTER36341-USD.CC",
"SKY-USD.CC": "SKY33038-USD.CC",
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    data = response.json()
    universe = data.get("universe", [])

    # Extract names of perpetuals based on delisted status
    perpetual_ids = [
        perp["name"]
        for perp in universe
        if include_delisted or not perp.get("isDelisted", False)
    ]

    # Create DataFrame with just id column, then derive the EODHD symbol
    df = pl.DataFrame({"id": perpetual_ids})
    df = df.with_columns(
        (pl.col("id") + "-USD.CC")
        .str.replace("k", "")  # strip Hyperliquid's lowercase "k" prefix (e.g. kPEPE) before mapping
        .replace(special_map_dict)
        .alias("eodhd_id")
    )
    return df
# Get the Hyperliquid universe
perpetuals_df = get_hyperliquid_perpetuals(include_delisted=True)
print(perpetuals_df)
# Extract ticker lists
eodhd_tickers = perpetuals_df["eodhd_id"].to_list()
id_mapping = dict(perpetuals_df.select("eodhd_id", "id").iter_rows())
print(f"\n📊 Dataset will include {len(eodhd_tickers)} cryptocurrencies")
print(f"Sample tickers: {eodhd_tickers[:10]}")
print(f"Sample IDs: {list(id_mapping.values())[:10]}")
shape: (216, 2)
┌───────┬──────────────┐
│ id    ┆ eodhd_id     │
│ ---   ┆ ---          │
│ str   ┆ str          │
╞═══════╪══════════════╡
│ BTC   ┆ BTC-USD.CC   │
│ ETH   ┆ ETH-USD.CC   │
│ ATOM  ┆ ATOM-USD.CC  │
│ MATIC ┆ MATIC-USD.CC │
│ DYDX  ┆ DYDX-USD.CC  │
│ …     ┆ …            │
│ HEMI  ┆ HEMI-USD.CC  │
│ APEX  ┆ APEX-USD.CC  │
│ 2Z    ┆ 2Z-USD.CC    │
│ ZEC   ┆ ZEC-USD.CC   │
│ MON   ┆ MON-USD.CC   │
└───────┴──────────────┘

📊 Dataset will include 216 cryptocurrencies
Sample tickers: ['BTC-USD.CC', 'ETH-USD.CC', 'ATOM-USD.CC', 'MATIC-USD.CC', 'DYDX-USD.CC', 'SOL-USD.CC', 'AVAX-USD.CC', 'BNB-USD.CC', 'APE3-USD.CC', 'OP-USD.CC']
Sample IDs: ['BTC', 'ETH', 'ATOM', 'MATIC', 'DYDX', 'SOL', 'AVAX', 'BNB', 'APE', 'OP']
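Before downloading, it's worth sanity-checking the symbol mapping: if two Hyperliquid ids ever resolved to the same EODHD symbol, their price histories would silently merge. A quick check, as a sketch:

```python
# Sketch: make sure no two Hyperliquid ids map to the same EODHD symbol.
dupes = (
    perpetuals_df.group_by("eodhd_id")
    .agg(pl.col("id"))
    .filter(pl.col("id").list.len() > 1)
)
print("✅ No duplicate EODHD symbols" if dupes.is_empty() else dupes)
```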
Step 2: Download Historical Data with numerblox¶
from numerblox.download import EODDownloader
# Set date range
start_date = "20200101"
end_date = datetime.now()
# Initialize EOD downloader
eod = EODDownloader(
directory_path="data",
key=EOD_API_KEY,
tickers=eodhd_tickers
)
eod.end_date = end_date
print("Downloading historical data...")
print("This may take a few minutes depending on number of tickers")
# Download data
eod.download_training_data(start=start_date)
# Load the downloaded data
filename = f"data/eod_{start_date}_{end_date.strftime('%Y%m%d')}.parquet"
eod_df = pl.read_parquet(filename)
eod_df = eod_df.with_columns(pl.col("date").str.to_datetime())
# Add clean ID column using Hyperliquid naming
eod_df = eod_df.with_columns(pl.col("ticker").replace(id_mapping).alias("id"))
# Check coverage
requested_tickers = len(eodhd_tickers)
downloaded_tickers = eod_df["ticker"].n_unique()
coverage_pct = (downloaded_tickers / requested_tickers) * 100
print(f"✅ Downloaded {len(eod_df)} rows for {downloaded_tickers} tickers")
print(f"📊 Coverage: {downloaded_tickers}/{requested_tickers} tickers ({coverage_pct:.1f}%)")
print(f"📅 Date range: {eod_df['date'].min()} to {eod_df['date'].max()}")
eod_df.head()
Downloading historical data...
This may take a few minutes depending on number of tickers

EOD price data extraction: 0%| | 0/216 [00:00<?, ?it/s]

WARNING: Date pull failed on ticker: 'HPOS-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'FRIEND-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'OX-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'SHIA-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'CANTO-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'NFTI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'ALT29073-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'PANDORA-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'AI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'TST35647-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: 'APEX-USD.CC'. Exception: "None of ['date'] are in the columns"
WARNING: Date pull failed on ticker: 'HEMI-USD.CC'. Exception: 404 Client Error: Not Found
WARNING: Date pull failed on ticker: '2Z-USD.CC'. Exception: 404 Client Error: Not Found

✅ Downloaded 219879 rows for 203 tickers
📊 Coverage: 203/216 tickers (94.0%)
📅 Date range: 2020-01-01 00:00:00 to 2025-10-08 00:00:00
open | high | low | close | adjusted_close | volume | ticker | date | id |
---|---|---|---|---|---|---|---|---|
f64 | f64 | f64 | f64 | f64 | i64 | str | datetime[μs] | str |
1.796991 | 1.796991 | 1.104477 | 1.331082 | 1.331082 | 2314949442 | "ARB11841-USD.CC" | 2023-03-23 00:00:00 | "ARB" |
1.325396 | 1.555872 | 1.18606 | 1.272492 | 1.272492 | 2537709581 | "ARB11841-USD.CC" | 2023-03-24 00:00:00 | "ARB" |
1.272393 | 1.307232 | 1.19297 | 1.224705 | 1.224705 | 1294894243 | "ARB11841-USD.CC" | 2023-03-25 00:00:00 | "ARB" |
1.224117 | 1.341853 | 1.208092 | 1.283315 | 1.283315 | 1059587959 | "ARB11841-USD.CC" | 2023-03-26 00:00:00 | "ARB" |
1.282521 | 1.320275 | 1.124495 | 1.162705 | 1.162705 | 1014240603 | "ARB11841-USD.CC" | 2023-03-27 00:00:00 | "ARB" |
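The 404 warnings above are expected for very new or renamed symbols. To see exactly which requested symbols never came back (candidates for a new special_map_dict entry), a quick sketch:

```python
# Sketch: list the requested EODHD symbols that are missing from the download.
downloaded = set(eod_df["ticker"].unique().to_list())
missing = sorted(set(eodhd_tickers) - downloaded)
print(f"Missing {len(missing)} of {len(eodhd_tickers)} requested tickers:")
print(missing)
```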
Visualize Raw Price Data¶
Let's look at what the raw downloaded data looks like - this is the "Input" to our feature engineering pipeline.
import altair as alt
alt.data_transformers.enable('vegafusion')
# Select top tickers by data coverage for visualization
top_tickers = (
eod_df.group_by("id")
.agg(pl.col("date").count().alias("count"))
.sort("count", descending=True)
.head(20)["id"]
.to_list()
)
# Create visualization of raw prices
viz_df = eod_df.filter(pl.col("id").is_in(top_tickers)).to_pandas()
chart = (
alt.Chart(viz_df)
.mark_line(opacity=0.6)
.encode(
x=alt.X("date:T", title="Date"),
y=alt.Y("close:Q", title="Close Price (USD)", scale=alt.Scale(type="log")),
color=alt.Color("id:N", title="Ticker", legend=alt.Legend(columns=2)),
tooltip=["id:N", "date:T", "close:Q"]
)
.properties(
width=700,
height=300,
title="Input: Raw Stock Prices Over Time (Log Scale)"
)
)
chart
Step 3: Create Ranking Targets with CrowdCent Library¶
We'll use the create_ranking_targets function from crowdcent-challenge to create standard 10-day and 30-day ranking targets. This matches exactly what CrowdCent uses in the hyperliquid-ranking challenge.
from crowdcent_challenge.scoring import create_ranking_targets
# Create standard 10d and 30d ranking targets using CrowdCent's official function
print("🎯 Creating ranking targets using CrowdCent's methodology...")
df = create_ranking_targets(
eod_df,
horizons=[10, 30], # Standard CrowdCent horizons
price_col="close",
date_col="date",
ticker_col="ticker",
return_raw_returns=False, # We only need the normalized targets
drop_incomplete=True # Drop rows without complete targets
)
print(f"✅ Created targets: target_10d, target_30d")
print(f"📊 Rows with complete targets: {len(df):,}")
# Show sample targets
print("\nSample targets:")
df.select(["date", "ticker", "id", "target_10d", "target_30d"]).tail(10)
🎯 Creating ranking targets using CrowdCent's methodology...
✅ Created targets: target_10d, target_30d
📊 Rows with complete targets: 213,841

Sample targets:
date | ticker | id | target_10d | target_30d |
---|---|---|---|---|
datetime[μs] | str | str | f64 | f64 |
2025-08-30 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.974093 | 0.963731 |
2025-08-31 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.958549 | 0.994819 |
2025-09-01 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.92228 | 0.989637 |
2025-09-02 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.782383 | 1.0 |
2025-09-03 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.849741 | 1.0 |
2025-09-04 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.92228 | 1.0 |
2025-09-05 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.906736 | 1.0 |
2025-09-06 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.953368 | 1.0 |
2025-09-07 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.611399 | 1.0 |
2025-09-08 00:00:00 | "ZEC-USD.CC" | "ZEC" | 0.378238 | 1.0 |
Visualize Target Distributions¶
The ranking targets are normalized to [0, 1] where 0 means worst performer and 1 means best performer on that day. This cross-sectional ranking approach means our model predicts relative performance, not absolute returns.
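To make the target construction concrete, here is an illustrative sketch of cross-sectional rank normalization for a single date. It mirrors the idea behind create_ranking_targets (rank forward returns across assets, scale to [0, 1]) but is not necessarily its exact implementation:

```python
# Illustrative sketch: cross-sectional rank target for one date.
one_day = pl.DataFrame({
    "id": ["BTC", "ETH", "SOL", "DOGE"],
    "fwd_return_10d": [0.05, -0.02, 0.12, 0.01],  # hypothetical 10-day forward returns
})
one_day = one_day.with_columns(
    ((pl.col("fwd_return_10d").rank() - 1) / (pl.len() - 1)).alias("target_10d_sketch")
)
print(one_day)  # SOL (best return) -> 1.0, ETH (worst) -> 0.0
```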
# Sample recent data for visualization
sample_dates = df["date"].unique().sort().tail(30).to_list()
target_viz_df = df.filter(pl.col("date").is_in(sample_dates)).to_pandas()
# Create histograms for both targets
hist_10d = (
alt.Chart(target_viz_df)
.mark_bar(opacity=0.7)
.encode(
x=alt.X("target_10d:Q", bin=alt.Bin(maxbins=30), title="Target 10d Value"),
y=alt.Y("count()", title="Count"),
tooltip=["count()"]
)
.properties(
width=350,
height=200,
title="10-Day Target Distribution (Last 30 Days)"
)
)
hist_30d = (
alt.Chart(target_viz_df)
.mark_bar(opacity=0.7, color="orange")
.encode(
x=alt.X("target_30d:Q", bin=alt.Bin(maxbins=30), title="Target 30d Value"),
y=alt.Y("count()", title="Count"),
tooltip=["count()"]
)
.properties(
width=350,
height=200,
title="30-Day Target Distribution (Last 30 Days)"
)
)
alt.hconcat(hist_10d, hist_30d)
Step 4: Engineer features with centimators¶
This is the step where you can and should put most of your energy. Try different feature engineering strategies, different moving average windows, lag windows, and anything else you can think of.
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import (
LogReturnTransformer,
RankTransformer,
LagTransformer,
MovingAverageTransformer,
)
# Enable metadata routing for sklearn
set_config(enable_metadata_routing=True)
print("🔧 Building feature engineering pipeline...")
# Define transformers with custom parameters
log_return_transformer = LogReturnTransformer().set_transform_request(ticker_series=True)
ranker = RankTransformer().set_transform_request(date_series=True)
ma_transformer = MovingAverageTransformer(
windows=[2, 10] # Custom moving average windows
).set_transform_request(ticker_series=True)
lagger = LagTransformer(windows=[0, 5, 10, 15, 20]).set_transform_request(ticker_series=True) # Custom lag windows
# Create feature pipeline
feature_pipeline = make_pipeline(
log_return_transformer,
ranker,
ma_transformer,
lagger,
verbose=True
)
feature_pipeline
🔧 Building feature engineering pipeline...
Pipeline(steps=[('logreturntransformer', LogReturnTransformer()),
                ('ranktransformer', RankTransformer()),
                ('movingaveragetransformer', MovingAverageTransformer(windows=[2, 10])),
                ('lagtransformer', LagTransformer(windows=[20, 15, 10, 5, 0]))],
         verbose=True)
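The windows above are just one configuration. The same transformers accept other window sets, so a natural experiment is to build variants and compare them; for example (the values here are arbitrary):

```python
# Sketch: a variant pipeline with longer moving averages and weekly-spaced lags.
alt_pipeline = make_pipeline(
    LogReturnTransformer().set_transform_request(ticker_series=True),
    RankTransformer().set_transform_request(date_series=True),
    MovingAverageTransformer(windows=[5, 20, 60]).set_transform_request(ticker_series=True),
    LagTransformer(windows=[0, 7, 14, 21]).set_transform_request(ticker_series=True),
)
# Fit/transform it exactly like feature_pipeline in the next cell.
```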
# Apply feature engineering
input_features = ["close", "open", "high", "low", "volume"]
print(f"Transforming {input_features} into features...")
feature_df = feature_pipeline.fit_transform(
df[input_features],
date_series=df["date"],
ticker_series=df["ticker"]
)
# Get feature names and add to dataframe
feature_names = feature_pipeline.get_feature_names_out()
df = df.with_columns(feature_df).drop_nulls(subset=feature_names)
print(f"\n✅ Created {len(feature_names)} features")
print(f"Final dataset shape: {df.shape}")
print(f"\nFeature names:")
for i, name in enumerate(feature_names[:10]):
    print(f" {name}")
if len(feature_names) > 10:
    print(f" ... and {len(feature_names) - 10} more")
df.head()
Transforming ['close', 'open', 'high', 'low', 'volume'] into features...
[Pipeline] (step 1 of 4) Processing logreturntransformer, total= 0.0s
[Pipeline] ... (step 2 of 4) Processing ranktransformer, total= 0.0s
[Pipeline] (step 3 of 4) Processing movingaveragetransformer, total= 0.0s
[Pipeline] .... (step 4 of 4) Processing lagtransformer, total= 0.0s

✅ Created 50 features
Final dataset shape: (207972, 61)

Feature names:
 close_logreturn_rank_ma2_lag20
 close_logreturn_rank_ma10_lag20
 open_logreturn_rank_ma2_lag20
 open_logreturn_rank_ma10_lag20
 high_logreturn_rank_ma2_lag20
 high_logreturn_rank_ma10_lag20
 low_logreturn_rank_ma2_lag20
 low_logreturn_rank_ma10_lag20
 volume_logreturn_rank_ma2_lag20
 volume_logreturn_rank_ma10_lag20
 ... and 40 more
df.head() preview (5 rows × 61 columns): the original OHLCV columns plus ticker, date, id, target_10d, target_30d, and the 50 engineered feature columns (close_logreturn_rank_ma2_lag20 through volume_logreturn_rank_ma10_lag0), each an f64 in [0, 1]. Full-width table omitted.
Visualize Pipeline Output: Engineered Features¶
After feature engineering, all features are normalized and cross-sectionally ranked. This makes them comparable across different tickers and time periods. Let's visualize a few features for our top tickers.
# Select a few representative features to visualize
sample_features = [
"close_logreturn_rank_ma10_lag0", # Most recent smoothed close feature
"volume_logreturn_rank_ma10_lag0", # Most recent smoothed volume feature
]
# Filter to top tickers and recent dates for cleaner visualization
recent_dates = df["date"].unique().sort().tail(100).to_list()
feature_viz_df = (
df.filter(pl.col("id").is_in(top_tickers[:10]))
.filter(pl.col("date").is_in(recent_dates))
.select(["id", "date"] + sample_features)
.to_pandas()
)
# Melt for plotting multiple features
feature_viz_melted = feature_viz_df.melt(
id_vars=["id", "date"],
value_vars=sample_features,
var_name="feature",
value_name="value"
)
# Create feature visualization
feature_chart = (
alt.Chart(feature_viz_melted)
.mark_line(opacity=0.6)
.encode(
x=alt.X("date:T", title="Date"),
y=alt.Y("value:Q", title="Normalized Feature Value [0, 1]", scale=alt.Scale(domain=[0, 1])),
color=alt.Color("id:N", title="Ticker"),
strokeDash=alt.StrokeDash("feature:N", title="Feature Type"),
tooltip=["id:N", "date:T", "feature:N", "value:Q"]
)
.properties(
width=700,
height=300,
title="Pipeline Output: Normalized/Smoothed Features (Recent 100 Days)"
)
)
feature_chart
What changed after the pipeline:
- All features now range from 0 to 1 (normalized through ranking)
- Different price levels no longer matter - we're comparing relative performance
- Features are smoothed (moving averages reduce noise)
- The data is now ready for machine learning models to find patterns
This transformation is crucial for cross-sectional ranking models where we predict which assets will outperform others, not their absolute price movements.
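Because every feature is rank-based, a cheap data-quality check is to confirm the engineered columns really do stay inside [0, 1] and contain no nulls after the drop_nulls step; a minimal sketch:

```python
# Sketch: data-quality check on the engineered features.
feat = df.select(list(feature_names))
print("global min:", feat.min().to_numpy().min())
print("global max:", feat.max().to_numpy().max())
print("total nulls:", feat.null_count().to_numpy().sum())
```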
Step 5: Structure Dataset for ML¶
# Create final dataset structure
final_df = (
df.rename({"ticker": "eodhd_id"})
.select([
"id",
"eodhd_id",
"date"
] + list(feature_names) + [
"target_10d",
"target_30d"
])
)
print(f"📊 Final Dataset Summary:")
print(f"Shape: {final_df.shape}")
print(f"Date range: {final_df['date'].min()} to {final_df['date'].max()}")
print(f"Cryptocurrencies: {final_df['id'].n_unique()}")
print(f"Features: {len(feature_names)}")
print(f"Targets: 2 (target_10d, target_30d)")
# Show sample with first few features
sample_cols = ["id", "date"] + list(feature_names)[:3] + ["target_10d", "target_30d"]
print(f"\nSample data:")
final_df.select(sample_cols).head()
📊 Final Dataset Summary:
Shape: (207972, 55)
Date range: 2020-02-01 00:00:00 to 2025-09-09 00:00:00
Cryptocurrencies: 195
Features: 50
Targets: 2 (target_10d, target_30d)

Sample data:
id | date | close_logreturn_rank_ma2_lag20 | close_logreturn_rank_ma10_lag20 | open_logreturn_rank_ma2_lag20 | target_10d | target_30d |
---|---|---|---|---|---|---|
str | datetime[μs] | f64 | f64 | f64 | f64 | f64 |
"ARB" | 2023-04-23 00:00:00 | 0.063158 | 0.351579 | 0.026316 | 0.14433 | 0.185567 |
"ARB" | 2023-04-24 00:00:00 | 0.468421 | 0.416842 | 0.057895 | 0.608247 | 0.474227 |
"ARB" | 2023-04-25 00:00:00 | 0.668421 | 0.38 | 0.468421 | 0.587629 | 0.360825 |
"ARB" | 2023-04-26 00:00:00 | 0.373684 | 0.402105 | 0.668421 | 0.546392 | 0.350515 |
"ARB" | 2023-04-27 00:00:00 | 0.347368 | 0.362105 | 0.373684 | 0.216495 | 0.42268 |
Step 6: Save Dataset¶
# Save to parquet
output_path = "custom_crypto_dataset.parquet"
final_df.write_parquet(output_path)
# Get file size, why not?
file_size_mb = os.path.getsize(output_path) / 1024 / 1024
print(f"✅ Dataset saved to: {output_path}")
print(f"File size: {file_size_mb:.2f} MB")
print(f"Rows: {len(final_df):,}")
print(f"\nYou can now use this dataset for ML experiments!")
✅ Dataset saved to: custom_crypto_dataset.parquet
File size: 49.20 MB
Rows: 207,972

You can now use this dataset for ML experiments!
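As a quick round-trip check, you can reload the file and confirm that shape and columns survived the write; a minimal sketch:

```python
# Sketch: verify the parquet round-trip.
reloaded = pl.read_parquet(output_path)
assert reloaded.shape == final_df.shape, "row/column count changed on reload"
assert reloaded.columns == final_df.columns, "column names/order changed on reload"
print("✅ Round-trip OK:", reloaded.shape)
```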
Step 7: Quick Model Validation¶
Let's verify our dataset works with a simple train-test split and xgboost model to ensure data quality.
# Prepare features
feature_cols = list(feature_names)
print(f"Feature columns: {len(feature_cols)} features")
# Time-based split with embargo period
embargo_days = 30
sorted_dates = final_df["date"].unique().sort()
split_idx = int(len(sorted_dates) * 0.8)
split_date = sorted_dates[split_idx]
embargo_end = split_date + pl.duration(days=embargo_days)
print(f"Split date: {split_date}")
print(f"Embargo: {embargo_days} days")
train_df = final_df.filter(pl.col("date") < split_date)
test_df = final_df.filter(pl.col("date") > embargo_end)
print(f"Train period: {train_df['date'].min()} to {train_df['date'].max()} ({len(train_df)} rows)")
print(f"Test period: {test_df['date'].min()} to {test_df['date'].max()} ({len(test_df)} rows)")
Feature columns: 50 features
Split date: 2024-07-27 00:00:00
Embargo: 30 days
Train period: 2020-02-01 00:00:00 to 2024-07-26 00:00:00 (138679 rows)
Test period: 2024-08-27 00:00:00 to 2025-09-09 00:00:00 (64829 rows)
from xgboost import XGBRegressor
from crowdcent_challenge.scoring import evaluate_hyperliquid_submission
# Train simple, untuned multi-output XGBoost model for both 10d and 30d targets
model = XGBRegressor(n_estimators=500, random_state=42, verbosity=0, device="cuda")
X_train = train_df[feature_cols].to_pandas()
y_train = train_df[["target_10d", "target_30d"]].to_pandas()
print(f"Training on {X_train.shape} features, {y_train.shape} targets")
print("This may take a few minutes depending on number of features, model parameters, GPU, etc...")
model.fit(X_train, y_train)
# Make predictions for both horizons
X_test = test_df[feature_cols].to_pandas()
test_preds = model.predict(X_test)
# Extract predictions for both horizons and create predictions dataframe
test_results = test_df.select(["date", "id", "target_10d", "target_30d"]).with_columns([
pl.Series("pred_10d", test_preds[:, 0]),
pl.Series("pred_30d", test_preds[:, 1])
])
# Evaluate per date (cross-sectionally) then average
daily_scores = (
test_results
.group_by("date")
.agg([
pl.col("target_10d"),
pl.col("pred_10d"),
pl.col("target_30d"),
pl.col("pred_30d")
])
.with_columns(
pl.struct(["pred_10d", "target_10d", "pred_30d", "target_30d"])
.map_elements(
lambda x: evaluate_hyperliquid_submission(
y_true_10d=x["target_10d"],
y_pred_10d=x["pred_10d"],
y_true_30d=x["target_30d"],
y_pred_30d=x["pred_30d"],
),
return_dtype=pl.Struct,
)
.alias("metrics")
)
.unnest("metrics")
.sort("date")
)
# Average scores across all test dates
metrics = ["spearman_10d", "spearman_30d", "ndcg@40_10d", "ndcg@40_30d"]
avg_scores = daily_scores.select([pl.col(metric).mean() for metric in metrics]).to_dicts()[0]
print(f"\n📈 Model Validation Results (CrowdCent Official Metrics):")
print(f"Spearman Correlation 10d: {avg_scores['spearman_10d']:.4f}")
print(f"Spearman Correlation 30d: {avg_scores['spearman_30d']:.4f}")
print(f"NDCG@40 10d: {avg_scores['ndcg@40_10d']:.4f}")
print(f"NDCG@40 30d: {avg_scores['ndcg@40_30d']:.4f}")
print(f"\n🎯 Your custom dataset is ready for advanced modeling!")
Training on (138679, 50) features, (138679, 2) targets
This may take a few minutes depending on number of features, model parameters, GPU, etc...

📈 Model Validation Results (CrowdCent Official Metrics):
Spearman Correlation 10d: 0.1805
Spearman Correlation 30d: 0.1027
NDCG@40 10d: 0.6406
NDCG@40 30d: 0.6017

🎯 Your custom dataset is ready for advanced modeling!
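Averages can hide unstable stretches, so it's worth plotting the per-date scores in daily_scores over the test period; a sketch using the same Altair setup as earlier:

```python
# Sketch: daily Spearman correlation over the test period.
stability_df = daily_scores.select(["date", "spearman_10d", "spearman_30d"]).to_pandas()
stability_chart = (
    alt.Chart(stability_df.melt(id_vars="date", var_name="metric", value_name="score"))
    .mark_line(opacity=0.7)
    .encode(
        x=alt.X("date:T", title="Date"),
        y=alt.Y("score:Q", title="Daily Spearman"),
        color=alt.Color("metric:N", title="Horizon"),
        tooltip=["date:T", "metric:N", "score:Q"],
    )
    .properties(width=700, height=250, title="Daily Spearman (Test Period)")
)
stability_chart
```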
Next Steps¶
🎉 Congratulations! You've built a complete crypto prediction dataset. Here's what to try next, although the list is endless:
Ready to submit predictions with your new dataset? Check out the Hyperliquid End-to-End Tutorial, where you'll learn how to download inference data, make predictions, and submit to CrowdCent competitions using your custom model. You'll need to replace the data download steps with your own production pipeline built on this custom dataset!
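As a rough bridge to that tutorial, scoring the most recent rows of your custom dataset looks something like the sketch below. Note that drop_incomplete=True removed the newest dates (they have no complete forward targets yet), so a real production pipeline must recompute features without targets for the live date; the pred_10d/pred_30d column names simply mirror the validation code above, not necessarily the official submission schema.

```python
# Rough sketch: score the latest date present in the custom dataset.
latest_date = final_df["date"].max()
latest = final_df.filter(pl.col("date") == latest_date)
latest_preds = model.predict(latest[feature_cols].to_pandas())
submission_sketch = latest.select("id").with_columns([
    pl.Series("pred_10d", latest_preds[:, 0]),
    pl.Series("pred_30d", latest_preds[:, 1]),
])
print(submission_sketch.head())  # see the end-to-end tutorial for the exact submission format
```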
Immediate experiments and next steps:
- Try different models: Random Forest, Neural Networks, and other model estimators from Centimators like LSTMRegressor
- Adjust time horizons: Create 3d, 14d, or 50d targets (see the sketch after this list)
- Additional data cleaning: Check the raw data for anomalies and remove or adjust samples
- Feature engineering: Add lags, differences, and other feature combinations. Centimators provides additional feature transformers like GroupStatsTransformer and will keep adding more!
- Cross-validation: Use walk-forward, time-series CV for robust validation like in this advanced tutorial notebook
- More data sources: Add social sentiment, onchain data, or macro indicators
- Ensemble methods: Combine multiple models and time horizons
- Real-time pipeline: Set up automated data updates and retraining
- Automate submissions: Use techniques from the submission automation guide to set it and forget it
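For example, the "adjust time horizons" item above can reuse create_ranking_targets from Step 3 directly; a minimal sketch (the target_3d/target_14d/target_50d column names are assumed to follow the target_{h}d pattern seen earlier):

```python
# Sketch: alternative-horizon targets with the same function used in Step 3.
df_alt = create_ranking_targets(
    eod_df,
    horizons=[3, 14, 50],
    price_col="close",
    date_col="date",
    ticker_col="ticker",
    return_raw_returns=False,
    drop_incomplete=True,
)
print(df_alt.tail())  # expect target_3d, target_14d, target_50d columns
```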
Other Resources¶
- EODHD API: https://eodhd.com/pricing-special-10?via=crowdcent
- Centimators GitHub: https://github.com/crowdcent/centimators
- CrowdCent Docs: https://docs.crowdcent.com
- Join our Discord: Join here to get help and share your results with the community
Happy modeling! 🧪