Glossary

This glossary provides definitions for key terms used in the CrowdCent Challenge, organized alphabetically.

Challenge-Specific Columns

The exact feature names, target column names (e.g., target_1M, target_10d), and required prediction column names (e.g., pred_1M, pred_10d) can vary significantly from one challenge to another. Always refer to the specific rules and data description for the challenge you are participating in. The examples provided in this documentation are illustrative and may not apply to all challenges.

Terminology

Categorical Features

Features representing discrete categories, typically obfuscated and named like categorical_N (where N is a number). These features contain integer values representing different categories.

Challenge

A top-level competition (e.g., "Crypto Ranking") focused on a specific prediction task. A challenge has:

A unique slug (identifier in URLs)
A name and description
Start and end dates
Associated training datasets and inference periods

Features

Data attributes used for prediction. In CrowdCent challenges, features may include:

NLP Features:
- Generated by processing text data (e.g., analyst's investment write-up)
- Include embeddings related to different aspects (e.g., catalysts, financials, management)
- Named like nlp_groupname_N (e.g., nlp_catalyst_0, nlp_financials_15)
- Contain numerical values representing elements of embedding vectors

Point-in-Time Features:
- Based on market conditions, fundamental data, or other factors known at a specific point in time
- Obfuscated and named like fundamental_N, market_N, etc.
- Contain numerical values
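The naming conventions above make it possible to sort columns into feature groups by name alone. The following is a minimal sketch, assuming a challenge uses exactly the prefixes shown in this glossary; the helper name `group_feature_columns` and the "other" bucket are illustrative and not part of any CrowdCent API.

```python
import re
from collections import defaultdict

# Patterns follow the naming examples in this glossary.
# A given challenge may use different names, so treat these as assumptions.
FEATURE_PATTERNS = {
    "nlp": re.compile(r"^nlp_\w+_\d+$"),
    "fundamental": re.compile(r"^fundamental_\d+$"),
    "market": re.compile(r"^market_\d+$"),
    "categorical": re.compile(r"^categorical_\d+$"),
}


def group_feature_columns(columns):
    """Map each feature group to the column names that match its pattern."""
    groups = defaultdict(list)
    for name in columns:
        for group, pattern in FEATURE_PATTERNS.items():
            if pattern.match(name):
                groups[group].append(name)
                break
        else:
            # Anything unrecognized (id, targets, ...) lands here.
            groups["other"].append(name)
    return dict(groups)


columns = [
    "id", "nlp_catalyst_0", "nlp_catalyst_1", "fundamental_0",
    "market_0", "market_1", "categorical_0", "target_1M",
]
grouped = group_feature_columns(columns)
```

Here `grouped["nlp"]` holds the two `nlp_catalyst_*` columns, while `id` and `target_1M` fall into the "other" bucket because they match none of the feature patterns.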

Fundamental Features

Features derived from fundamental data related to the investment idea, typically obfuscated and named like fundamental_N (where N is a number).

Identifier (`id`)

A unique numerical identifier assigned to each record in the dataset. The id is obfuscated and does not directly reveal the underlying security or analyst.

Inference Data

Periodic releases of new data for which you will make predictions. Each inference data period has:

A release date
A submission deadline
A Parquet file containing features (but not targets) for inference

Market Features

Features derived from market conditions at the time the investment idea was submitted, typically obfuscated and named like market_N (where N is a number).

NLP Features

Features generated by processing text data, named like nlp_groupname_N (e.g., nlp_catalyst_0). These contain numerical values representing elements of embedding vectors.

Parquet

The file format used for all datasets in the CrowdCent Challenge. Parquet is a columnar storage file format optimized for analytical processing.

Prediction Target

The values you aim to predict, typically returns over different time horizons. Prediction column names vary by challenge; common examples include:

pred_1M: Predicted return after 1 month
pred_3M: Predicted return after 3 months
pred_6M: Predicted return after 6 months
pred_9M: Predicted return after 9 months
pred_12M: Predicted return after 12 months

Submission

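Your predictions for an inference data period. Submissions must be in Parquet format and contain the id column plus the prediction columns required by the challenge. For a challenge with 1-month to 12-month horizons, the columns are: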

Column	Description
`id`	Unique identifier matching the inference data
`pred_1M`	Predicted 1-month return
`pred_3M`	Predicted 3-month return
`pred_6M`	Predicted 6-month return
`pred_9M`	Predicted 9-month return
`pred_12M`	Predicted 12-month return

All prediction columns must contain numeric values (float or integer). Missing values are not allowed.
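The rules above can be checked before a submission is written to Parquet. The sketch below works on plain Python dictionaries (one per row) so that it stays free of external dependencies; the function name `validate_submission`, the error messages, and the column list are illustrative assumptions, and the platform's own validation may differ.

```python
import math
from numbers import Real

# Illustrative column list from this glossary; check your challenge's rules.
REQUIRED_COLUMNS = ["id", "pred_1M", "pred_3M", "pred_6M", "pred_9M", "pred_12M"]


def validate_submission(rows, inference_ids, required=REQUIRED_COLUMNS):
    """Return a list of problems found; an empty list means the rows pass."""
    problems = []
    known_ids = set(inference_ids)
    pred_columns = [c for c in required if c != "id"]
    for i, row in enumerate(rows):
        absent = [c for c in required if c not in row]
        if absent:
            problems.append(f"row {i}: missing columns {absent}")
            continue
        if row["id"] not in known_ids:
            problems.append(f"row {i}: id {row['id']} not in inference data")
        for col in pred_columns:
            value = row[col]
            if value is None:
                problems.append(f"row {i}: {col} is missing")
            # bool is a subclass of int, so reject it explicitly.
            elif isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"row {i}: {col} is not numeric")
            elif math.isnan(value):
                problems.append(f"row {i}: {col} is NaN")
    return problems


submission = [
    {"id": 2001, "pred_1M": 0.06, "pred_3M": 0.14, "pred_6M": 0.19,
     "pred_9M": 0.23, "pred_12M": 0.27},
    {"id": 2002, "pred_1M": -0.01, "pred_3M": -0.07, "pred_6M": -0.13,
     "pred_9M": -0.09, "pred_12M": -0.03},
]
problems = validate_submission(submission, inference_ids=[2001, 2002])
```

Returning a list of problems rather than raising on the first one lets you see every issue in a single pass.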

Target

The ground truth values in training data that your model aims to predict. Targets are typically labeled as target_1M, target_3M, etc., corresponding to returns over different time horizons.
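When a challenge pairs each target_X column with a pred_X column, as in the examples in this glossary, the prediction column names can be derived from the target names. This is a small sketch under that assumption; `prediction_column_for` is a hypothetical helper, and a challenge's rules remain the authority on the required names.

```python
def prediction_column_for(target_column):
    """Turn a name like 'target_1M' into 'pred_1M'."""
    prefix = "target_"
    if not target_column.startswith(prefix):
        raise ValueError(f"not a target column: {target_column!r}")
    # Keep the horizon suffix (e.g. '1M', '10d') and swap the prefix.
    return "pred_" + target_column[len(prefix):]


targets = ["target_1M", "target_3M", "target_10d"]
pred_columns = [prediction_column_for(t) for t in targets]
```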

Training Dataset

Labeled data used for building your models. Each training dataset has:

A version (e.g., "1.0", "2.1")
Feature and target descriptions
A Parquet file containing features and ground truth labels

Training datasets may be updated with new versions over time, but only one version is marked as "latest" at any time.

Example Data Formats

The following examples illustrate what the data formats look like. The values are entirely fictional and intended only to demonstrate the format.

Example Training Data:

id	nlp_catalyst_0	nlp_catalyst_1	fundamental_0	market_0	market_1	categorical_0	target_1M	target_3M	target_6M	target_9M	target_12M
1001	0.123456	-0.987654	12.340000	105.50	250000	1	0.05	0.12	0.18	0.22	0.25
1002	-0.543210	0.010101	0.987000	32.15	1500000	3	-0.02	-0.08	-0.15	-0.10	-0.05
1003	0.678901	1.234567	-5.670000	210.75	750000	1	0.10	0.25	0.40	0.45	0.50

Example Inference Data (for prediction):

id	nlp_catalyst_0	nlp_catalyst_1	fundamental_0	market_0	market_1	categorical_0
2001	0.234567	-0.876543	15.430000	110.25	275000	2
2002	-0.432109	0.121212	1.876000	29.75	1300000	3

Example Submission (your predictions):

id	pred_1M	pred_3M	pred_6M	pred_9M	pred_12M
2001	0.06	0.14	0.19	0.23	0.27
2002	-0.01	-0.07	-0.13	-0.09	-0.03