Glossary
This glossary provides definitions for key terms used in the CrowdCent Challenge, organized alphabetically.
Challenge-Specific Columns
The exact feature names, target column names (e.g., target_1M, target_10d), and required prediction column names (e.g., pred_1M, pred_10d) can vary significantly from one challenge to another.
Always refer to the specific rules and data description for the challenge you are participating in.
The examples provided in this documentation are illustrative and may not apply to all challenges.
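Because column names differ between challenges, it can help to derive the expected prediction columns from the data instead of hard-coding them. The sketch below assumes, as in the examples above, that each target column named target_<horizon> pairs with a prediction column named pred_<horizon>. That pairing and the helper name prediction_columns are illustrative assumptions, so confirm them against the rules of your challenge.

```python
def prediction_columns(columns, target_prefix="target_", pred_prefix="pred_"):
    """Map each target column to the prediction column it would pair with.

    Assumption: predictions reuse the target's horizon suffix
    (target_1M -> pred_1M). Confirm this against your challenge's rules.
    """
    return {
        col: pred_prefix + col[len(target_prefix):]
        for col in columns
        if col.startswith(target_prefix)
    }


# Column names taken from the illustrative examples in this documentation.
training_columns = ["id", "market_0", "target_1M", "target_10d"]
mapping = prediction_columns(training_columns)
print(mapping)  # {'target_1M': 'pred_1M', 'target_10d': 'pred_10d'}
```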
Terminology
Categorical Features
Features representing discrete categories, typically obfuscated and named like categorical_N (where N is a number). These features contain integer values representing different categories.
Challenge
A top-level competition (e.g., "Crypto Ranking") focused on a specific prediction task. A challenge has:
- A unique slug (identifier in URLs)
- A name and description
- Start and end dates
- Associated training datasets and inference periods
Features
Data attributes used for prediction. In CrowdCent challenges, features may include:
NLP Features
- Generated by processing text data (e.g., analyst's investment write-up)
- Include embeddings related to different aspects (e.g., catalysts, financials, management)
- Named like nlp_groupname_N (e.g., nlp_catalyst_0, nlp_financials_15)
- Contain numerical values representing elements of embedding vectors
Point-in-Time Features
- Based on market conditions, fundamental data, or other factors known at a specific point in time
- Obfuscated and named like fundamental_N, market_N, etc.
- Contain numerical values
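The naming conventions above make it possible to group feature columns by family from their names alone. The following is a minimal sketch that assumes every feature name ends in an underscore followed by a number, as in the examples here. The helper name group_feature_columns and the regular expression are illustrative, not part of any CrowdCent API.

```python
import re
from collections import defaultdict

# Matches obfuscated names such as market_3, categorical_0, or nlp_catalyst_0.
# The family is everything before the trailing _<number>.
FEATURE_PATTERN = re.compile(r"^(?P<family>[A-Za-z]+(?:_[A-Za-z]+)*)_(?P<index>\d+)$")


def group_feature_columns(columns):
    """Group feature column names by family, skipping id/target/pred columns."""
    groups = defaultdict(list)
    for col in columns:
        if col.startswith(("target_", "pred_")):
            continue  # targets and predictions are not features
        match = FEATURE_PATTERN.match(col)
        if match is None:
            continue  # e.g. "id"
        groups[match.group("family")].append(col)
    return dict(groups)


columns = [
    "id", "nlp_catalyst_0", "nlp_catalyst_1", "fundamental_0",
    "market_0", "market_1", "categorical_0", "target_1M",
]
groups = group_feature_columns(columns)
for family, names in groups.items():
    print(family, names)
```

Grouping this way lets you treat categorical columns differently from numerical ones, for example when choosing an encoding.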
Fundamental Features
Features derived from fundamental data related to the investment idea, typically obfuscated and named like fundamental_N (where N is a number).
Identifier (id)
A unique numerical identifier assigned to each record in the dataset. The id is obfuscated and does not directly reveal the underlying security or analyst.
Inference Data
Periodic releases of new data for which you will make predictions. Each inference data period has:
- A release date
- A submission deadline
- A Parquet file containing features (but not targets) for inference
Market Features
Features derived from market conditions at the time the investment idea was submitted, typically obfuscated and named like market_N (where N is a number).
NLP Features
Features generated by processing text data, named like nlp_groupname_N (e.g., nlp_catalyst_0). These contain numerical values representing elements of embedding vectors.
Parquet
The file format used for all datasets in the CrowdCent Challenge. Parquet is a columnar storage file format optimized for analytical processing.
Prediction Target
The values you aim to predict. Typically these are return values for different time horizons. Common prediction targets include:
- pred_1M: Predicted return after 1 month
- pred_3M: Predicted return after 3 months
- pred_6M: Predicted return after 6 months
- pred_9M: Predicted return after 9 months
- pred_12M: Predicted return after 12 months
Submission
Your predictions for an inference data period. Submissions must be in Parquet format with the following columns:
Column | Description |
---|---|
id | Unique identifier matching the inference data |
pred_1M | Predicted 1-month return |
pred_3M | Predicted 3-month return |
pred_6M | Predicted 6-month return |
pred_9M | Predicted 9-month return |
pred_12M | Predicted 12-month return |
All prediction columns must contain numeric values (float or integer). Missing values are not allowed.
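The requirements above can be checked locally before uploading. This is a minimal sketch using pandas, not the platform's own validation. The function name validate_submission is hypothetical, and the checks that ids match the inference data exactly and contain no duplicates are assumptions drawn from the column description.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_submission(submission, inference_ids, pred_columns):
    """Return a list of problems found; an empty list means the checks passed."""
    missing = [c for c in ["id", *pred_columns] if c not in submission.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    # Assumption: every inference id must appear, and no others.
    if set(submission["id"].tolist()) != set(inference_ids):
        problems.append("ids do not match the inference data")
    for col in pred_columns:
        if not is_numeric_dtype(submission[col]):
            problems.append(f"{col} is not numeric")
        elif submission[col].isna().any():
            problems.append(f"{col} contains missing values")
    return problems


pred_columns = ["pred_1M", "pred_3M", "pred_6M", "pred_9M", "pred_12M"]
good = pd.DataFrame({"id": [2001, 2002], **{c: [0.1, -0.1] for c in pred_columns}})
bad = good.assign(pred_3M=[0.1, float("nan")])

print(validate_submission(good, [2001, 2002], pred_columns))
print(validate_submission(bad, [2001, 2002], pred_columns))
```

Returning a list of problems instead of raising on the first one makes it easier to fix every issue in a single pass.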
Target
The ground truth values in training data that your model aims to predict. Targets are typically labeled as target_1M, target_3M, etc., corresponding to returns over different time horizons.
Training Dataset
Labeled data used for building your models. Each training dataset has:
- A version (e.g., "1.0", "2.1")
- Feature and target descriptions
- A Parquet file containing features and ground truth labels
Training datasets may be updated with new versions over time, but only one version is marked as "latest" at any time.
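Since a training dataset holds features and ground truth labels in the same file, a common first step is to separate them. The sketch below assumes the data is already loaded into a pandas DataFrame, that targets share the target_ prefix, and that the identifier column is named id. The helper name split_features_targets is illustrative.

```python
import pandas as pd


def split_features_targets(training, id_column="id", target_prefix="target_"):
    """Split a training frame into a feature frame and a target frame."""
    target_cols = [c for c in training.columns if c.startswith(target_prefix)]
    feature_cols = [
        c for c in training.columns if c != id_column and c not in target_cols
    ]
    return training[feature_cols], training[target_cols]


# Fictional values, in the same spirit as the example tables in this glossary.
training = pd.DataFrame(
    {
        "id": [1001, 1002],
        "market_0": [105.50, 32.15],
        "categorical_0": [1, 3],
        "target_1M": [0.05, -0.02],
        "target_3M": [0.12, -0.08],
    }
)
features, targets = split_features_targets(training)
print(list(features.columns))
print(list(targets.columns))
```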
Example Data Formats
The following examples illustrate what the data formats look like. The values are entirely fictional and intended only to demonstrate the format.
Example Training Data:
id | nlp_catalyst_0 | nlp_catalyst_1 | fundamental_0 | market_0 | market_1 | categorical_0 | target_1M | target_3M | target_6M | target_9M | target_12M |
---|---|---|---|---|---|---|---|---|---|---|---|
1001 | 0.123456 | -0.987654 | 12.340000 | 105.50 | 250000 | 1 | 0.05 | 0.12 | 0.18 | 0.22 | 0.25 |
1002 | -0.543210 | 0.010101 | 0.987000 | 32.15 | 1500000 | 3 | -0.02 | -0.08 | -0.15 | -0.10 | -0.05 |
1003 | 0.678901 | 1.234567 | -5.670000 | 210.75 | 750000 | 1 | 0.10 | 0.25 | 0.40 | 0.45 | 0.50 |
Example Inference Data (for prediction):
id | nlp_catalyst_0 | nlp_catalyst_1 | fundamental_0 | market_0 | market_1 | categorical_0 |
---|---|---|---|---|---|---|
2001 | 0.234567 | -0.876543 | 15.430000 | 110.25 | 275000 | 2 |
2002 | -0.432109 | 0.121212 | 1.876000 | 29.75 | 1300000 | 3 |
Example Submission (your predictions):
id | pred_1M | pred_3M | pred_6M | pred_9M | pred_12M |
---|---|---|---|---|---|
2001 | 0.06 | 0.14 | 0.19 | 0.23 | 0.27 |
2002 | -0.01 | -0.07 | -0.13 | -0.09 | -0.03 |
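To show how the inference and submission formats relate, here is a minimal sketch that assembles a submission frame from inference ids and per-horizon predictions. The helper name build_submission and the output file name are hypothetical, and the prediction values are the fictional ones from the table above, not the output of a real model.

```python
import pandas as pd


def build_submission(inference, predictions):
    """Combine inference ids with per-horizon predictions into one frame.

    `predictions` maps a prediction column name to a sequence of values
    aligned row by row with `inference`.
    """
    submission = pd.DataFrame({"id": inference["id"].to_numpy()})
    for column, values in predictions.items():
        if len(values) != len(inference):
            raise ValueError(
                f"{column} has {len(values)} values, expected {len(inference)}"
            )
        submission[column] = list(values)
    return submission


inference = pd.DataFrame({"id": [2001, 2002], "market_0": [110.25, 29.75]})
predictions = {
    "pred_1M": [0.06, -0.01],
    "pred_3M": [0.14, -0.07],
    "pred_6M": [0.19, -0.13],
    "pred_9M": [0.23, -0.09],
    "pred_12M": [0.27, -0.03],
}
submission = build_submission(inference, predictions)
print(submission)

# Writing Parquet requires an engine such as pyarrow to be installed:
# submission.to_parquet("submission.parquet", index=False)
```

Taking the ids directly from the inference frame keeps the submission rows aligned with the data they were predicted from.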