Glossary

This glossary provides definitions for key terms used in the CrowdCent Challenge, organized alphabetically.

Challenge-Specific Columns

The exact feature names, target column names (e.g., target_1M, target_10d), and required prediction column names (e.g., pred_1M, pred_10d) can vary significantly from one challenge to another. Always refer to the specific rules and data description for the challenge you are participating in. The examples provided in this documentation are illustrative and may not apply to all challenges.

Terminology

Categorical Features

Features representing discrete categories, typically obfuscated and named like categorical_N (where N is a number). These features contain integer values representing different categories.

Challenge

A top-level competition (e.g., "Crypto Ranking") focused on a specific prediction task. A challenge has:

  • A unique slug (identifier in URLs)
  • A name and description
  • Start and end dates
  • Associated training datasets and inference periods

Features

Data attributes used for prediction. In CrowdCent challenges, features may include:

NLP Features:

  • Generated by processing text data (e.g., an analyst's investment write-up)
  • Include embeddings related to different aspects (e.g., catalysts, financials, management)
  • Named like nlp_groupname_N (e.g., nlp_catalyst_0, nlp_financials_15)
  • Contain numerical values representing elements of embedding vectors

Point-in-Time Features:

  • Based on market conditions, fundamental data, or other factors known at a specific point in time
  • Obfuscated and named like fundamental_N, market_N, etc.
  • Contain numerical values
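The naming conventions above make it possible to group columns by feature family from their names alone. The following is a minimal sketch, assuming the patterns shown in this glossary (nlp_groupname_N, fundamental_N, market_N, categorical_N); the FAMILY_PATTERNS table and the group_feature_columns helper are illustrative names, not part of any CrowdCent tooling, and a real challenge may use different column names.

```python
import re
from collections import defaultdict

# Assumed naming patterns, taken from the examples in this glossary.
# Real challenges may use different names; check the data description.
FAMILY_PATTERNS = {
    "nlp": re.compile(r"^nlp_[A-Za-z0-9]+_\d+$"),
    "fundamental": re.compile(r"^fundamental_\d+$"),
    "market": re.compile(r"^market_\d+$"),
    "categorical": re.compile(r"^categorical_\d+$"),
}


def group_feature_columns(columns):
    """Return a dict mapping feature family -> list of matching column names."""
    groups = defaultdict(list)
    for name in columns:
        for family, pattern in FAMILY_PATTERNS.items():
            if pattern.match(name):
                groups[family].append(name)
                break
        else:
            # No pattern matched (e.g., id or target columns).
            groups["other"].append(name)
    return dict(groups)


columns = ["id", "nlp_catalyst_0", "nlp_catalyst_1", "fundamental_0",
           "market_0", "market_1", "categorical_0", "target_1M"]
groups = group_feature_columns(columns)
print(groups)
```

Columns that match none of the patterns, such as id and the targets, land in an "other" bucket so they are not silently treated as features.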

Fundamental Features

Features derived from fundamental data related to the investment idea, typically obfuscated and named like fundamental_N (where N is a number).

Identifier (id)

A unique numerical identifier assigned to each record in the dataset. The id is obfuscated and does not directly reveal the underlying security or analyst.

Inference Data

Periodic releases of new data for which you will make predictions. Each inference data period has:

  • A release date
  • A submission deadline
  • A Parquet file containing features (but not targets) for inference

Market Features

Features derived from market conditions at the time the investment idea was submitted, typically obfuscated and named like market_N (where N is a number).

NLP Features

Features generated by processing text data, named like nlp_groupname_N (e.g., nlp_catalyst_0). These contain numerical values representing elements of embedding vectors.

Parquet

The file format used for all datasets in the CrowdCent Challenge. Parquet is a columnar storage file format optimized for analytical processing.

Prediction Target

The values you aim to predict, typically returns over different time horizons. Common prediction columns include:

  • pred_1M: Predicted return after 1 month
  • pred_3M: Predicted return after 3 months
  • pred_6M: Predicted return after 6 months
  • pred_9M: Predicted return after 9 months
  • pred_12M: Predicted return after 12 months

Submission

Your predictions for an inference data period. Submissions must be in Parquet format with the following columns:

Column    Description
id        Unique identifier matching the inference data
pred_1M   Predicted 1-month return
pred_3M   Predicted 3-month return
pred_6M   Predicted 6-month return
pred_9M   Predicted 9-month return
pred_12M  Predicted 12-month return

All prediction columns must contain numeric values (float or integer). Missing values are not allowed.
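The rules above (required columns, numeric values, no missing values) can be checked locally before uploading. This is a minimal, dependency-free sketch that treats each submission row as a plain dict; in practice you would load the rows from your Parquet file with a dataframe library. REQUIRED_COLUMNS and validate_submission are hypothetical names, and the column list is copied from this glossary's example, so adjust it to your challenge.

```python
import math
from numbers import Real

# Assumed column names, copied from the columns listed above; they vary by challenge.
REQUIRED_COLUMNS = ["id", "pred_1M", "pred_3M", "pred_6M", "pred_9M", "pred_12M"]


def validate_submission(rows, required=REQUIRED_COLUMNS):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if col not in row:
                problems.append(f"row {i}: missing column {col!r}")
                continue
            if col == "id":
                continue  # only prediction columns are checked for numeric values
            value = row[col]
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"row {i}: {col} is not numeric ({value!r})")
            elif math.isnan(value):
                problems.append(f"row {i}: {col} is missing (NaN)")
    return problems


good = [{"id": 2001, "pred_1M": 0.06, "pred_3M": 0.14, "pred_6M": 0.19,
         "pred_9M": 0.23, "pred_12M": 0.27}]
# NaN in pred_1M, None in pred_3M, and pred_12M absent.
bad = [{"id": 2002, "pred_1M": float("nan"), "pred_3M": None,
        "pred_6M": -0.13, "pred_9M": -0.09}]

print(validate_submission(good))
for problem in validate_submission(bad):
    print(problem)
```

This only mirrors the rules stated in this entry; the platform may apply additional checks that are not described here.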

Target

The ground truth values in training data that your model aims to predict. Targets are typically labeled as target_1M, target_3M, etc., corresponding to returns over different time horizons.
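In the examples in this glossary, each target column has a prediction column with the same horizon suffix (target_1M and pred_1M, target_10d and pred_10d). A small sketch of that mapping follows; prediction_column_for is a hypothetical helper, and the prefix convention is an assumption that should be checked against the rules of your specific challenge.

```python
def prediction_column_for(target_column):
    """Map 'target_<horizon>' to 'pred_<horizon>' (assumed naming convention)."""
    prefix = "target_"
    if not target_column.startswith(prefix):
        raise ValueError(f"not a target column: {target_column!r}")
    return "pred_" + target_column[len(prefix):]


target_columns = ["target_1M", "target_3M", "target_6M", "target_9M", "target_12M"]
prediction_columns = [prediction_column_for(c) for c in target_columns]
print(prediction_columns)
```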

Training Dataset

Labeled data used for building your models. Each training dataset has:

  • A version (e.g., "1.0", "2.1")
  • Feature and target descriptions
  • A Parquet file containing features and ground truth labels

Training datasets may be updated with new versions over time, but only one version is marked as "latest" at any time.

Example Data Formats

The following examples illustrate what the data formats look like. The values are entirely fictional and intended only to demonstrate the format.

Example Training Data:

id    nlp_catalyst_0  nlp_catalyst_1  fundamental_0  market_0  market_1  categorical_0  target_1M  target_3M  target_6M  target_9M  target_12M
1001  0.123456        -0.987654       12.340000      105.50    250000    1              0.05       0.12       0.18       0.22       0.25
1002  -0.543210       0.010101        0.987000       32.15     1500000   3              -0.02      -0.08      -0.15      -0.10      -0.05
1003  0.678901        1.234567        -5.670000      210.75    750000    1              0.10       0.25       0.40       0.45       0.50

Example Inference Data (for prediction):

id    nlp_catalyst_0  nlp_catalyst_1  fundamental_0  market_0  market_1  categorical_0
2001  0.234567        -0.876543       15.430000      110.25    275000    2
2002  -0.432109       0.121212        1.876000       29.75     1300000   3

Example Submission (your predictions):

id    pred_1M  pred_3M  pred_6M  pred_9M  pred_12M
2001  0.06     0.14     0.19     0.23     0.27
2002  -0.01    -0.07    -0.13    -0.09    -0.03
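Because submission ids are meant to match the inference data, it can help to compare the two sets of ids before submitting. The sketch below uses the fictional ids from the examples above; check_ids is a hypothetical helper, and whether the platform rejects missing, extra, or duplicated ids is an assumption not stated in this glossary.

```python
from collections import Counter


def check_ids(inference_ids, submission_ids):
    """Report ids that are missing, unexpected, or duplicated in a submission."""
    counts = Counter(submission_ids)
    expected = set(inference_ids)
    submitted = set(counts)
    return {
        "missing": sorted(expected - submitted),
        "unexpected": sorted(submitted - expected),
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


inference_ids = [2001, 2002]
submission_ids = [2001, 2002]
report = check_ids(inference_ids, submission_ids)
print(report)

# A deliberately broken submission: 2002 is absent, 2003 is extra, 2001 repeats.
broken_report = check_ids(inference_ids, [2001, 2001, 2003])
print(broken_report)
```

An empty list under every key means the submission covers each inference id exactly once.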