Learning

This section walks you through setting up the consolidated master configuration file (config.yaml) for the machine learning pipeline. Previous versions used multiple JSON files; all parameters are now managed in a single YAML file, which is easier to maintain and lets you use YAML anchors to keep shared values (e.g., random seeds) consistent.

The configuration is divided into the following sections:

Study Metadata
Experiment Design Parameters
Variables Definition
Data Cleaning Parameters
Data Normalization Parameters
Feature Set Reduction Parameters
Machine Learning Parameters

Study Metadata

Defines high-level experiment identifiers and global variables.

study_metadata:
  var_study: "var1"
  combinations: ["var1"]
  seed: &global_seed 54288  # YAML anchor used to sync seeds across the pipeline
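
Once the file is loaded, every *global_seed alias is replaced by the plain value 54288, so a seed typed by hand elsewhere in the file would silently diverge from it. The sketch below shows one way to catch that. It assumes the YAML has already been parsed into a nested dict; the helper names find_seeds and mismatched_seeds are hypothetical and not part of the pipeline.

```python
def find_seeds(node, path=""):
    """Yield (dotted_path, value) for every key named 'seed' in a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "seed":
                yield child, value
            else:
                yield from find_seeds(value, child)


def mismatched_seeds(config):
    """Return the paths whose seed differs from study_metadata.seed."""
    expected = config["study_metadata"]["seed"]
    return [p for p, v in find_seeds(config) if v != expected]


# Hypothetical parsed config: aliases have already been replaced by values.
config = {
    "study_metadata": {"var_study": "var1", "seed": 54288},
    "design": {
        "CrossValidation": {"nFolds": 5, "seed": 54288},
        "Bootstrapping": {"nIterations": 1000, "seed": 1234},  # typed by hand
    },
}
print(mismatched_seeds(config))  # ['design.Bootstrapping.seed']
```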

Experiment Design Parameters

Defines the data splitting and validation strategy. You can define multiple profiles and select the active one with active_method.

design:
  active_method: "CrossValidation" # Options: "Random", "CrossValidation", "Bootstrapping"

  Random:
    method: "SubSampling"
    nSplits: 10
    stratifyInstitutions: 1
    testProportion: 0.33
    seed: *global_seed

  CrossValidation:
    method: "StratifiedKFold"
    nFolds: 5
    nRepeats: 10
    seed: *global_seed

  Bootstrapping:
    method: "Out-of-Bag"
    nIterations: 1000
    seed: *global_seed
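
As an illustration of how active_method picks one of the profiles, here is a minimal sketch that works on the parsed design section as a dict. The function name get_active_design is hypothetical; the pipeline's own lookup may differ.

```python
def get_active_design(design):
    """Return (name, profile) for the profile named by design['active_method']."""
    name = design["active_method"]
    # Every other key in the section is treated as a profile.
    profiles = {k: v for k, v in design.items() if k != "active_method"}
    if name not in profiles:
        raise KeyError(
            f"active_method {name!r} is not one of {sorted(profiles)}"
        )
    return name, profiles[name]


design = {
    "active_method": "CrossValidation",
    "Random": {"method": "SubSampling", "nSplits": 10, "testProportion": 0.33},
    "CrossValidation": {"method": "StratifiedKFold", "nFolds": 5, "nRepeats": 10},
    "Bootstrapping": {"method": "Out-of-Bag", "nIterations": 1000},
}

name, profile = get_active_design(design)
print(name, profile["method"], profile["nFolds"])
```

Failing early on an unknown profile name avoids running a long experiment with an unintended splitting strategy.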

Variables Definition

Defines the data sources and maps them to specific cleaning and reduction profiles defined later in the file.

variables:
  var1:
    nameType: "RadiomicsFull"
    path: "setToFeaturesinWorkspace"
    scans: ["CECT"]
    rois: ["tumor"]
    imSpaces: ["image"]
    cleaning_profile: "default"
    normalization: "combat"
    reduction_method: "FDA"
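
The cleaning_profile and reduction_method values are names that point at entries in the data_cleaning and feature_reduction sections. A minimal sketch of that lookup is shown below, assuming the file has been parsed into a dict; resolve_variable is a hypothetical helper. How the normalization value ("combat") maps onto the normalization section is not specified here, so the sketch leaves it untouched.

```python
def resolve_variable(config, var_name):
    """Collect a variable's settings together with the profiles it references."""
    var = config["variables"][var_name]
    return {
        "variable": var,
        "cleaning": config["data_cleaning"][var["cleaning_profile"]],
        "reduction": config["feature_reduction"][var["reduction_method"]],
    }


config = {
    "variables": {
        "var1": {
            "nameType": "RadiomicsFull",
            "cleaning_profile": "default",
            "normalization": "combat",
            "reduction_method": "FDA",
        }
    },
    "data_cleaning": {"default": {"continuous": {"imputation": "mean"}}},
    "feature_reduction": {"FDA": {"nSplits": 100, "corrType": "Spearman"}},
}

resolved = resolve_variable(config, "var1")
print(resolved["reduction"]["corrType"])  # Spearman
```

A misspelled profile name raises KeyError, which surfaces the typo before any data is processed.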

Data Cleaning Parameters

Defines how missing values and low-variance features are handled.

data_cleaning:
  default:
    continuous:
      missingCutoffps: 0.25 # Max fraction of missing features per sample
      covCutoff: 0.1        # Min coefficient of variation
      missingCutoffpf: 0.1  # Max fraction of missing samples per feature
      imputation: "mean"    # Options: "mean", "median", "random"
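
To make the effect of these thresholds concrete, the sketch below applies them to a small table in which None marks a missing value. It is an illustration only: the order of the steps, the use of the population standard deviation for the coefficient of variation, and the function name clean are assumptions, and only "mean" imputation is shown.

```python
from statistics import mean, pstdev


def clean(rows, params):
    """rows: list of samples, each a list of feature values (None = missing).

    Returns (cleaned_rows, kept_feature_indices).
    """
    n_feat = len(rows[0])
    # 1. Drop samples with too many missing features.
    rows = [r for r in rows
            if sum(v is None for v in r) / n_feat <= params["missingCutoffps"]]
    kept, fill = [], {}
    for j in range(n_feat):
        col = [r[j] for r in rows]
        present = [v for v in col if v is not None]
        # 2. Drop features that are missing in too many samples.
        if 1 - len(present) / len(col) > params["missingCutoffpf"]:
            continue
        # 3. Drop near-constant features (coefficient of variation = std / |mean|).
        m, sd = mean(present), pstdev(present)
        cov = sd / abs(m) if m else (float("inf") if sd else 0.0)
        if cov < params["covCutoff"]:
            continue
        kept.append(j)
        fill[j] = m  # value used for mean imputation
    # 4. Mean-impute the remaining gaps in the kept features.
    cleaned = [[fill[j] if r[j] is None else r[j] for j in kept] for r in rows]
    return cleaned, kept


params = {"missingCutoffps": 0.25, "covCutoff": 0.1, "missingCutoffpf": 0.1}
rows = [
    [1.0, 10.0, 5.0, 0.2],
    [2.0, 10.1, None, 0.4],    # 1 of 4 missing: kept
    [3.0, 9.9, 7.0, 0.6],
    [None, None, None, 0.8],   # 3 of 4 missing: dropped
]
cleaned, kept = clean(rows, params)
print(kept)     # [0, 3]
print(cleaned)  # [[1.0, 0.2], [2.0, 0.4], [3.0, 0.6]]
```

In this example feature 1 is removed for low variation and feature 2 for exceeding the per-feature missing cutoff, so no imputation is needed for the features that remain.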

Data Normalization Parameters

Defines the normalization settings, which aim to remove batch effects (e.g., differences between centers in multicenter data).

normalization:
  standardCombat: "RUN"
  standardization:
    perClass: 0
    perInstitution: 0
  minmax:
    min: 0
    max: 1
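
For reference, min-max scaling maps each feature linearly onto a target range. The sketch below assumes that the min and max keys give that target range (0 to 1 in the example above); the function name minmax_scale is hypothetical.

```python
def minmax_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


print(minmax_scale([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```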

Feature Set Reduction Parameters

Parameters for reducing high-dimensional feature sets (like Radiomics) to a stable subset.

feature_reduction:
  FDA:
    nSplits: 100
    corrType: "Spearman" # Options: "Spearman", "Pearson"
    threshStableStart: 0.5
    threshInterCorr: 0.7
    minNfeatStable: 100
    minNfeat: 10
    seed: *global_seed
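
One way to picture the role of threshInterCorr is a greedy filter that walks through the features in rank order and skips any feature too strongly correlated with one already kept. The sketch below is not the pipeline's FDA implementation; it is a simplified illustration under that assumption, it uses Pearson correlation only, and the names pearson and drop_intercorrelated are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def drop_intercorrelated(features, ranked_names, thresh):
    """Keep features in rank order, skipping those too correlated with a kept one."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(features[name], features[k])) <= thresh for k in kept):
            kept.append(name)
    return kept


features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # nearly a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
}
print(drop_intercorrelated(features, ["a", "b", "c"], thresh=0.7))  # ['a', 'c']
```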

Machine Learning Parameters

Defines the algorithm and hyperparameter optimization settings.

modeling:
  method: "firth" # Options: "firth", "rf", "xgboost"
  optimization_metric: "MCC"
  cv_folds: 5
  var_importance_threshold: 0.05
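
The optimization metric in this example, MCC (Matthews correlation coefficient), is computed from the four cells of the binary confusion matrix. The short sketch below shows the standard formula; the function name mcc and the convention of returning 0.0 when the denominator is zero are choices made for this illustration.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case, reported as 0 by convention here
    return (tp * tn - fp * fn) / denom


# Perfect classifier on 5 positives and 95 negatives.
print(mcc(tp=5, tn=95, fp=0, fn=0))  # 1.0
# Classifier that always predicts "negative": 95% accuracy, but MCC is 0.
print(mcc(tp=0, tn=95, fp=0, fn=5))  # 0.0
```

The second call shows why MCC is a reasonable choice for imbalanced data: a model that ignores the minority class scores well on accuracy but not on MCC.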

Note

For rare-event studies (e.g., only 5 positive cases), we strongly recommend the firth method or a Random Forest with class_weight='balanced'.
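
To show what balanced weighting does in such a study, the sketch below computes per-class weights with the formula scikit-learn documents for class_weight='balanced', n_samples / (n_classes * class_count). The function name balanced_class_weights is hypothetical, and the 5-versus-95 split is an illustrative example rather than data from a real study.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


labels = [1] * 5 + [0] * 95  # 5 positive cases out of 100
weights = balanced_class_weights(labels)
print(weights[1])  # 10.0
print(round(weights[0], 3))
```

Each positive case counts roughly 19 times as much as a negative one, which keeps the model from ignoring the rare class.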

Full Configuration Example

Below is a complete example of a config.yaml file incorporating all the sections discussed above. You can copy this into your project workspace to get started.

# ==============================================================================
# Master Configuration for Machine Learning Pipeline
# ==============================================================================

study_metadata:
  var_study: "var1"
  combinations: ["var1"]
  seed: &global_seed 54288

design:
  active_method: "CrossValidation"

  Random:
    method: "SubSampling"
    nSplits: 10
    stratifyInstitutions: 1
    testProportion: 0.33
    seed: *global_seed

  CrossValidation:
    method: "StratifiedKFold"
    nFolds: 5
    nRepeats: 10
    seed: *global_seed

  Bootstrapping:
    method: "Out-of-Bag"
    nIterations: 1000
    seed: *global_seed

variables:
  var1:
    nameType: "RadiomicsFull"
    path: "path/to/features/workspace"
    scans: ["CECT"]
    rois: ["tumor"]
    imSpaces: ["image"]
    cleaning_profile: "default"
    normalization: "combat"
    reduction_method: "FDA"

data_cleaning:
  default:
    continuous:
      missingCutoffps: 0.25
      covCutoff: 0.1
      missingCutoffpf: 0.1
      imputation: "mean"

normalization:
  standardCombat: "RUN"
  standardization:
    perClass: 0
    perInstitution: 0
  minmax:
    min: 0
    max: 1

feature_reduction:
  FDA:
    nSplits: 100
    corrType: "Spearman"
    threshStableStart: 0.5
    threshInterCorr: 0.7
    minNfeatStable: 100
    minNfeat: 10
    seed: *global_seed

modeling:
  method: "firth"
  optimization_metric: "MCC"
  cv_folds: 5
  var_importance_threshold: 0.05
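
Before launching a run, it can help to confirm that the loaded file contains every top-level section described above. The sketch below assumes the file has been parsed into a dict and that all seven sections are required, which this guide does not state explicitly; missing_sections is a hypothetical helper.

```python
REQUIRED_SECTIONS = (
    "study_metadata",
    "design",
    "variables",
    "data_cleaning",
    "normalization",
    "feature_reduction",
    "modeling",
)


def missing_sections(config):
    """Return the top-level sections that are absent from the parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Hypothetical partially written config.
partial = {"study_metadata": {"seed": 54288}, "design": {}, "modeling": {}}
print(missing_sections(partial))
```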