Learning
This section walks you through setting up the consolidated master configuration file (config.yaml) for the machine learning pipeline.
Previous versions used multiple JSON files; all parameters are now managed in a single YAML file,
which simplifies maintenance and lets YAML anchors keep shared values (e.g., the random seed) consistent across sections.
The configuration is divided into the following sections:
Study Metadata
Defines high-level experiment identifiers and global variables.
study_metadata:
  var_study: "var1"
  combinations: ["var1"]
  seed: &global_seed 54288 # YAML anchor used to sync seeds across the pipeline
Experiment Design Parameters
Defines the data splitting and validation strategy. You can define multiple profiles and select the active one with active_method.
design:
  active_method: "CrossValidation" # Options: "Random", "CrossValidation", "Bootstrapping"
  Random:
    method: "SubSampling"
    nSplits: 10
    stratifyInstitutions: 1
    testProportion: 0.33
    seed: *global_seed
  CrossValidation:
    method: "StratifiedKFold"
    nFolds: 5
    nRepeats: 10
    seed: *global_seed
  Bootstrapping:
    method: "Out-of-Bag"
    nIterations: 1000
    seed: *global_seed
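As a minimal sketch of how the active profile could be resolved once the file has been parsed into a dictionary, the helper below looks up the profile named by active_method. The function name `get_active_design` is hypothetical and not part of the pipeline; it also assumes that every key under design other than active_method is a profile name.

```python
# Hypothetical helper for illustration; not part of the pipeline's documented API.
def get_active_design(config: dict) -> dict:
    """Return the design profile named by design.active_method."""
    design = config["design"]
    active = design["active_method"]
    # Assumption: every key other than "active_method" is a profile name.
    profiles = {k: v for k, v in design.items() if k != "active_method"}
    if active not in profiles:
        raise KeyError(
            f"active_method {active!r} has no matching profile; "
            f"available: {sorted(profiles)}"
        )
    return profiles[active]


# Dictionary mirroring the parsed `design` section (aliases already resolved).
config = {
    "design": {
        "active_method": "CrossValidation",
        "Random": {"method": "SubSampling", "nSplits": 10,
                   "stratifyInstitutions": 1, "testProportion": 0.33,
                   "seed": 54288},
        "CrossValidation": {"method": "StratifiedKFold", "nFolds": 5,
                            "nRepeats": 10, "seed": 54288},
        "Bootstrapping": {"method": "Out-of-Bag", "nIterations": 1000,
                          "seed": 54288},
    }
}

profile = get_active_design(config)
print(profile["method"], profile["nFolds"])  # StratifiedKFold 5
```

Failing early on an unknown active_method catches typos such as "Crossvalidation" before any data is split.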
Variables Definition
Defines the data sources and maps them to specific cleaning and reduction profiles defined later in the file.
variables:
  var1:
    nameType: "RadiomicsFull"
    path: "setToFeaturesinWorkspace"
    scans: ["CECT"]
    rois: ["tumor"]
    imSpaces: ["image"]
    cleaning_profile: "default"
    normalization: "combat"
    reduction_method: "FDA"
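Because each variable refers to profiles by name, a misspelled name would only surface later in the run. The sketch below shows one way to check these references up front on the parsed configuration. The function `check_variable_references` is a hypothetical name, and the check assumes that cleaning_profile must match a key under data_cleaning and reduction_method a key under feature_reduction. The normalization field is left unchecked because its mapping to the normalization section is not specified here.

```python
# Hypothetical validation helper; the lookup rules are assumptions.
def check_variable_references(config: dict) -> list:
    """Return a list of problems; the list is empty if all references resolve."""
    problems = []
    cleaning = config.get("data_cleaning", {})
    reduction = config.get("feature_reduction", {})
    for name, var in config.get("variables", {}).items():
        profile = var.get("cleaning_profile")
        if profile not in cleaning:
            problems.append(f"{name}: unknown cleaning_profile {profile!r}")
        method = var.get("reduction_method")
        if method not in reduction:
            problems.append(f"{name}: unknown reduction_method {method!r}")
    return problems


# Trimmed dictionary mirroring the relevant parts of the parsed file.
config = {
    "variables": {
        "var1": {"cleaning_profile": "default", "reduction_method": "FDA"},
    },
    "data_cleaning": {"default": {"continuous": {}}},
    "feature_reduction": {"FDA": {}},
}

print(check_variable_references(config))  # []
```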
Data Cleaning Parameters
Defines how missing values and low-variance features are handled.
data_cleaning:
  default:
    continuous:
      missingCutoffps: 0.25 # Max fraction of missing features per sample
      covCutoff: 0.1 # Min coefficient of variation
      missingCutoffpf: 0.1 # Max fraction of missing samples per feature
      imputation: "mean" # Options: "mean", "median", "random"
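To make the three cutoffs concrete, here is a rough sketch of how they might be applied to a small table, with None marking a missing value. The function `clean_continuous`, the order of the steps, the use of the population standard deviation, and the fact that only "mean" imputation is shown are all assumptions made for illustration; the pipeline's actual cleaning routine may differ.

```python
from statistics import mean, pstdev


# Hypothetical sketch; step order and CV definition are assumptions.
def clean_continuous(rows, params):
    """rows: list of samples, each a list of feature values (None = missing).

    Returns (cleaned_rows, kept_feature_indices).
    """
    if not rows:
        return [], []
    n_feat = len(rows[0])
    # 1. Drop samples with too many missing features.
    rows = [r for r in rows
            if sum(v is None for v in r) / n_feat <= params["missingCutoffps"]]
    if not rows:
        return [], []
    kept = []  # (feature index, feature mean) pairs
    for j in range(n_feat):
        present = [r[j] for r in rows if r[j] is not None]
        if not present:
            continue
        # 2. Drop features missing in too many of the remaining samples.
        if (len(rows) - len(present)) / len(rows) > params["missingCutoffpf"]:
            continue
        # 3. Drop features whose coefficient of variation is too low.
        m = mean(present)
        cv = pstdev(present) / abs(m) if m != 0 else float("inf")
        if cv < params["covCutoff"]:
            continue
        kept.append((j, m))
    # 4. Impute the remaining gaps with the feature mean.
    cleaned = [[r[j] if r[j] is not None else m for j, m in kept] for r in rows]
    return cleaned, [j for j, _ in kept]


params = {"missingCutoffps": 0.25, "covCutoff": 0.1, "missingCutoffpf": 0.1}
rows = [
    [1.0, 10.0, 5.0, 0.5, 2.0],
    [2.0, 10.1, 6.0, 0.7, None],
    [3.0, 9.9, 7.0, 0.9, 4.0],
    [4.0, 10.0, 8.0, 1.1, 5.0],
    [None, None, None, 1.3, None],
]
cleaned, kept = clean_continuous(rows, params)
print(len(cleaned), kept)  # 4 [0, 2, 3]
```

In this toy table the last sample is removed (80% of its features are missing), feature 4 is removed (missing in 25% of the remaining samples), and feature 1 is removed because it barely varies.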
Data Normalization Parameters
Configures normalization, which aims to remove batch effects (e.g., differences between centers in multicenter studies).
normalization:
  standardCombat: "RUN"
  standardization:
    perClass: 0
    perInstitution: 0
  minmax:
    min: 0
    max: 1
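As an illustration of the minmax block, the sketch below rescales one feature to a target range. It assumes that minmax.min and minmax.max are the bounds of the output range, which is not stated explicitly here, and the function name `minmax_scale` is hypothetical.

```python
# Hypothetical sketch; assumes min/max in the config are the target range bounds.
def minmax_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so map everything to new_min.
        return [float(new_min) for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


print(minmax_scale([2.0, 4.0, 6.0], 0, 1))  # [0.0, 0.5, 1.0]
```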
Feature Set Reduction Parameters
Defines how high-dimensional feature sets (such as radiomics features) are reduced to a stable subset.
feature_reduction:
  FDA:
    nSplits: 100
    corrType: "Spearman" # Options: "Spearman", "Pearson"
    threshStableStart: 0.5
    threshInterCorr: 0.7
    minNfeatStable: 100
    minNfeat: 10
    seed: *global_seed
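The sketch below illustrates only the role that corrType and threshInterCorr might play: a greedy filter that keeps a feature if its absolute Spearman correlation with every already-kept feature is below the threshold. This is an assumption about how inter-correlation pruning could work, not the pipeline's actual FDA implementation. The stability analysis over nSplits is not shown, the function names are hypothetical, and the features are assumed to arrive ordered from most to least preferred.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


# Hypothetical pruning step; assumes features are ordered by preference.
def prune_intercorrelated(features, thresh):
    """features: dict mapping feature name -> list of values across samples."""
    kept = []
    for name, col in features.items():
        if all(abs(spearman(col, features[k])) < thresh for k in kept):
            kept.append(name)
    return kept


features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],  # monotonic in f1, so Spearman correlation is 1
    "f3": [5, 1, 4, 2, 3],
}
print(prune_intercorrelated(features, thresh=0.7))  # ['f1', 'f3']
```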
Machine Learning Parameters
Defines the algorithm and hyperparameter optimization settings.
modeling:
  method: "firth" # Options: "firth", "rf", "xgboost"
  optimization_metric: "MCC"
  cv_folds: 5
  var_importance_threshold: 0.05
Note
For rare-event studies (e.g., only 5 positive cases), it is highly recommended to use the firth method or a Random Forest with class_weight='balanced'.
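To show why a metric such as MCC suits rare-event studies, the sketch below computes it from confusion-matrix counts, assuming that "MCC" here refers to the Matthews correlation coefficient. The function name `mcc` and the convention of returning 0.0 when the denominator is zero are choices made for this example.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# 5 positive cases out of 100; the classifier predicts "negative" for everyone.
accuracy = (0 + 95) / 100
print(accuracy, mcc(tp=0, tn=95, fp=0, fn=5))  # 0.95 0.0

# A classifier that finds 4 of the 5 positives at the cost of 5 false alarms.
print(round(mcc(tp=4, tn=90, fp=5, fn=1), 2))  # 0.57
```

The first case reaches 95% accuracy without detecting a single positive, while its MCC is 0, which is why accuracy alone can be misleading when events are rare.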
Full Configuration Example
Below is a complete example of a config.yaml file incorporating all the sections discussed above. You can copy this into your project workspace to get started.
# ==============================================================================
# Master Configuration for Machine Learning Pipeline
# ==============================================================================
study_metadata:
  var_study: "var1"
  combinations: ["var1"]
  seed: &global_seed 54288

design:
  active_method: "CrossValidation"
  Random:
    method: "SubSampling"
    nSplits: 10
    stratifyInstitutions: 1
    testProportion: 0.33
    seed: *global_seed
  CrossValidation:
    method: "StratifiedKFold"
    nFolds: 5
    nRepeats: 10
    seed: *global_seed
  Bootstrapping:
    method: "Out-of-Bag"
    nIterations: 1000
    seed: *global_seed

variables:
  var1:
    nameType: "RadiomicsFull"
    path: "path/to/features/workspace"
    scans: ["CECT"]
    rois: ["tumor"]
    imSpaces: ["image"]
    cleaning_profile: "default"
    normalization: "combat"
    reduction_method: "FDA"

data_cleaning:
  default:
    continuous:
      missingCutoffps: 0.25
      covCutoff: 0.1
      missingCutoffpf: 0.1
      imputation: "mean"

normalization:
  standardCombat: "RUN"
  standardization:
    perClass: 0
    perInstitution: 0
  minmax:
    min: 0
    max: 1

feature_reduction:
  FDA:
    nSplits: 100
    corrType: "Spearman"
    threshStableStart: 0.5
    threshInterCorr: 0.7
    minNfeatStable: 100
    minNfeat: 10
    seed: *global_seed

modeling:
  method: "firth"
  optimization_metric: "MCC"
  cv_folds: 5
  var_importance_threshold: 0.05