Learning
--------

This section walks you through setting up the consolidated master configuration file (``config.yaml``) for the machine learning pipeline.
Previous versions used multiple JSON files; all parameters are now managed in a single YAML structure,
which is easier to maintain and allows YAML anchors to keep shared values (e.g., seeds) consistent.

The configuration is divided into the following sections:

* :ref:`Study Metadata`
* :ref:`Experiment Design Parameters`
* :ref:`Variables Definition`
* :ref:`Data Cleaning Parameters`
* :ref:`Data Normalization Parameters`
* :ref:`Feature Set Reduction Parameters`
* :ref:`Machine Learning Parameters`

Study Metadata
^^^^^^^^^^^^^^

Defines high-level experiment identifiers and global variables.

.. code-block:: yaml

   study_metadata:
     var_study: "var1"
     combinations: ["var1"]
     seed: &global_seed 54288  # YAML anchor used to sync seeds across the pipeline
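
A YAML parser resolves the ``&global_seed`` anchor and every ``*global_seed`` alias to the same integer. The sketch below shows one way to confirm this after loading. It assumes the file has already been parsed into a plain Python dictionary (for example with PyYAML's ``yaml.safe_load``); the helper ``collect_seeds`` and the trimmed ``config`` excerpt are hypothetical and not part of the pipeline.

.. code-block:: python

   def collect_seeds(node, path=""):
       """Return {dotted.path: value} for every key named 'seed' in a nested dict."""
       found = {}
       if isinstance(node, dict):
           for key, value in node.items():
               child_path = f"{path}.{key}" if path else key
               if key == "seed":
                   found[child_path] = value
               else:
                   found.update(collect_seeds(value, child_path))
       return found


   # Hypothetical excerpt of the parsed configuration (aliases already resolved).
   config = {
       "study_metadata": {"seed": 54288},
       "design": {"CrossValidation": {"nFolds": 5, "seed": 54288}},
       "feature_reduction": {"FDA": {"nSplits": 100, "seed": 54288}},
   }

   seeds = collect_seeds(config)
   # True when every seed in the file carries the same value.
   all_synced = len(set(seeds.values())) == 1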

Experiment Design Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Defines the data splitting and validation strategy. You can define multiple profiles and select the active one with ``active_method``.

.. code-block:: yaml

   design:
     active_method: "CrossValidation" # Options: "Random", "CrossValidation", "Bootstrapping"
     
     Random:
       method: "SubSampling"
       nSplits: 10
       stratifyInstitutions: 1
       testProportion: 0.33
       seed: *global_seed

     CrossValidation:
       method: "StratifiedKFold"
       nFolds: 5
       nRepeats: 10
       seed: *global_seed

     Bootstrapping:
       method: "Out-of-Bag"
       nIterations: 1000
       seed: *global_seed
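
To illustrate how ``active_method`` selects a profile, here is a minimal sketch that operates on the parsed ``design`` dictionary. The function ``get_active_design`` is a hypothetical helper, and reading ``nFolds * nRepeats`` as the number of train/test splits is an assumption based on the usual meaning of repeated k-fold cross-validation.

.. code-block:: python

   def get_active_design(design):
       """Return the profile named by ``active_method``; raise if it is undefined."""
       name = design["active_method"]
       profile = design.get(name)
       if not isinstance(profile, dict):
           raise KeyError(f"active_method {name!r} has no matching profile")
       return profile


   # Trimmed copy of the ``design`` section after parsing.
   design = {
       "active_method": "CrossValidation",
       "Random": {"method": "SubSampling", "nSplits": 10, "seed": 54288},
       "CrossValidation": {
           "method": "StratifiedKFold",
           "nFolds": 5,
           "nRepeats": 10,
           "seed": 54288,
       },
   }

   active = get_active_design(design)
   # Assumption: repeated k-fold yields nFolds * nRepeats train/test splits.
   n_splits_total = active["nFolds"] * active["nRepeats"]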

Variables Definition
^^^^^^^^^^^^^^^^^^^^

Defines the data sources and maps each one to a cleaning profile, a normalization method, and a feature reduction method configured later in the file.

.. code-block:: yaml

   variables:
     var1:
       nameType: "RadiomicsFull"
       path: "setToFeaturesinWorkspace"
       scans: ["CECT"]
       rois: ["tumor"]
       imSpaces: ["image"]
       cleaning_profile: "default"
       normalization: "combat"
       reduction_method: "FDA"
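
The ``cleaning_profile`` and ``reduction_method`` values are names that point into the ``data_cleaning`` and ``feature_reduction`` sections. A sketch of that lookup on the parsed dictionary follows; ``resolve_variable`` is a hypothetical helper, and how the pipeline resolves the ``normalization`` value is not covered here.

.. code-block:: python

   def resolve_variable(config, name):
       """Look up the cleaning and reduction settings a variable refers to."""
       var = config["variables"][name]
       return {
           "cleaning": config["data_cleaning"][var["cleaning_profile"]],
           "reduction": config["feature_reduction"][var["reduction_method"]],
       }


   # Trimmed copy of the parsed configuration.
   config = {
       "variables": {
           "var1": {"cleaning_profile": "default", "reduction_method": "FDA"},
       },
       "data_cleaning": {"default": {"continuous": {"imputation": "mean"}}},
       "feature_reduction": {"FDA": {"corrType": "Spearman", "minNfeat": 10}},
   }

   resolved = resolve_variable(config, "var1")

A ``KeyError`` from this lookup is a quick signal that a variable names a profile that was never defined.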

Data Cleaning Parameters
^^^^^^^^^^^^^^^^^^^^^^^^

Defines how missing values and low-variance features are handled.

.. code-block:: yaml

   data_cleaning:
     default:
       continuous:
         missingCutoffps: 0.25 # Max fraction of missing features per sample
         covCutoff: 0.1        # Min coefficient of variation
         missingCutoffpf: 0.1  # Max fraction of missing samples per feature
         imputation: "mean"    # Options: "mean", "median", "random"
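
The following sketch shows how cutoffs of this kind could be applied to a small table. It is an illustration under stated assumptions, not the pipeline's implementation: the order of the steps, the use of the population standard deviation for the coefficient of variation, and the relaxed threshold values chosen for the tiny example are all hypothetical.

.. code-block:: python

   from statistics import mean, pstdev


   def clean(rows, params):
       """rows: list of samples, each a list of floats or None (missing value)."""
       n_feat = len(rows[0])
       # 1. Drop samples whose fraction of missing features exceeds the cutoff.
       rows = [
           r for r in rows
           if sum(v is None for v in r) / n_feat <= params["missingCutoffps"]
       ]
       if not rows:
           return [], []
       # 2. Keep features that pass the missing-value and variation cutoffs.
       keep, means = [], {}
       for j in range(n_feat):
           present = [r[j] for r in rows if r[j] is not None]
           if (len(rows) - len(present)) / len(rows) > params["missingCutoffpf"]:
               continue
           mu = mean(present)
           cov = pstdev(present) / abs(mu) if mu != 0 else float("inf")
           if cov < params["covCutoff"]:
               continue
           keep.append(j)
           means[j] = mu
       # 3. Mean-impute the remaining missing values.
       cleaned = [
           [r[j] if r[j] is not None else means[j] for j in keep] for r in rows
       ]
       return cleaned, keep


   # Relaxed, hypothetical thresholds so that the small example exercises each step.
   params = {"missingCutoffps": 0.5, "missingCutoffpf": 0.3, "covCutoff": 0.1}
   rows = [
       [1.0, 5.0, 10.0],
       [2.0, 5.0, None],
       [3.0, 5.0, 30.0],
       [None, None, 40.0],  # two of three features missing: sample is dropped
       [4.0, 5.0, 20.0],
   ]
   cleaned, kept_features = clean(rows, params)

In this example the constant second feature is removed by the variation cutoff, and the single missing value in the third feature is replaced by that feature's mean.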

Data Normalization Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Defines normalization settings intended to remove batch effects (e.g., differences between centers in a multicenter study).

.. code-block:: yaml

   normalization:
     standardCombat: "RUN"
     standardization:
       perClass: 0
       perInstitution: 0
     minmax:
       min: 0
       max: 1
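
The ``minmax`` entry names a target range. As a minimal sketch, the standard min-max rescaling of one feature to that range looks like this; ``minmax_scale`` is a hypothetical helper, and whether the pipeline scales per feature in exactly this way is an assumption.

.. code-block:: python

   def minmax_scale(values, new_min=0.0, new_max=1.0):
       """Linearly rescale ``values`` so they span [new_min, new_max]."""
       lo, hi = min(values), max(values)
       if hi == lo:
           # A constant feature carries no spread; map it to the lower bound.
           return [float(new_min) for _ in values]
       return [
           new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values
       ]


   scaled = minmax_scale([10.0, 20.0, 30.0], new_min=0, new_max=1)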

Feature Set Reduction Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Defines parameters for reducing high-dimensional feature sets (such as radiomics features) to a stable subset.

.. code-block:: yaml

   feature_reduction:
     FDA:
       nSplits: 100
       corrType: "Spearman" # Options: "Spearman", "Pearson"
       threshStableStart: 0.5
       threshInterCorr: 0.7
       minNfeatStable: 100
       minNfeat: 10
       seed: *global_seed
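
A threshold such as ``threshInterCorr`` is commonly used to discard features that are strongly correlated with one already selected. The sketch below shows a greedy version of that idea. It is not the FDA procedure itself: the helper names are hypothetical, features are assumed to arrive ordered from most to least preferred, and Pearson correlation is used for brevity instead of the ``Spearman`` setting shown above.

.. code-block:: python

   from math import sqrt


   def pearson(x, y):
       """Pearson correlation of two equal-length, non-constant sequences."""
       n = len(x)
       mx, my = sum(x) / n, sum(y) / n
       cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
       vx = sum((a - mx) ** 2 for a in x)
       vy = sum((b - my) ** 2 for b in y)
       return cov / sqrt(vx * vy)


   def prune_correlated(features, thresh):
       """Keep a feature only if |r| with every kept feature is below ``thresh``."""
       kept = []
       for name, values in features.items():
           if all(abs(pearson(values, features[k])) < thresh for k in kept):
               kept.append(name)
       return kept


   # Hypothetical feature table: name -> values across five samples.
   features = {
       "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
       "f2": [2.0, 4.1, 6.0, 8.2, 10.0],  # nearly a multiple of f1
       "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
   }
   kept = prune_correlated(features, thresh=0.7)

Here ``f2`` is dropped because it is almost perfectly correlated with ``f1``, while ``f3`` is retained.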

Machine Learning Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Defines the algorithm and hyperparameter optimization settings.

.. code-block:: yaml

   modeling:
     method: "firth" # Options: "firth", "rf", "xgboost"
     optimization_metric: "MCC"
     cv_folds: 5
     var_importance_threshold: 0.05

.. note::
   For rare-event studies (e.g., only 5 positive cases), the ``firth`` method or a Random Forest (``rf``) with ``class_weight='balanced'`` is recommended.
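
The ``MCC`` metric (Matthews correlation coefficient) is well suited to imbalanced data because it uses all four cells of the confusion matrix. A minimal sketch of the standard formula follows; the function name and the example counts are hypothetical, and returning 0 when the denominator is zero is a common convention, not a documented pipeline behavior.

.. code-block:: python

   from math import sqrt


   def mcc(tp, tn, fp, fn):
       """Matthews correlation coefficient from confusion-matrix counts."""
       denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
       # Convention: report 0 when a row or column of the matrix is empty.
       return (tp * tn - fp * fn) / denom if denom else 0.0


   # Hypothetical rare-event study: 5 positives among 100 cases.
   always_negative = mcc(tp=0, tn=95, fp=0, fn=5)  # accuracy 0.95, MCC 0.0
   some_skill = mcc(tp=3, tn=91, fp=4, fn=2)       # accuracy 0.94, MCC > 0

A classifier that always predicts the majority class reaches 95% accuracy in this example but an MCC of 0, which is why MCC is a more informative optimization target here.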

Full Configuration Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

Below is a complete example of a ``config.yaml`` file incorporating all the sections discussed above. You can copy this into your project workspace to get started.

.. code-block:: yaml

   # ==============================================================================
   # Master Configuration for Machine Learning Pipeline
   # ==============================================================================

   study_metadata:
     var_study: "var1"
     combinations: ["var1"]
     seed: &global_seed 54288

   design:
     active_method: "CrossValidation"
     
     Random:
       method: "SubSampling"
       nSplits: 10
       stratifyInstitutions: 1
       testProportion: 0.33
       seed: *global_seed

     CrossValidation:
       method: "StratifiedKFold"
       nFolds: 5
       nRepeats: 10
       seed: *global_seed

     Bootstrapping:
       method: "Out-of-Bag"
       nIterations: 1000
       seed: *global_seed

   variables:
     var1:
       nameType: "RadiomicsFull"
       path: "path/to/features/workspace"
       scans: ["CECT"]
       rois: ["tumor"]
       imSpaces: ["image"]
       cleaning_profile: "default"
       normalization: "combat"
       reduction_method: "FDA"

   data_cleaning:
     default:
       continuous:
         missingCutoffps: 0.25
         covCutoff: 0.1
         missingCutoffpf: 0.1
         imputation: "mean"

   normalization:
     standardCombat: "RUN"
     standardization:
       perClass: 0
       perInstitution: 0
     minmax:
       min: 0
       max: 1

   feature_reduction:
     FDA:
       nSplits: 100
       corrType: "Spearman"
       threshStableStart: 0.5
       threshInterCorr: 0.7
       minNfeatStable: 100
       minNfeat: 10
       seed: *global_seed

   modeling:
     method: "firth"
     optimization_metric: "MCC"
     cv_folds: 5
     var_importance_threshold: 0.05
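
Before running the pipeline, it can help to check a parsed configuration for obvious gaps. The sketch below is one possible check, assuming the file has been loaded into a dictionary; ``check_config`` is a hypothetical helper, and treating all seven sections as required is an assumption based on the example above.

.. code-block:: python

   REQUIRED_SECTIONS = (
       "study_metadata",
       "design",
       "variables",
       "data_cleaning",
       "normalization",
       "feature_reduction",
       "modeling",
   )


   def check_config(config):
       """Return a list of human-readable problems (empty if none were found)."""
       problems = [
           f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in config
       ]
       variables = config.get("variables", {})
       combinations = config.get("study_metadata", {}).get("combinations", [])
       for name in combinations:
           if name not in variables:
               problems.append(
                   f"combination {name!r} is not defined under variables"
               )
       return problems


   # Hypothetical, deliberately incomplete configuration.
   broken = {
       "study_metadata": {"var_study": "var1", "combinations": ["var1", "var2"]},
       "design": {},
       "variables": {"var1": {}},
       "data_cleaning": {},
       "normalization": {},
       "feature_reduction": {},
   }
   problems = check_config(broken)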