Learning

DataCleaner

class MEDiml.learning.DataCleaner.DataCleaner(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]

Bases: BaseEstimator, TransformerMixin

A scikit-learn compatible transformer that cleans features by removing those with too many missing values or too little variation, removes samples with too many missing features, and imputes missing values.

__init__(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]

Initializes the DataCleaner with specified parameters for feature and sample filtering and imputation.

Parameters:
  • var_type (str) – Type of variable (“continuous”, “hcategorical”, “icategorical”).

  • imputation (str) – Method of imputation (“mean”, “median”, “mode”, “random”).

  • missingCutoffpf (float) – Maximum proportion of missing values allowed per feature (column).

  • missingCutoffps (float) – Maximum proportion of missing values allowed per sample (row).

  • covCutoff (float) – Minimum coefficient of variation allowed per feature.

  • random_state (int, RandomState instance or None) – Seed for reproducibility.

Returns:

None

_apply_imputation(X)[source]

Helper to apply the imputation.

_fit_imputer(X)[source]

Helper to initialize and fit the correct imputer logic.

_validate_input(X)[source]

Ensures X is a DataFrame and handles infinite values.

fit(X: DataFrame, y: DataFrame | None = None)[source]

Learns which features to keep based on missingness and variation thresholds.

Parameters:
  • X (pd.DataFrame) – Input feature data.

  • y (pd.DataFrame, optional) – Ignored, present for API consistency by convention.

Returns:

Returns self.

Return type:

DataCleaner

transform(X: DataFrame)[source]

Applies feature selection, sample filtering, and imputation.
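The three cleaning steps described above (feature filtering, sample filtering, imputation) can be sketched in plain Python. This is an illustrative sketch only, not the DataCleaner implementation: the function name clean_table, its keyword arguments, the list-of-dicts input and the exact order of operations are assumptions made for the example, and only mean imputation is shown.

```python
import math
from statistics import mean, pstdev


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_table(rows, max_missing_per_feature=0.1,
                max_missing_per_sample=0.25, min_cov=0.1):
    """Hypothetical helper: rows is a list of dicts (feature name -> value)."""
    features = list(rows[0].keys())

    # Step 1: keep features with few missing values and enough variation.
    kept = []
    for name in features:
        values = [row[name] for row in rows]
        observed = [v for v in values if not is_missing(v)]
        missing_fraction = (len(values) - len(observed)) / len(values)
        if missing_fraction > max_missing_per_feature or len(observed) < 2:
            continue
        mu = mean(observed)
        # Coefficient of variation: standard deviation divided by the mean.
        cov = abs(pstdev(observed) / mu) if mu != 0 else math.inf
        if cov >= min_cov:
            kept.append(name)

    # Step 2: drop samples missing too many of the kept features.
    cleaned = []
    for row in rows:
        n_missing = sum(is_missing(row[name]) for name in kept)
        if kept and n_missing / len(kept) > max_missing_per_sample:
            continue
        cleaned.append({name: row[name] for name in kept})

    # Step 3: mean imputation of the remaining missing values.
    for name in kept:
        observed = [r[name] for r in cleaned if not is_missing(r[name])]
        if not observed:
            continue
        fill = mean(observed)
        for r in cleaned:
            if is_missing(r[name]):
                r[name] = fill
    return cleaned


# Toy table: "b" is constant, "c" is half missing, "a" has one missing value.
rows = [
    {"a": float(i + 1), "b": 5.0, "c": None if i < 5 else float(i), "d": 2.0 * i + 1.0}
    for i in range(10)
]
rows[0]["a"] = None
cleaned = clean_table(rows, max_missing_per_sample=0.5)
```

In this toy table, "b" is removed for lack of variation, "c" for too many missing values, and the missing "a" value is filled with the mean of the observed ones.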

Design experiment

class MEDiml.learning.DesignExperiment.DesignExperiment(path_study: Path, path_settings: Path, experiment_label: str)[source]

Bases: object

__init__(path_study: Path, path_settings: Path, experiment_label: str) None[source]

Constructor of the class DesignExperiment.

Parameters:
  • path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.

  • path_settings (Path) – Path to the settings file.

  • experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiments_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).

Returns:

None
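As an illustration of the naming scheme described for experiment_label, the sketch below builds the ml__<label>.json file name and the learn__<label> folder name with pathlib. The helper experiment_paths and the example folder name are hypothetical and not part of MEDiml.

```python
from pathlib import Path


def experiment_paths(path_experiments: Path, experiment_label: str):
    """Hypothetical helper: derive the per-experiment file and folder paths."""
    settings_file = path_experiments / f"ml__{experiment_label}.json"
    learn_folder = path_experiments / f"learn__{experiment_label}"
    return settings_file, learn_folder


# The folder name "experiments" is an assumption made for the example.
settings_file, learn_folder = experiment_paths(
    Path("experiments"), "experiment1_morph_CT"
)
```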

create_experiment() Dict[source]

Creates the machine learning experiment dictionary and organizes the information of each test/split in a separate folder.

Parameters:

ml (dict, optional) – Dictionary containing all the machine learning settings. Defaults to None.

Returns:

Dictionary containing all the organized machine learning settings.

Return type:

Dict

generate_experiment()[source]

Generates the JSON files containing all the options of the experiment. These JSON files are then used for machine learning.

Feature set reduction

class MEDiml.learning.FSR.FSR(method: str = 'fda')[source]

Bases: object

__init__(method: str = 'fda') None[source]

Feature set reduction class constructor.

Parameters:

method (str) – Method of feature set reduction. Can be “FDA”, “LASSO” or “mRMR”.

apply_fda(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, logging: bool = True, path_save_logging: Path | None = None) List[source]

Applies false discovery avoidance method.

Parameters:
  • ml (dict) – Machine learning dictionary containing the learning options.

  • variable_table (List) – Table of variables.

  • outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

  • logging (bool, optional) – If True, will save a dict that tracks the features selected for each level. Defaults to True.

  • path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fda_balanced(ml: Dict, variable_table: List, outcome_table_binary: DataFrame) List[source]

Applies false discovery avoidance method but balances the number of features on each level.

Parameters:
  • ml (dict) – Machine learning dictionary containing the learning options.

  • variable_table (List) – Table of variables.

  • outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

  • logging (bool, optional) – If True, will save a dict that tracks the features selected for each level. Defaults to True.

  • path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fda_one_space(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, del_variants: bool = True, logging_dict: Dict | None = None) List[source]

Applies false discovery avoidance method.

Parameters:
  • ml (dict) – Machine learning dictionary containing the learning options.

  • variable_table (List) – Table of variables.

  • outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

  • del_variants (bool, optional) – If True, will delete the variants of the same feature. Defaults to True.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_fsr(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, path_save_logging: Path | None = None) List[source]

Applies feature set reduction method.

Parameters:
  • ml (dict) – Machine learning dictionary containing the learning options.

  • variable_table (List) – Table of variables.

  • outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

  • path_save_logging (Path, optional) – Path to save logging information. Defaults to None.

Returns:

Table of variables after feature set reduction.

Return type:

List

apply_random_fsr(ml: Dict, variable_table: List) List[source]

Applies random feature set reduction by choosing a random number of features.

Parameters:
  • ml (dict) – Machine learning dictionary containing the learning options.

  • variable_table (List) – Table of variables.

  • outcome_table_binary (pd.DataFrame) – Table of binary outcomes.

Returns:

Table of variables after feature set reduction.

Return type:

List
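Random feature set reduction, as described above, keeps a randomly sized random subset of the features. A minimal sketch, assuming the features are given as a plain list of names; the function random_feature_subset and its seed argument are hypothetical and do not reflect the signature of apply_random_fsr.

```python
import random


def random_feature_subset(feature_names, seed=None):
    """Hypothetical helper: keep a random number of randomly chosen features."""
    rng = random.Random(seed)
    # Draw how many features to keep (at least one), then draw which ones.
    n_keep = rng.randint(1, len(feature_names))
    chosen = rng.sample(feature_names, n_keep)
    # Return the chosen features in their original order.
    return sorted(chosen, key=feature_names.index)


features = ["morph_vol", "stats_mean", "glcm_energy", "glrlm_sre", "ngtdm_busyness"]
subset = random_feature_subset(features, seed=42)
```

Passing a fixed seed makes the selection reproducible, which is useful when the random baseline has to be compared across runs.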

apply_random_fsr_one_space(ml: Dict, variable_table: DataFrame) List[source]

Normalization

class MEDiml.learning.Normalization.CombatNormalization(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible Transformer for ComBat Normalization.

This transformer assumes the input X (DataFrame) contains both the features to be normalized and the column identifying the institution/batch.

__init__(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]
Parameters:
  • institution_col (str) – Name of the column in X containing the institution/batch IDs. If None, tries to derive from Index using util.

  • covariates (list) – List of column names in X to treat as covariates (biological retention).

  • drop_institution (bool) – If True, removes the institution column from output.

_process_institutions(institutions)[source]

Helper to map institution strings to integers.
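One simple way to map institution strings to integers is to assign codes in order of first appearance. The sketch below is an assumption about how such a helper could work, not the actual _process_institutions code; encode_institutions is a hypothetical name.

```python
def encode_institutions(institutions):
    """Hypothetical helper: map each institution label to an integer code."""
    codes = {}
    encoded = []
    for name in institutions:
        # Assign the next free integer the first time a label is seen.
        if name not in codes:
            codes[name] = len(codes)
        encoded.append(codes[name])
    return encoded, codes


encoded, codes = encode_institutions(["A", "B", "A", "C"])
# encoded is [0, 1, 0, 2] and codes is {"A": 0, "B": 1, "C": 2}
```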

fit(X, y=None)[source]

ComBat calculates parameters on the current batch data provided in transform. Standard fit does nothing but validate input exists.

transform(X)[source]

Applies ComBat Normalization.

Radiomics Learner

class MEDiml.learning.RadiomicsLearner.RadiomicsLearner(path_study: Path, path_settings: Path, experiment_label: str)[source]

Bases: object

__init__(path_study: Path, path_settings: Path, experiment_label: str) None[source]

Constructor of the class RadiomicsLearner.

Parameters:
  • path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.

  • path_settings (Path) – Path to the settings folder.

  • experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiments_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).

Returns:

None

get_hold_out_set_table(ml: Dict, var_id: str, patients_id: List)[source]

Loads and pre-processes different radiomics tables then combines them to be used for hold-out testing.

Parameters:
  • ml (Dict) – The machine learning dictionary containing the information of the machine learning test.

  • var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.

  • patients_id (List) – List of patients of the hold-out set.

Returns:

Radiomics table for the hold-out set.

Return type:

pd.DataFrame

ml_run(path_ml: Path, holdout_test: bool = True, method: str = 'auto') None[source]

This function runs the machine learning test for the created experiment.

Parameters:
  • path_ml (Path) – Path to the main dictionary containing information about the current machine learning experiment.

  • holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.

Returns:

None.

pre_process_radiomics_table(ml: Dict, var_id: str, outcome_table_binary: DataFrame, patients_train: list) Tuple[DataFrame, DataFrame][source]

For the given variable, this function loads the corresponding radiomics tables and pre-processes them (cleaning, normalization and feature set reduction).

Note

Only patients of the training/learning set should be found in the given outcome table.

Parameters:
  • ml (Dict) – The machine learning dictionary containing the information of the machine learning test (parameters, options, etc.).

  • var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.

  • outcome_table_binary (pd.DataFrame) – outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.

  • patients_train (list) – List of patients to use for training.

Returns:

Two dataframes of processed radiomics tables, one for training and one for testing (no feature set reduction).

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

pre_process_variables(ml: Dict, outcome_table_binary: DataFrame) Tuple[DataFrame, DataFrame][source]

Loads and pre-processes different radiomics tables from different variable types found in the ml dict.

Note

Only patients of the training/learning set should be found in this outcome table.

Parameters:
  • ml (Dict) – The machine learning dictionary containing the information of the machine learning test.

  • outcome_table_binary (pd.DataFrame) – outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.

Returns:

Two dicts of processed radiomics tables, one dict for training and one for testing (no feature set reduction).

Return type:

Tuple

run_experiment(holdout_test: bool = True, method: str = 'pycaret') None[source]

Runs the machine learning experiment for each split/run.

Parameters:
  • holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.

  • method (str, optional) –

    String specifying the method to use to train the model. Available options:

    • “pycaret”: Use PyCaret to train the model (automatic).

    • “grid_search”: Grid search with cross-validation to find the best parameters.

    • “random_search”: Random search with cross-validation to find the best parameters.

Returns:

None

Results

class MEDiml.learning.Results.Results(model_dict: dict | None = None, model_id: str = '')[source]

Bases: object

A class to analyze the results of a given machine learning experiment, including the assessment of the model’s performance.

Parameters:
  • model_dict (dict, optional) – Dictionary containing the model’s parameters. Defaults to {}.

  • model_id (str, optional) – ID of the model. Defaults to “”.

model_dict

Dictionary containing the model’s parameters.

Type:

dict

model_id

ID of the model.

Type:

str

results_dict

Dictionary containing the results of the model’s performance.

Type:

dict

__init__(model_dict: dict | None = None, model_id: str = '') None[source]

Constructor of the class Results.

average_results(path_results: Path, save: bool = False) dict[source]

Averages the results (AUC, BAC, Sensitivity and Specificity) of all the runs of the same experiment, for training, testing and holdout sets.

Parameters:
  • path_results (Path) – path to the folder containing the results of the experiment.

  • save (bool, optional) – If True, saves the results in the same folder as the model.

Returns:

Averaged results for each dataset.

Return type:

dict
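Averaging over runs amounts to collecting each metric per dataset and summarizing it. The sketch below assumes each run is a nested dict of the form {dataset: {metric: value}}; that layout, the function average_runs and the "_std" key are assumptions made for illustration. The "_mean" suffix mirrors the ‘AUC_mean’ metric name used elsewhere in this class.

```python
from statistics import mean, stdev


def average_runs(run_results):
    """Hypothetical helper: average each metric of each dataset across runs."""
    averaged = {}
    for dataset in run_results[0]:
        averaged[dataset] = {}
        for metric in run_results[0][dataset]:
            values = [run[dataset][metric] for run in run_results]
            averaged[dataset][f"{metric}_mean"] = mean(values)
            # Sample standard deviation needs at least two runs.
            averaged[dataset][f"{metric}_std"] = (
                stdev(values) if len(values) > 1 else 0.0
            )
    return averaged


runs = [
    {"train": {"AUC": 1.0, "BAC": 0.9}, "test": {"AUC": 0.5, "BAC": 0.6}},
    {"train": {"AUC": 0.9, "BAC": 0.8}, "test": {"AUC": 1.0, "BAC": 0.7}},
]
averaged = average_runs(runs)
```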

bootstrap_metrics(response: ndarray, labels: DataFrame, thresh: float, n_bootstraps: int = 100) dict[source]

Computes 95% Confidence Intervals using bootstrap resampling.

Parameters:
  • response (np.ndarray) – Array of the probabilities of class “1” for all instances (prediction).

  • labels (pd.DataFrame) – Column vector specifying the outcome status (1 or 0) for all instances.

  • thresh (float) – Optimal threshold selected from the ROC curve.

  • n_bootstraps (int, optional) – Number of bootstrap samples. Defaults to 100.

Returns:

Dictionary containing the 95% confidence intervals for each metric.

Return type:

dict
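A percentile bootstrap is one common way to obtain such intervals: resample the instances with replacement, recompute the metric on each resample, and read off the 2.5th and 97.5th percentiles. The sketch below uses accuracy as the metric for brevity; the function names, the seed argument and the rule of skipping single-class resamples are assumptions, not the bootstrap_metrics implementation.

```python
import random


def accuracy(response, labels, thresh=0.5):
    """Fraction of instances classified correctly at the given threshold."""
    predicted = [1 if p >= thresh else 0 for p in response]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def bootstrap_ci(response, labels, metric, n_bootstraps=100, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_bootstraps):
        # Resample instance indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        # Assumption: skip resamples that contain a single class only.
        if len(set(sample_labels)) < 2:
            continue
        scores.append(metric([response[i] for i in idx], sample_labels))
    if not scores:
        raise ValueError("no valid bootstrap sample was drawn")
    scores.sort()
    lower = scores[int((alpha / 2) * (len(scores) - 1))]
    upper = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return lower, upper


response = [0.9, 0.8, 0.35, 0.1, 0.7, 0.6, 0.2, 0.4]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
ci_low, ci_high = bootstrap_ci(response, labels, accuracy)
```

With only 100 resamples the interval bounds are coarse; a larger n_bootstraps gives smoother percentile estimates at the cost of run time.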

get_model_performance(response: list, outcome_table: DataFrame) None[source]

Calculates the performance of the model.

Parameters:
  • response (list) – List of machine learning model predictions.

  • outcome_table (pd.DataFrame) – Outcome table with binary labels.

Returns:

Updates the run_results attribute.

Return type:

None
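For reference, threshold-based metrics such as sensitivity, specificity and balanced accuracy (BAC) can be computed from predicted probabilities and binary labels as sketched below. The function binary_metrics, its plain-list inputs and the default threshold of 0.5 are assumptions for the example; the metrics and threshold actually used by get_model_performance may differ.

```python
def binary_metrics(response, labels, thresh=0.5):
    """Hypothetical helper: sensitivity, specificity and BAC at a threshold."""
    predicted = [1 if p >= thresh else 0 for p in response]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    # Each rate is undefined when its class is absent from the labels.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "BAC": (sensitivity + specificity) / 2,
    }


metrics = binary_metrics([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0])
# One true positive, one false negative, one false positive, one true negative.
```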

get_optimal_level(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', p_value_test: str = 'wilcoxon', aggregate: bool = False) None[source]

This function plots a heatmap of the metrics values for the performance of the models in the given experiment.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiments_labels (List) – List of experiment labels to use for the plot. Variants can be included as nested lists. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].

  • metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.

  • p_value_test (str, optional) –

    Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:

    • ‘delong’: Delong test.

    • ‘ttest’: T-test.

    • ‘wilcoxon’: Wilcoxon signed rank test.

    • ‘bengio’: Bengio and Nadeau corrected t-test.

  • aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.

Returns:

None.
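To illustrate the ‘bengio’ option, the corrected resampled t-test of Nadeau and Bengio inflates the variance of the per-split score differences by a term that depends on the test/train size ratio. The sketch below computes only the test statistic; the function name and arguments are hypothetical, and whether MEDiml applies exactly this formula is an assumption.

```python
import math
from statistics import mean, variance


def corrected_t_statistic(scores_a, scores_b, n_train, n_test):
    """Hypothetical helper: Nadeau-Bengio corrected resampled t statistic."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance (k - 1 in the denominator)
    if var == 0:
        raise ValueError("all differences are identical; statistic undefined")
    # The n_test / n_train term corrects for overlapping training sets.
    return mean(diffs) / math.sqrt((1 / k + n_test / n_train) * var)


# Per-split AUCs of two models on the same splits (made-up numbers).
auc_a = [0.80, 0.82, 0.79, 0.85]
auc_b = [0.75, 0.80, 0.78, 0.80]
t_corrected = corrected_t_statistic(auc_a, auc_b, n_train=80, n_test=20)
```

The statistic is compared against a Student t distribution with k - 1 degrees of freedom, where k is the number of splits. It is always smaller in magnitude than the uncorrected paired t statistic, which makes the test more conservative.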

plot_fda_analysis_heatmap(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, save: bool = False) None[source]

This function plots a heatmap of the percentage of stable features and final features selected by FDA for a given experiment.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].

  • modalities (List) – List of imaging modalities to include in the plot.

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_feature_analysis(path_experiments: Path, experiment: str, levels: List, modalities: List = [], title: str | None = None, save: bool = False) None[source]

This function plots a pie chart of the percentage of the final features used to train the model per radiomics level.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].

  • modalities (List, optional) – List of imaging modalities to include in the plot. Defaults to [].

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_features_importance_histogram(path_experiments: Path, experiment: str, level: str, modalities: List, sort_option: str = 'importance', title: str | None = None, save: bool = True, figsize: tuple = (12, 12)) None[source]

Plots a histogram of the features importance for the given experiment.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • level (str) – Radiomics level to plot. For example: ‘morph’.

  • modalities (List) – List of imaging modalities to use for the plot. A plot for each modality.

  • sort_option (str, optional) –

    Option used to sort the features. Available options:

    • ‘importance’: Sorts the features by importance.

    • ‘times_selected’: Sorts the features by the number of times they were selected across the different splits.

    • ‘both’: Sorts the features by importance and then by the number of times they were selected.

  • title (str, optional) – Title of the plot. Defaults to None.

  • save (bool, optional) – Whether to save the plot. Defaults to True.

  • figsize (tuple, optional) – Size of the figure. Defaults to (12, 12).

Returns:

None. Plots the figure or saves it.

plot_heatmap(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', stat_extra: list = [], plot_p_values: bool = True, p_value_test: str = 'wilcoxon', aggregate: bool = False, title: str | None = None, save: bool = False, figsize: tuple = (8, 8)) None[source]

This function plots a heatmap of the metrics values for the performance of the models in the given experiment.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiments_labels (List) – List of experiment labels to use for the plot. Variants can be included as nested lists. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].

  • metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.

  • stat_extra (list, optional) – List of extra statistics to include in the plot. Defaults to [].

  • plot_p_values (bool, optional) – If True, plots the p-value of the chosen test. Defaults to True.

  • p_value_test (str, optional) –

    Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:

    • ‘delong’: Delong test.

    • ‘ttest’: T-test.

    • ‘wilcoxon’: Wilcoxon signed rank test.

    • ‘bengio’: Bengio and Nadeau corrected t-test.

  • aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.

  • extra_xlabels (List, optional) – List of extra x-axis labels. Defaults to [].

  • title (str, optional) – Title of the plot. Defaults to None.

  • save (bool, optional) – Whether to save the plot. Defaults to False.

  • figsize (tuple, optional) – Size of the figure. Defaults to (8, 8).

Returns:

None.

plot_lf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]

Plots a tree explaining the impact of features in the linear filters radiomics complexity level.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • level (str) – Radiomics complexity level to use for the plot.

  • modalities (list) – List of imaging modalities to include in the plot.

  • initial_width (float, optional) – Initial width of the lines, for aesthetic purposes. Defaults to 4.

  • lines_weight (float, optional) – Weight applied to the lines of the tree, for aesthetic purposes. Defaults to 1.

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

plot_original_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]

Plots a tree explaining the impact of features in the original radiomics complexity level.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • level (str) – Radiomics complexity level to use for the plot.

  • modalities (list) – List of imaging modalities to include in the plot.

  • initial_width (float, optional) – Initial width of the lines, for aesthetic purposes. Defaults to 4.

  • lines_weight (float, optional) – Weight applied to the lines of the tree, for aesthetic purposes. Defaults to 1.

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

plot_radiomics_starting_percentage(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, figsize: tuple = (15, 10), save: bool = False) None[source]

This function plots a pie chart of the percentage of features used in experiment per radiomics level.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • levels (List) – List of radiomics levels to include in the plot.

  • modalities (List) – List of imaging modalities to include in the plot.

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • figsize (tuple, optional) – Size of the figure. Defaults to (15, 10).

  • save (bool, optional) – Whether to save the plot. Defaults to False.

Returns:

None.

plot_tf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]

Plots a tree explaining the impact of features in the textural filters radiomics complexity level.

Parameters:
  • path_experiments (Path) – Path to the folder containing the experiments.

  • experiment (str) – Name of the experiment to plot. Will be used to find the results.

  • level (str) – Radiomics complexity level to use for the plot.

  • modalities (list) – List of imaging modalities to include in the plot.

  • initial_width (float, optional) – Initial width of the lines, for aesthetic purposes. Defaults to 4.

  • lines_weight (float, optional) – Weight applied to the lines of the tree, for aesthetic purposes. Defaults to 1.

  • title (str, optional) – Title and name used to save the plot. Defaults to None.

  • figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).

Returns:

None.

to_json(response_train: list | None = None, response_test: list | None = None, response_holdout: list | None = None, patients_train: list | None = None, patients_test: list | None = None, patients_holdout: list | None = None, outcome_table_binary_train: DataFrame | None = None, outcome_table_binary_test: DataFrame | None = None, outcome_table_binary_holdout: DataFrame | None = None) dict[source]

Creates a dictionary with the results of the model using the class attributes.

Parameters:
  • response_train (list) – List of machine learning model predictions for the training set.

  • response_test (list) – List of machine learning model predictions for the test set.

  • response_holdout (list) – List of machine learning model predictions for the holdout set.

  • patients_train (list) – List of patients in the training set.

  • patients_test (list) – List of patients in the test set.

  • patients_holdout (list) – List of patients in the holdout set.

  • outcome_table_binary_train (pd.DataFrame) – Binary outcome table for the training set.

  • outcome_table_binary_test (pd.DataFrame) – Binary outcome table for the test set.

  • outcome_table_binary_holdout (pd.DataFrame) – Binary outcome table for the holdout set.

Returns:

Dictionary with the responses of the model and the patients used for training, testing and holdout.

Return type:

Dict