Learning
DataCleaner
- class MEDiml.learning.DataCleaner.DataCleaner(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]
Bases:
BaseEstimator, TransformerMixin
A scikit-learn compatible transformer that cleans features by removing those with too many missing values or too little variation, removes samples with too many missing features, and imputes missing values.
- __init__(var_type: str = 'continuous', imputation: str = 'mean', missingCutoffpf: float = 0.1, missingCutoffps: float = 0.25, covCutoff: float = 0.1, random_state=None)[source]
Initializes the DataCleaner with specified parameters for feature and sample filtering and imputation.
- Parameters:
var_type (str) – Type of variable (“continuous”, “hcategorical”, “icategorical”).
imputation (str) – Method of imputation (“mean”, “median”, “mode”, “random”). Defaults to “mean”.
missingCutoffpf (float) – Maximum proportion of missing values allowed per feature (column). Defaults to 0.1.
missingCutoffps (float) – Maximum proportion of missing values allowed per sample (row). Defaults to 0.25.
covCutoff (float) – Minimum coefficient of variation allowed per feature. Defaults to 0.1.
random_state (int, RandomState instance or None) – Seed for reproducibility.
- Returns:
None
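The cleaning steps described above can be sketched in plain Python. This is a minimal illustration, not MEDiml's implementation: the function name `clean_table`, the list-of-rows layout, the use of `None` for missing values, the treatment of the cutoffs as fractions, and the order of the steps are all assumptions. Only the three cutoffs and mean imputation mirror the documented parameters.

```python
from statistics import mean, pstdev


def clean_table(rows, missing_cutoff_pf=0.1, missing_cutoff_ps=0.25, cov_cutoff=0.1):
    """Hypothetical sketch: filter features and samples, then mean-impute.

    rows: list of samples, each a list of floats with None marking a missing value.
    Returns (cleaned_rows, kept_column_indices).
    """
    n_rows = len(rows)
    n_cols = len(rows[0])

    kept = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        # 1. Drop features with too many missing values.
        missing_frac = 1 - len(present) / n_rows
        if missing_frac > missing_cutoff_pf or not present:
            continue
        # 2. Drop features with too little variation (coefficient of variation).
        mu = mean(present)
        cov = pstdev(present) / abs(mu) if mu != 0 else float("inf")
        if cov < cov_cutoff:
            continue
        kept.append(j)

    # 3. Drop samples with too many missing values among the kept features.
    kept_rows = []
    for r in rows:
        vals = [r[j] for j in kept]
        if kept and sum(v is None for v in vals) / len(kept) > missing_cutoff_ps:
            continue
        kept_rows.append(vals)

    # 4. Impute the remaining missing values with the column mean.
    for k in range(len(kept)):
        present = [r[k] for r in kept_rows if r[k] is not None]
        if not present:
            continue
        fill = mean(present)
        for r in kept_rows:
            if r[k] is None:
                r[k] = fill
    return kept_rows, kept
```

Because the real class derives from BaseEstimator and TransformerMixin, it would normally learn the kept columns and imputation values during fitting and reuse them at transform time; the sketch collapses both phases into one call for brevity.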
Design experiment
- class MEDiml.learning.DesignExperiment.DesignExperiment(path_study: Path, path_settings: Path, experiment_label: str)[source]
Bases:
object
- __init__(path_study: Path, path_settings: Path, experiment_label: str) None[source]
Constructor of the class DesignExperiment.
- Parameters:
path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.
path_settings (Path) – Path to the settings file.
experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiment_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).
- Returns:
None
- create_experiment() Dict[source]
Creates the machine learning experiment dictionary and organizes the information of each test/split in a separate folder.
- Parameters:
ml (dict, optional) – Dictionary containing all the machine learning settings. Defaults to None.
- Returns:
Dictionary containing all the organized machine learning settings.
- Return type:
Dict
Feature set reduction
- class MEDiml.learning.FSR.FSR(method: str = 'fda')[source]
Bases:
object
- __init__(method: str = 'fda') None[source]
Feature set reduction class constructor.
- Parameters:
method (str) – Method of feature set reduction. Can be “FDA”, “LASSO” or “mRMR”.
- apply_fda(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, logging: bool = True, path_save_logging: Path | None = None) List[source]
Applies false discovery avoidance method.
- Parameters:
ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
logging (bool, optional) – If True, will save a dict that tracks features selected for each level. Defaults to True.
path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.
- Returns:
Table of variables after feature set reduction.
- Return type:
List
- apply_fda_balanced(ml: Dict, variable_table: List, outcome_table_binary: DataFrame) List[source]
Applies false discovery avoidance method but balances the number of features on each level.
- Parameters:
ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
logging (bool, optional) – If True, will save a dict that tracks features selected for each level. Defaults to True.
path_save_logging (Path, optional) – Path to save the logging dict. Defaults to None.
- Returns:
Table of variables after feature set reduction.
- Return type:
List
- apply_fda_one_space(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, del_variants: bool = True, logging_dict: Dict | None = None) List[source]
Applies false discovery avoidance method.
- Parameters:
ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
del_variants (bool, optional) – If True, will delete the variants of the same feature. Defaults to True.
- Returns:
Table of variables after feature set reduction.
- Return type:
List
- apply_fsr(ml: Dict, variable_table: List, outcome_table_binary: DataFrame, path_save_logging: Path | None = None) List[source]
Applies feature set reduction method.
- Parameters:
ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
path_save_logging (Path, optional) – Path to save logging information. Defaults to None.
- Returns:
Table of variables after feature set reduction.
- Return type:
List
- apply_random_fsr(ml: Dict, variable_table: List) List[source]
Applies random feature set reduction by choosing a random number of features.
- Parameters:
ml (dict) – Machine learning dictionary containing the learning options.
variable_table (List) – Table of variables.
outcome_table_binary (pd.DataFrame) – Table of binary outcomes.
- Returns:
Table of variables after feature set reduction.
- Return type:
List
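Random feature set reduction, as described for apply_random_fsr, amounts to drawing a random subset size and then a random subset of that size. The sketch below shows the idea on a plain list of feature names; the function name `random_feature_subset`, the `seed` argument and the "at least one feature" rule are assumptions for illustration, not part of the FSR class.

```python
import random


def random_feature_subset(feature_names, seed=None):
    """Hypothetical sketch: keep a random number of randomly chosen features."""
    rng = random.Random(seed)
    # Draw the subset size first (at least one feature), then the features themselves.
    n_keep = rng.randint(1, len(feature_names))
    chosen = set(rng.sample(feature_names, n_keep))
    # Return the survivors in their original column order.
    return [name for name in feature_names if name in chosen]
```

A baseline like this is typically useful for checking whether a guided method such as FDA selects features better than chance.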
Normalization
- class MEDiml.learning.Normalization.CombatNormalization(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]
Bases:
BaseEstimator, TransformerMixin
Sklearn-compatible Transformer for ComBat Normalization.
This transformer assumes the input X (DataFrame) contains both the features to be normalized and the column identifying the institution/batch.
- __init__(institution_col: str | None = None, covariates: list | None = None, drop_institution: bool = True)[source]
- Parameters:
institution_col (str) – Name of the column in X containing the institution/batch IDs. If None, tries to derive from Index using util.
covariates (list) – List of column names in X to treat as covariates (biological retention).
drop_institution (bool) – If True, removes the institution column from output.
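To give an intuition for what batch harmonization does with the institution column, the sketch below removes a per-institution mean shift from a single feature. This is deliberately not ComBat: ComBat also adjusts per-batch scale, shrinks the batch estimates with empirical Bayes, and preserves covariate effects. The function name `center_by_institution` and the list-based inputs are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def center_by_institution(values, institutions):
    """Shift each institution's values so every institution shares the overall mean."""
    overall = mean(values)
    groups = defaultdict(list)
    for value, inst in zip(values, institutions):
        groups[inst].append(value)
    # Offset of each institution's mean from the overall mean.
    offsets = {inst: mean(vals) - overall for inst, vals in groups.items()}
    return [value - offsets[inst] for value, inst in zip(values, institutions)]
```

After this adjustment the two institutions in a toy example such as `[1.0, 2.0, 3.0]` and `[11.0, 12.0, 13.0]` overlap, while the spread within each institution is unchanged.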
Radiomics Learner
- class MEDiml.learning.RadiomicsLearner.RadiomicsLearner(path_study: Path, path_settings: Path, experiment_label: str)[source]
Bases:
object
- __init__(path_study: Path, path_settings: Path, experiment_label: str) None[source]
Constructor of the class RadiomicsLearner.
- Parameters:
path_study (Path) – Path to the main study folder where the outcomes, learning patients and holdout patients dictionaries are found.
path_settings (Path) – Path to the settings folder.
experiment_label (str) – String specifying the label to attach to a given learning experiment in “path_experiments”. This label will be attached to the ml__$experiment_label$.json file as well as the learn__$experiment_label$ folder. This label is used to keep track of different experiments with different settings (e.g. radiomics, scans, machine learning algorithms, etc.).
- Returns:
None
- get_hold_out_set_table(ml: Dict, var_id: str, patients_id: List)[source]
Loads and pre-processes different radiomics tables then combines them to be used for hold-out testing.
- Parameters:
ml (Dict) – The machine learning dictionary containing the information of the machine learning test.
var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.
patients_id (List) – List of patients of the hold-out set.
- Returns:
Radiomics table for the hold-out set.
- Return type:
pd.DataFrame
- ml_run(path_ml: Path, holdout_test: bool = True, method: str = 'auto') None[source]
This function runs the machine learning test for the created experiment.
- Parameters:
path_ml (Path) – Path to the main dictionary containing info about the current ml experiment.
holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.
- Returns:
None.
- pre_process_radiomics_table(ml: Dict, var_id: str, outcome_table_binary: DataFrame, patients_train: list) Tuple[DataFrame, DataFrame][source]
For the given variable, this function loads the corresponding radiomics tables and pre-processes them (cleaning, normalization and feature set reduction).
Note
Only patients of the training/learning set should be found in the given outcome table.
- Parameters:
ml (Dict) – The machine learning dictionary containing the information of the machine learning test (parameters, options, etc.).
var_id (str) – String specifying the ID of the radiomics variable in ml. For example: ‘var1’.
outcome_table_binary (pd.DataFrame) – Outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.
patients_train (list) – List of patients to use for training.
- Returns:
Two dataframes of processed radiomics tables, one for training and one for testing (no feature set reduction).
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- pre_process_variables(ml: Dict, outcome_table_binary: DataFrame) Tuple[DataFrame, DataFrame][source]
Loads and pre-processes different radiomics tables from different variable types found in the ml dict.
Note
Only patients of the training/learning set should be found in this outcome table.
- Parameters:
ml (Dict) – The machine learning dictionary containing the information of the machine learning test.
outcome_table_binary (pd.DataFrame) – Outcome table with binary labels. This table may be used to pre-process some variables with the “FDA” feature set reduction algorithm.
- Returns:
Two dicts of processed radiomics tables, one dict for training and one for testing (no feature set reduction).
- Return type:
Tuple
- run_experiment(holdout_test: bool = True, method: str = 'pycaret') None[source]
Runs the machine learning experiment for each split/run.
- Parameters:
holdout_test (bool, optional) – Boolean specifying if the hold-out test should be performed.
method (str, optional) – String specifying the method to use to train the model. Defaults to “pycaret”. Available options:
“pycaret”: Use PyCaret to train the model (automatic).
“grid_search”: Grid search with cross-validation to find the best parameters.
“random_search”: Random search with cross-validation to find the best parameters.
- Returns:
None
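The difference between the “grid_search” and “random_search” options can be illustrated without any machine learning library: grid search enumerates every combination of the candidate values, while random search draws a fixed number of combinations. The helper names, the parameter names in the usage note and the `score_fn` callback (which would stand in for a cross-validated score) are hypothetical; this is not how RadiomicsLearner is implemented.

```python
import itertools
import random


def grid_candidates(param_grid):
    """Every combination of the listed values (exhaustive grid search)."""
    names = sorted(param_grid)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*(param_grid[n] for n in names))
    ]


def random_candidates(param_grid, n_iter, seed=None):
    """n_iter combinations drawn independently at random (random search)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in sorted(param_grid.items())}
        for _ in range(n_iter)
    ]


def best_candidate(candidates, score_fn):
    """Return the candidate with the highest score (e.g. a mean cross-validation score)."""
    return max(candidates, key=score_fn)
```

For a grid such as `{"max_depth": [2, 4, 6], "learning_rate": [0.1, 0.3]}`, `grid_candidates` yields all six combinations, whereas `random_candidates(grid, n_iter=3)` evaluates only three, which is the usual trade-off between coverage and run time.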
Results
- class MEDiml.learning.Results.Results(model_dict: dict | None = None, model_id: str = '')[source]
Bases:
object
A class to analyze the results of a given machine learning experiment, including the assessment of the model’s performance.
- Parameters:
model_dict (dict, optional) – Dictionary containing the model’s parameters. Defaults to None.
model_id (str, optional) – ID of the model. Defaults to “”.
- model_dict
Dictionary containing the model’s parameters.
- Type:
dict
- model_id
ID of the model.
- Type:
str
- results_dict
Dictionary containing the results of the model’s performance.
- Type:
dict
- __init__(model_dict: dict | None = None, model_id: str = '') None[source]
Constructor of the class Results
- average_results(path_results: Path, save: bool = False) dict[source]
Averages the results (AUC, BAC, Sensitivity and Specificity) of all the runs of the same experiment, for training, testing and holdout sets.
- Parameters:
path_results (Path) – Path to the folder containing the results of the experiment.
save (bool, optional) – If True, saves the results in the same folder as the model.
- Returns:
Averaged results for each dataset.
- Return type:
dict
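Averaging over runs can be pictured as follows. The nested layout used here (one dict per run, keyed first by dataset and then by metric name) is an assumption made for the example, as is the function name `average_runs`; the files read by average_results may be organized differently.

```python
from statistics import mean


def average_runs(runs):
    """Average each metric over runs, separately for each dataset.

    runs: list of dicts shaped like {"train": {"AUC": 0.9, ...}, "test": {...}}.
    """
    averaged = {}
    for dataset in runs[0]:
        averaged[dataset] = {
            metric: mean(run[dataset][metric] for run in runs)
            for metric in runs[0][dataset]
        }
    return averaged
```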
- bootstrap_metrics(response: ndarray, labels: DataFrame, thresh: float, n_bootstraps: int = 100) dict[source]
Computes 95% Confidence Intervals using bootstrap resampling.
- Parameters:
response (np.ndarray) – Array of the probabilities of class “1” for all instances (prediction).
labels (pd.DataFrame) – Column vector specifying the outcome status (1 or 0) for all instances.
thresh (float) – Optimal threshold selected from the ROC curve.
n_bootstraps (int, optional) – Number of bootstrap samples. Defaults to 100.
- Returns:
Dictionary containing the 95% confidence intervals for each metric.
- Return type:
dict
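A percentile bootstrap is one common way to obtain such intervals: resample the instances with replacement, recompute each metric, and take the 2.5th and 97.5th percentiles of the resampled values. The sketch below does this for sensitivity and specificity only. The function name, the `seed` argument, the `>= thresh` decision rule and the choice of the percentile method are assumptions; the method documented above may compute its intervals differently and for more metrics.

```python
import random


def bootstrap_ci(response, labels, thresh, n_bootstraps=100, seed=0):
    """Percentile 95% CIs for sensitivity and specificity via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(labels)
    stats = {"Sensitivity": [], "Specificity": []}
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        pos = [response[i] >= thresh for i in idx if labels[i] == 1]
        neg = [response[i] < thresh for i in idx if labels[i] == 0]
        if not pos or not neg:
            continue  # skip resamples that contain only one class
        stats["Sensitivity"].append(sum(pos) / len(pos))
        stats["Specificity"].append(sum(neg) / len(neg))

    def percentile_interval(values):
        ordered = sorted(values)
        lower = ordered[int(0.025 * (len(ordered) - 1))]
        upper = ordered[int(0.975 * (len(ordered) - 1))]
        return lower, upper

    return {name: percentile_interval(vals) for name, vals in stats.items()}
```

With the default of 100 resamples the interval endpoints are coarse; more resamples generally give more stable endpoints at the cost of run time.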
- get_model_performance(response: list, outcome_table: DataFrame) None[source]
Calculates the performance of the model.
- Parameters:
response (list) – List of machine learning model predictions.
outcome_table (pd.DataFrame) – Outcome table with binary labels.
- Returns:
Updates the run_results attribute.
- Return type:
None
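The metrics named elsewhere in this class (AUC, BAC, Sensitivity and Specificity) can be computed from predicted probabilities and binary labels as sketched below. The sketch reads BAC as balanced accuracy, uses a fixed threshold with a `>=` decision rule, and returns a plain dict; those choices and the function name `binary_metrics` are assumptions, not the behaviour of get_model_performance.

```python
def binary_metrics(response, labels, thresh=0.5):
    """AUC, sensitivity, specificity and balanced accuracy (BAC) for binary labels."""
    pos = [r for r, y in zip(response, labels) if y == 1]
    neg = [r for r, y in zip(response, labels) if y == 0]
    # AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    auc = wins / (len(pos) * len(neg))
    sensitivity = sum(p >= thresh for p in pos) / len(pos)
    specificity = sum(q < thresh for q in neg) / len(neg)
    return {
        "AUC": auc,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "BAC": (sensitivity + specificity) / 2,
    }
```

The pairwise AUC is quadratic in the number of instances, which is fine for a small illustration but slower than a rank-based computation on large cohorts.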
- get_optimal_level(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', p_value_test: str = 'wilcoxon', aggregate: bool = False) None[source]
This function plots a heatmap of the metrics values for the performance of the models in the given experiment.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiments_labels (List) – List of experiments labels to use for the plot. Variants can be included as a nested list. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].
metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.
p_value_test (str, optional) –
Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:
’delong’: Delong test.
’ttest’: T-test.
’wilcoxon’: Wilcoxon signed rank test.
’bengio’: Bengio and Nadeau corrected t-test.
aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.
- Returns:
None.
- plot_fda_analysis_heatmap(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, save: bool = False) None[source]
This function plots a heatmap of the percentage of stable features and final features selected by FDA for a given experiment.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].
modalities (List) – List of imaging modalities to include in the plot.
title (str, optional) – Title and name used to save the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.
- Returns:
None.
- plot_feature_analysis(path_experiments: Path, experiment: str, levels: List, modalities: List = [], title: str | None = None, save: bool = False) None[source]
This function plots a pie chart of the percentage of the final features used to train the model per radiomics level.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in plot. For example: [‘morph’, ‘intensity’].
modalities (List, optional) – List of imaging modalities to include in the plot. Defaults to [].
title (str, optional) – Title and name used to save the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.
- Returns:
None.
- plot_features_importance_histogram(path_experiments: Path, experiment: str, level: str, modalities: List, sort_option: str = 'importance', title: str | None = None, save: bool = True, figsize: tuple = (12, 12)) None[source]
Plots a histogram of the features importance for the given experiment.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics level to plot. For example: ‘morph’.
modalities (List) – List of imaging modalities to use for the plot. A plot for each modality.
sort_option (str, optional) – Option used to sort the features. Available options: - ‘importance’: Sorts the features by importance. - ‘times_selected’: Sorts the features by the number of times they were selected across the different splits. - ‘both’: Sorts the features by importance and then by the number of times they were selected.
title (str, optional) – Title of the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 12).
- Returns:
None. Plots the figure or saves it.
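The three sort_option values can be read as three sort keys. The sketch below applies them to a list of per-feature records; the record layout, the descending order and the function name `sort_features` are assumptions for illustration.

```python
def sort_features(features, sort_option="importance"):
    """Sort feature records; each record has 'name', 'importance' and 'times_selected'."""
    keys = {
        "importance": lambda f: f["importance"],
        "times_selected": lambda f: f["times_selected"],
        # Importance first, number of selections as the tie-breaker.
        "both": lambda f: (f["importance"], f["times_selected"]),
    }
    if sort_option not in keys:
        raise ValueError(f"Unknown sort_option: {sort_option!r}")
    return sorted(features, key=keys[sort_option], reverse=True)
```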
- plot_heatmap(path_experiments: Path, experiments_labels: List[str], metric: str = 'AUC_mean', stat_extra: list = [], plot_p_values: bool = True, p_value_test: str = 'wilcoxon', aggregate: bool = False, title: str | None = None, save: bool = False, figsize: tuple = (8, 8)) None[source]
This function plots a heatmap of the metrics values for the performance of the models in the given experiment.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiments_labels (List) – List of experiments labels to use for the plot. Variants can be included as a nested list. For example: [‘experiment1_morph_CT’, [‘experiment1_intensity5_CT’, ‘experiment1_intensity10_CT’], ‘experiment1_texture_CT’].
metric (str, optional) – Metric to plot. Defaults to ‘AUC_mean’.
stat_extra (list, optional) – List of extra statistics to include in the plot. Defaults to [].
plot_p_values (bool, optional) – If True, plots the p-value of the chosen test. Defaults to True.
p_value_test (str, optional) –
Method to use to calculate the p-value. Defaults to ‘wilcoxon’. Available options:
’delong’: Delong test.
’ttest’: T-test.
’wilcoxon’: Wilcoxon signed rank test.
’bengio’: Bengio and Nadeau corrected t-test.
aggregate (bool, optional) – If True, aggregates the results of all the splits and computes one final p-value. Only valid for the Delong test when cross-validation is used. Defaults to False.
extra_xlabels (List, optional) – List of extra x-axis labels. Defaults to [].
title (str, optional) – Title of the plot. Defaults to None.
save (bool, optional) – Whether to save the plot. Defaults to False.
figsize (tuple, optional) – Size of the figure. Defaults to (8, 8).
- Returns:
None.
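Of the p-value tests listed above, the ‘bengio’ option presumably refers to the corrected resampled t-test of Nadeau and Bengio, which inflates the variance of the per-split score differences by the ratio of test to training set size to account for overlapping training sets. The sketch below computes only the test statistic; the function name and arguments are hypothetical, and the p-value would then be read from a t distribution with one fewer degree of freedom than the number of splits.

```python
from math import sqrt
from statistics import mean, variance


def corrected_resampled_t(scores_a, scores_b, n_train, n_test):
    """Nadeau-Bengio corrected resampled t statistic for paired per-split scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance, k - 1 in the denominator
    # The n_test / n_train term is the correction; without it this is the usual paired t.
    return mean(diffs) / sqrt((1 / k + n_test / n_train) * var)
```

Because the correction only enlarges the denominator, the corrected statistic is always smaller in magnitude than the uncorrected paired t statistic on the same scores.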
- plot_lf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]
Plots a tree explaining the impact of features in the linear filters radiomics complexity level.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).
- Returns:
None.
- plot_original_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]
Plots a tree explaining the impact of features in the original radiomics complexity level.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).
- Returns:
None.
- plot_radiomics_starting_percentage(path_experiments: Path, experiment: str, levels: List, modalities: List, title: str | None = None, figsize: tuple = (15, 10), save: bool = False) None[source]
This function plots a pie chart of the percentage of features used in experiment per radiomics level.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
levels (List) – List of radiomics levels to include in the plot.
modalities (List) – List of imaging modalities to include in the plot.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (15, 10).
save (bool, optional) – Whether to save the plot. Defaults to False.
- Returns:
None.
- plot_tf_level_tree(path_experiments: Path, experiment: str, level: str, modalities: list, initial_width: float = 4, lines_weight: float = 1, title: str | None = None, figsize: tuple = (12, 10)) None[source]
Plots a tree explaining the impact of features in the textural filters radiomics complexity level.
- Parameters:
path_experiments (Path) – Path to the folder containing the experiments.
experiment (str) – Name of the experiment to plot. Will be used to find the results.
level (str) – Radiomics complexity level to use for the plot.
modalities (list) – List of imaging modalities to include in the plot.
initial_width (float, optional) – Initial width of the lines. Defaults to 4. For aesthetic purposes.
lines_weight (float, optional) – Weight applied to the lines of the tree. Defaults to 1. For aesthetic purposes.
title (str, optional) – Title and name used to save the plot. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (12, 10).
- Returns:
None.
- to_json(response_train: list | None = None, response_test: list | None = None, response_holdout: list | None = None, patients_train: list | None = None, patients_test: list | None = None, patients_holdout: list | None = None, outcome_table_binary_train: DataFrame | None = None, outcome_table_binary_test: DataFrame | None = None, outcome_table_binary_holdout: DataFrame | None = None) dict[source]
Creates a dictionary with the results of the model using the class attributes.
- Parameters:
response_train (list) – List of machine learning model predictions for the training set.
response_test (list) – List of machine learning model predictions for the test set.
response_holdout (list) – List of machine learning model predictions for the holdout set.
patients_train (list) – List of patients in the training set.
patients_test (list) – List of patients in the test set.
patients_holdout (list) – List of patients in the holdout set.
outcome_table_binary_train (pd.DataFrame) – Binary outcome table for the training set.
outcome_table_binary_test (pd.DataFrame) – Binary outcome table for the test set.
outcome_table_binary_holdout (pd.DataFrame) – Binary outcome table for the holdout set.
- Returns:
Dictionary with the responses of the model and the patients used for training, testing and holdout.
- Return type:
Dict