housekeepingMinerPy modules¶
Modules¶
housekeepingMinerPy.mining module¶
- housekeepingMinerPy.mining.balance_resample(y_var: list = None, random_state: int = 42, sample_size: int = None, replace: bool = False)¶
Create balanced subsamples based on the size of the smallest class.
- Parameters:
y_var (list) – List or numpy.array of classes.
random_state (int) – It is a seed number to guarantee reproducibility.
sample_size (int) – Minimum number of samples in each class.
replace (bool) – Argument passed to np.random.choice. If False, each sample appears at most once in the subsample (sampling without replacement); otherwise the same sample can be drawn more than once.
- Returns:
Return a list of indices of a balanced subsample.
- Return type:
list
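The resampling described for balance_resample can be illustrated with plain NumPy. This is a minimal sketch, not the package's implementation: the helper name balanced_indices is hypothetical, and it assumes that sample_size=None means "use the size of the smallest class".

```python
import numpy as np

def balanced_indices(y_var, random_state=42, sample_size=None, replace=False):
    """Return indices of a subsample with the same number of items per class.

    Hypothetical helper for illustration; not part of housekeepingMinerPy.
    """
    y = np.asarray(y_var)
    rng = np.random.RandomState(random_state)  # seeded for reproducibility
    classes, counts = np.unique(y, return_counts=True)
    # Assumption: default to the size of the smallest class
    n_per_class = int(counts.min()) if sample_size is None else sample_size
    indices = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)  # positions of this class
        chosen = rng.choice(cls_idx, size=n_per_class, replace=replace)
        indices.extend(chosen.tolist())
    return indices

y = ["tumor"] * 5 + ["normal"] * 2
idx = balanced_indices(y)  # two indices per class
```

With replace=False, sample_size cannot exceed the size of the smallest class, because np.random.choice raises an error when asked for more unique items than are available.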
- housekeepingMinerPy.mining.boruta_selection(adata, layer: str = None, class_col: str = None, scaler=None, rf_model=None, class_weight: list = None, random_state: int = 42, alpha: float = 0.05)¶
Call the Boruta algorithm from the BorutaPy package for feature selection. This function balances the classes via class_weight.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
class_col (str) – Column in adata.obs that holds the classes to be compared.
scaler (scikit-learn preprocessing scaler object) – Optional scaler used to fit and transform the data before feature selection. If None, no transformation is applied.
rf_model (scikit-learn RandomForestClassifier object) – Optional random forest estimator. If None, a default sklearn.ensemble.RandomForestClassifier is applied.
class_weight (list) – Weight for each class. If None, the sklearn.utils.class_weight.compute_class_weight function is used to calculate balanced weights.
random_state (int) – Seed number to guarantee reproducibility.
alpha (float) – Alpha for the p-value cutoff in the Boruta decision.
- Returns:
Return a dictionary with genes, their ranks and support. {‘genes’:[], ‘rank’:[], ‘support’:[]}
- Return type:
dict
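For reference, the "balanced" heuristic of sklearn.utils.class_weight.compute_class_weight assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula with NumPy only; the helper name balanced_class_weights is hypothetical, and how boruta_selection passes the weights on to the random forest is not shown here.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count).

    Hypothetical helper mirroring scikit-learn's 'balanced' heuristic.
    """
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Three samples of class 0 and one of class 1: the rare class gets more weight
weights = balanced_class_weights([0, 0, 0, 1])
```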
- housekeepingMinerPy.mining.exprs_cv(adata, layer: str = None, groups_col: str = None, return_mean_per_group: bool = False, return_std_per_group: bool = False, return_cv_per_group: bool = False)¶
Calculate the coefficient of variation (cv) of expression for each gene. To give the same weight to different groups, you can provide groups_col; a pooled cv is then calculated with equal weight for each group.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
groups_col (string) – Column in adata.obs annotations used for the stratified calculation of the cv. If None, the groups are not weighted equally: the group with more samples has greater weight, and the resulting column is named simple_cv instead of pooled_cv.
return_mean_per_group (bool) – If True, the mean calculated for each group is stored in an adata.var column.
return_std_per_group (bool) – If True, the standard deviation (std) calculated for each group is stored in an adata.var column.
return_cv_per_group (bool) – If True, the cv calculated for each group is stored in an adata.var column.
- Returns:
Return the adata with additional columns [‘pool_mean’, ‘pool_std’, ‘pool_cv’].
- Return type:
adata
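The difference between a simple cv and a group-balanced cv can be sketched for a single gene as follows. The pooling rule used here (average the per-group means, average the per-group variances, then divide) is an assumption chosen for illustration; the formula actually used by exprs_cv is not stated in this reference, and both function names are hypothetical.

```python
import numpy as np

def simple_cv(x):
    """cv over all samples; larger groups dominate the estimate."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

def equal_weight_cv(x, groups):
    """cv where every group contributes equally (assumed pooling rule)."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    means, variances = [], []
    for g in np.unique(groups):
        values = x[groups == g]
        means.append(values.mean())
        variances.append(values.var(ddof=1))
    pool_mean = np.mean(means)               # equal weight per group
    pool_std = np.sqrt(np.mean(variances))   # equal weight per group
    return pool_std / pool_mean

expr = [9, 11, 9, 11, 19, 21]              # one gene, six samples
study = ["a", "a", "a", "a", "b", "b"]     # two groups of unequal size
cv_simple = simple_cv(expr)
cv_pooled = equal_weight_cv(expr, study)
```

In this toy example the simple cv is inflated by the shift between the two groups, while the group-balanced value reflects only the within-group spread.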
- housekeepingMinerPy.mining.gene_gini_coeff(adata, layer: str = None, groups_col: str = None)¶
Calculate the Gini coefficient of each gene. G = 1 + 1/n - 2*sum_i(rank_i*x_i) / (n*sum_i(x_i))
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
groups_col (string) – We recommend using None, since the Gini coefficient is computed on ordered values and is not affected by batches. Column in adata.obs annotations used for the stratified calculation of Gini. Since there is no established way to pool Gini, only the values per group are calculated, not a pooled one. If None, the calculation will not give the same weight to each group; the group with more samples will have greater weight. The name of the column will be simple_cv instead of pooled_cv.
- Returns:
Return the adata with additional column [‘gini_coefficient’].
- Return type:
adata
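The Gini expression given for gene_gini_coeff can be checked numerically. For that expression to equal the usual Gini coefficient, the ranks must be assigned in decreasing order of x (rank 1 for the largest value); the sketch below makes that assumption. The function name gini is hypothetical and the code is not taken from the package.

```python
import numpy as np

def gini(x):
    """G = 1 + 1/n - 2*sum(rank_i*x_i) / (n*sum(x_i)), ranks in decreasing order."""
    x = np.sort(np.asarray(x, dtype=float))[::-1]  # largest value first
    n = x.size
    ranks = np.arange(1, n + 1)                    # rank 1 = largest value
    return 1 + 1 / n - 2 * np.sum(ranks * x) / (n * np.sum(x))

g_equal = gini([1, 1, 1, 1])    # perfectly even expression
g_single = gini([0, 0, 0, 1])   # expression concentrated in one sample
```

An evenly expressed gene gives G = 0, and a gene expressed in a single sample out of n gives G = 1 - 1/n.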
- housekeepingMinerPy.mining.hkg_selection_ga(adata, layer: str = None, outlier_threshold: float = 0.9, fitness_function: str = 'minimize_outliers', fitness_function_model=None, y: str = None, suppress_warnings: bool = False)¶
- housekeepingMinerPy.mining.pooled_tost(adata, layer: str = None, class_col: str = None, combinations_list: list = None, groups_col: str = None, method: str = 'fisher', cohens_d: float = 0.5, is_parametric: bool = False, is_paired: bool = False, correct_fdr: bool = True)¶
Perform two one-sided tests (TOST) for equivalence between two conditions within each group, and then pool the p-values.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
class_col (str) – Column in adata.obs that holds the classes to be compared.
combinations_list (list) – List of tuples with the combinations to be tested. If None, all pairwise combinations of different classes will be tested.
groups_col (str) – Column in adata.obs that holds the groups used to pool the partial p-values.
cohens_d (float) – Cohen's d value. It sets the upper and lower equivalence bounds as a fraction of the standard deviation.
is_parametric (bool) – If True, perform a parametric test (t-test); if False, a non-parametric test is performed. For the paired non-parametric case the Wilcoxon test is applied, otherwise the Brunner-Munzel test is applied.
is_paired (bool) – If True, a paired test is applied: a paired t-test in the parametric option, otherwise a Wilcoxon test.
correct_fdr (bool) – If True, the false discovery rate is calculated with the Benjamini-Hochberg method.
- Returns:
Return a table with all pooled adjusted p-values.
- Return type:
adata
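The default pooling method in the signature is 'fisher'. As an illustration of Fisher's method in general (not of the package's internal code, which may rely on a library routine), the statistic X = -2*sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and its tail probability has a closed form for even degrees of freedom. The helper name fisher_pool is hypothetical.

```python
import math

def fisher_pool(p_values):
    """Combine k independent p-values (all > 0) with Fisher's method."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X/2, with X = -2*sum(ln p)
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i<k} (X/2)^i / i!
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail

# One p-value per group for the same gene and comparison
pooled = fisher_pool([0.04, 0.03, 0.20])
```

A single p-value is returned unchanged, which is a quick sanity check for any implementation of the method.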
- housekeepingMinerPy.mining.sclustering_cv_stb_gini(adata, cl_cols: list = [], scaler_object=None, kMeans=None)¶
Calculate supervised clustering with the KMeans algorithm (by default) based on the columns set in the cl_cols list. Those columns are used as features for the clustering.
- Parameters:
adata (Anndata)
cl_cols (list, optional) – If an empty list is provided, the columns pool_cv, pool_stability_cv, pool_mean and gini_coefficient are used automatically.
scaler_object (scikit-learn preprocessing scaler object) – Optional scaler used to fit and transform the data before clustering. If None, no transformation is applied.
kMeans (scikit-learn KMeans object) – Optional KMeans estimator. If None, a default sklearn.cluster.KMeans is applied.
- Returns:
Return the adata with additional column [‘gini_coefficient’].
- Return type:
adata
- housekeepingMinerPy.mining.set_balance_resample(y_var: list = None, n_set: int = 5, random_state: int = 42, sample_size: int = None, replace: bool = False)¶
Create a list of n_set balanced subsamples.
- Parameters:
y_var (list) – List or numpy.array of classes.
n_set (int) – Number of Boruta algorithm runs on different subsamples.
random_state (int) – Seed number to guarantee reproducibility.
sample_size (int) – Minimum number of samples in each class.
replace (bool) – Argument passed to np.random.choice. If False, each sample appears at most once in the subsample (sampling without replacement); otherwise the same sample can be drawn more than once.
- Returns:
Return a list with n_set lists of indices, one per balanced subsample.
- Return type:
list
- housekeepingMinerPy.mining.set_boruta_selection(adata, layer: str = None, class_col: str = None, scaler=None, rf_model=None, random_state: int = 42, class_weight: list = None, n_set: int = 5, sample_size: int = None, replace: bool = False, alpha: float = 0.05)¶
Call the Boruta algorithm from the BorutaPy package for feature selection n_set times on different subsamples.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
class_col (str) – Column in adata.obs that holds the classes to be compared.
scaler (scikit-learn preprocessing scaler object) – Optional scaler used to fit and transform the data before feature selection. If None, no transformation is applied.
rf_model (scikit-learn RandomForestClassifier object) – Optional random forest estimator. If None, a default sklearn.ensemble.RandomForestClassifier is applied.
class_weight (list) – Weight for each class. If None, the sklearn.utils.class_weight.compute_class_weight function is used to calculate balanced weights.
random_state (int) – Seed number to guarantee reproducibility.
n_set (int) – Number of Boruta algorithm runs on different subsamples.
sample_size (int) – Minimum number of samples in each class.
replace (bool) – Argument passed to np.random.choice. If False, each sample appears at most once in the subsample (sampling without replacement); otherwise the same sample can be drawn more than once.
alpha (float) – Alpha for the p-value cutoff in the Boruta decision.
- Returns:
Return a list with dictionaries from boruta_selection function, [{‘genes’:[], ‘rank’:[], ‘support’:[]},…]
- Return type:
list
- housekeepingMinerPy.mining.stability_cv(adata, layer: str = None, groups_col: str = None, return_stb_cv_per_group: bool = False)¶
Calculate the average coefficient of variation (cv) of stability for each pair of genes. To give the same weight to different groups, you can provide groups_col; a pooled stability cv is then calculated with equal weight for each group.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
groups_col (string) – Column in adata.obs annotations used for the stratified calculation of the cv. If None, the groups are not weighted equally: the group with more samples has greater weight, and the resulting column is named simple_cv instead of pooled_cv.
return_stb_cv_per_group (bool) – If True, the cv-stability calculated for each group is stored in an adata.var column.
- Returns:
Return the adata with additional column [‘pool_stability_cv’].
- Return type:
adata
- housekeepingMinerPy.mining.tost(adata, layer: str = None, class_col: str = None, combinations_list: list = None, cohens_d: float = 0.5, is_parametric: bool = False, is_paired: bool = False, correct_fdr: bool = True)¶
Perform two one-sided tests (TOST) for equivalence between two conditions.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
class_col (str) – Column in adata.obs that holds the classes to be compared.
combinations_list (list) – List of tuples with the combinations to be tested. If None, all pairwise combinations of different classes will be tested.
cohens_d (float) – Cohen's d value. It sets the upper and lower equivalence bounds as a fraction of the standard deviation.
is_parametric (bool) – If True, perform a parametric test (t-test); if False, a non-parametric test is performed. For the paired non-parametric case the Wilcoxon test is applied, otherwise the Brunner-Munzel test is applied.
is_paired (bool) – If True, a paired test is applied: a paired t-test in the parametric option, otherwise a Wilcoxon test.
correct_fdr (bool) – If True, the false discovery rate is calculated with the Benjamini-Hochberg method.
- Returns:
Return a table with all adjusted p-values.
- Return type:
adata
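The correct_fdr option refers to the Benjamini-Hochberg procedure. A minimal NumPy sketch of that adjustment is shown below; the helper name benjamini_hochberg is hypothetical, and the package itself may delegate this step to another library.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: each value is the minimum of itself and all larger ranks
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0, 1)
    return adjusted

# One p-value per gene from the equivalence test
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```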
- housekeepingMinerPy.mining.uclustering_cv_stb_gini(adata, cl_cols: list = [], scaler_object=None, nearestNeighbors_object=None, louvain_object=None, resolution: float = 1)¶
Calculate unsupervised clusters with the Louvain algorithm based on the columns set in the cl_cols list. Those columns are used as features for the clustering.
- Parameters:
adata (Anndata)
cl_cols (list, optional) – If an empty list is provided, the columns pool_cv, pool_stability_cv, pool_mean and gini_coefficient are used automatically.
groups_col (string) – We recommend using None, since the Gini coefficient is computed on ordered values and is not affected by batches. Column in adata.obs annotations used for the stratified calculation of Gini. Since there is no established way to pool Gini, only the values per group are calculated, not a pooled one. If None, the calculation will not give the same weight to each group; the group with more samples will have greater weight. The name of the column will be simple_cv instead of pooled_cv.
scaler_object (scikit-learn preprocessing scaler object) – Optional scaler used to fit and transform the data before clustering. If None, no transformation is applied.
nearestNeighbors_object (scikit-learn NearestNeighbors object) – Optional estimator used to calculate the closest neighbors. If None, a default sklearn.neighbors.NearestNeighbors is applied.
louvain_object (scikit-network Louvain object) – Optional object used for the clustering. If None, a default sknetwork.clustering.Louvain is applied.
resolution (float) – Resolution parameter of the sknetwork.clustering.Louvain object, used when louvain_object is None.
- Returns:
Return the adata with additional column [‘gini_coefficient’].
- Return type:
adata
housekeepingMinerPy.plot module¶
- housekeepingMinerPy.plot.plot_corr(adata, layer: str = None, r_pearson_lim: float = 0.5, p_value_lim: float = 0.05, correct_fdr: bool = True, color_threshold: float = 0, bbox_to_anchor: tuple = (1.2, 1), figsize: tuple = (5.566666666666666, 5.566666666666666), savefig: dict = None)¶
Scatterplot of Pearson correlation, top barplot with quantity of genes correlated and right dendrogram clustering genes by distance correlation.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used. If layer is not provided, adata.X will be used.
r_pearson_lim (float) – Threshold of the Pearson R value to highlight.
p_value_lim (float) – Limit of p-value to consider the test statistically significant.
correct_fdr (bool) – If True, the false discovery rate is calculated with the Benjamini-Hochberg method.
color_threshold (float) – Limit of cophenetic distance used to color the groups on the dendrogram.
palette (string, optional) – Name of the matplotlib colormap.
bbox_to_anchor (tuple) – Tuple with the position at which to place the legend.
figsize (tuple) – Tuple with the width and height of the figure.
savefig (dict, optional) – Dictionary with arguments for matplotlib.pyplot.savefig. Example: {‘fname’:’./test.pdf’, ‘format’:’pdf’, ‘dpi’:300}
- Return type:
None
- housekeepingMinerPy.plot.plot_stb_cv_gini(adata, x: str = 'pool_cv', y: str = 'pool_stability_cv', z: str = 'gini_coefficient', hue: str = 'uclustering_cv_stb_labels', palette: str = None, legend: bool = False, median_line: bool = True, ann_genes: list = None, highlight_group: str = None, figsize: tuple = (16.7, 8.35), savefig: dict = None)¶
Plot six main plots combining three variables, for example: coefficient of variation, Gini coefficient and CV of stability. Each row of plots contains a scatterplot and a boxplot. The scatterplot is drawn with two variables, and on each axis a histogram shows the distribution of the corresponding variable. The second plot is a boxplot of the variable from the vertical axis of the scatterplot, split into groups. A barplot on top of the boxplot shows the number of genes in each group.
- Parameters:
adata (Anndata)
x (string) – Variable to be plotted as the vertical axis on the first row of plots.
y (string) – Variable to be plotted as the vertical axis on the second row of plots.
z (string) – Variable to be plotted as the vertical axis on the third row of plots.
hue (string) – Column to be considered to group by colors in scatterplot and boxplot.
palette (string, optional) – Name of the matplotlib colormap.
legend (bool) – If True, a legend of hue is displayed on the scatterplot.
median_line (bool) – If True, the median of each variable is plotted as dot-points in the scatterplot and boxplot.
ann_genes (list, optional) – List of genes that should have their names displayed on the scatterplot.
highlight_group (str, optional) – Name of the group to be highlighted on the scatterplot.
figsize (tuple) – Tuple with the width and height of the figure.
savefig (dict, optional) – Dictionary with arguments for matplotlib.pyplot.savefig. Example: {‘fname’:’./test.pdf’, ‘format’:’pdf’, ‘dpi’:300}
- Return type:
None
housekeepingMinerPy.pp module¶
- class housekeepingMinerPy.pp.MRN_transformer(*args: Any, **kwargs: Any)¶
Bases: BaseEstimator, TransformerMixin
MRN adaptation to be used in a scikit-learn pipeline.
- fit(X, y=None)¶
- transform(X, y=None)¶
- class housekeepingMinerPy.pp.TMM_transformer(*args: Any, **kwargs: Any)¶
Bases: BaseEstimator, TransformerMixin
TMM adaptation to be used in a scikit-learn pipeline.
- fit(X, y=None)¶
- transform(X, y=None)¶
- housekeepingMinerPy.pp.create_groups(adata, layer: str = None, study_col: str = None, scaler_object=None, nearestNeighbors_object=None, louvain_object=None, umap_object=None)¶
Create groups based on Nearest Neighbors and the Louvain algorithm. It is useful when the demographics and metadata are not available. The input must have features (genes) as columns and samples as rows.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be used for clustering. If layer is not provided, adata.X will be used.
study_col (string, optional) – Column in adata.obs used to stratify the clustering by study. If study_col is not provided, the entire dataset is clustered without stratification by study.
scaler_object (scikit-learn scaler object) – Object used to transform the data before clustering. By default no transformation is applied. Example: sklearn.preprocessing.StandardScaler()
nearestNeighbors_object (scikit-learn NearestNeighbors object) – scikit-learn object used to calculate the neighbors. The default is sklearn.neighbors.NearestNeighbors()
louvain_object (sknetwork object) – Louvain object used to cluster the neighbors. The default is sknetwork.clustering.Louvain()
umap_object (UMAP object) – UMAP object used to reduce the dimensionality, for 2D visualization purposes only. The default is umap.UMAP()
- Returns:
adata – Return the same adata input with three other columns in adata.obs (‘louvain_group’,’UMAP_1’,’UMAP_2’)
- Return type:
AnnData object
- housekeepingMinerPy.pp.log_transform(adata, layer: str = None, method: str = 'arcsinh')¶
Apply a log transformation to an expression count table, adding a pseudocount (+1).
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be log transformed. If layer is not provided, adata.X will be used.
method (string) – Method of log transformation that avoids np.nan for log(0). It can be one of [‘arcsinh’, ‘log1p’].
- Returns:
Return the adata with additional layer ‘arcsinh’ or ‘log1p’.
- Return type:
adata
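The two method names correspond to standard NumPy functions, and the snippet below compares them on a few counts. It only illustrates the underlying math; how log_transform applies them to the AnnData layer is not shown here.

```python
import numpy as np

counts = np.array([0.0, 1.0, 10.0, 1000.0])

# log1p: ln(1 + x), i.e. a natural log with a pseudocount of 1
log1p_values = np.log1p(counts)

# arcsinh: ln(x + sqrt(x^2 + 1)); close to ln(2x) for large counts
arcsinh_values = np.arcsinh(counts)
```

Both functions map a count of zero to zero and stay finite, which is why they are preferred over a plain logarithm for count data.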
- housekeepingMinerPy.pp.mrn(data, return_norm_factors=False)¶
Normalize a counts matrix by Median of Ratios. This function is part of the project https://gitlab.com/georgy.m/conorm. It is included in this package for ease of use.
- Parameters:
data (array_like) – Counts dataframe to normalize (rows are genes). Most often either a pandas DataFrame or a NumPy matrix.
return_norm_factors (bool, optional) – If True, then norm factors are also returned. The default is False.
- Returns:
data – Normalized data.
- Return type:
array_like
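The median-of-ratios idea can be sketched in a few lines of NumPy. This is a textbook version written for illustration, with a hypothetical function name, and it is not the conorm implementation; details such as the handling of zero counts are assumptions.

```python
import numpy as np

def median_of_ratios_factors(counts):
    """Per-sample norm factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: use only genes with non-zero counts in every sample
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Log of the per-gene geometric mean across samples (pseudo-reference)
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Median log-ratio to the reference, per sample
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Rows are genes, columns are samples; sample 2 is sequenced twice as deep
counts = np.array([[10, 20], [100, 200], [0, 5], [30, 60]])
factors = median_of_ratios_factors(counts)
normalized = counts / factors
```

After dividing by the factors, genes that differ only by sequencing depth have the same normalized value in both samples.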
- housekeepingMinerPy.pp.mrn_norm_factors(data)¶
Compute Median of Ratios norm factors. This function is part of the project https://gitlab.com/georgy.m/conorm. It is included in this package for ease of use.
- Parameters:
data (array_like) – Counts dataframe to normalize (rows are genes). Most often either a pandas DataFrame or a NumPy matrix.
- Returns:
tmms – Norm factors.
- Return type:
np.ndarray or pd.DataFrame
- housekeepingMinerPy.pp.tmm(data, trim_lfc=0.3, trim_mag=0.05, index_ref=None, return_norm_factors=False)¶
Normalize a counts matrix by Trimmed Means of M-values (TMM). This function is part of the project https://gitlab.com/georgy.m/conorm. It is included in this package for ease of use.
- Parameters:
data (array_like) – Counts dataframe to normalize (rows are genes). Most often either a pandas DataFrame or a NumPy matrix.
trim_lfc (float, optional) – Quantile cutoff for M_g (logfoldchanges). The default is 0.3.
trim_mag (float, optional) – Quantile cutoff for A_g (log magnitude). The default is 0.05.
index_ref (float, str, optional) – Reference index or column name to use as reference in the TMM algorithm. The default is None.
return_norm_factors (bool, optional) – If True, then norm factors are also returned. The default is False.
- Returns:
data – Normalized data.
- Return type:
array_like
- housekeepingMinerPy.pp.tmm_norm_factors(data, trim_lfc=0.3, trim_mag=0.05, index_ref=None)¶
Compute Trimmed Means of M-values norm factors. This function is part of the project https://gitlab.com/georgy.m/conorm. It is included in this package for ease of use.
- Parameters:
data (array_like) – Counts dataframe to normalize (rows are genes). Most often either a pandas DataFrame or a NumPy matrix.
trim_lfc (float, optional) – Quantile cutoff for M_g (logfoldchanges). The default is 0.3.
trim_mag (float, optional) – Quantile cutoff for A_g (log magnitude). The default is 0.05.
index_ref (float, str, optional) – Reference index or column name to use as reference in the TMM algorithm. The default is None.
- Returns:
tmms – Norm factors.
- Return type:
np.ndarray or pd.DataFrame
- housekeepingMinerPy.pp.transform_exprs(adata, layer: str = None, groups_col: str = None, trns_dict: dict = None)¶
Transform expression data with the MRN, TMM, quantile or power transformation. The method for each group is given in the dict trns_dict, where the key is the group and the value is the method, for example {0:’TMM’, 1:’quantile’}. If the method assigned to a group is not in [‘MRN’, ‘TMM’, ‘quantile’, ‘power’], that subset will not be transformed. The groups of interest must be listed in the adata.obs groups_col column to perform a grouped transformation. The input must have features (genes) as columns and samples as rows.
- Parameters:
adata (Anndata)
layer (string, optional) – Layer of the AnnData object to be transformed. If layer is not provided, adata.X will be used.
groups_col (string) – Column in adata.obs annotations that defines the groups to be transformed independently.
trns_dict (dict) – Dictionary that maps each group in adata.obs[groups_col] to a transformation method. For example, if your groups are [0,1,2], the dict must be like {0:’TMM’, 1:’TMM’, 2:’quantile’}. The respective method is applied to each group. trns_dict values must be in [‘MRN’, ‘TMM’, ‘quantile’, ‘power’]; otherwise the group will not be transformed.
- Returns:
Return the adata with additional layer ‘trns’.
- Return type:
adata
- housekeepingMinerPy.pp.transform_exprs_Microarray(X, trns_method: str = 'quantile')¶
Transform Microarray data with the quantile or power transformation. This function uses scikit-learn to calculate the transformations. Important! The input must be integer counts or pseudocounts; do not use TPM-normalized data as input. The input must have features (genes) as columns and samples as rows.
- Parameters:
X (np.array) – Integer count matrix. Columns as features and rows as samples.
trns_method (string or scikit-learn transform object) – Type of method used. The options are ‘quantile’ or ‘power’, or a scikit-learn sklearn.preprocessing object.
- Returns:
Return the X input transformed
- Return type:
np.array
- housekeepingMinerPy.pp.transform_exprs_RNAseq(X, trns_method: str = 'MRN')¶
Transform RNAseq count data with MRN or TMM (DESeq2 and EdgeR respectively). This function uses the conorm package to calculate the transformations. Important! The input must be integer counts or pseudocounts; do not use TPM-normalized data as input. The input must have features (genes) as columns and samples as rows.
- Parameters:
X (np.array) – Integer count matrix
trns_method (string, optional) – Type of method used. The options are MRN (DESeq2) or TMM (EdgeR).
- Returns:
Return the X input transformed
- Return type:
np.array