Project properties

Title Exploring the dimensions of the transcriptome landscape of living organisms.
Group Systems and Synthetic Biology
Project type thesis
Credits 36
Supervisor(s) Edoardo Saccenti
Examiner(s) Edoardo Saccenti, Rob Smith
Contact info robert1.smith@wur.nl
Begin date 2024/05/15
End date
Description Transcriptomics data are inherently high-dimensional and have a complex, largely unknown structure. This complexity arises from the large number of genes (ranging from a few hundred or a few thousand for bacteria, to around 20,000 for humans, up to more than 40,000 for some plants) and from the dynamic nature of their expression across different biological conditions. A transcriptomic dataset encapsulates a multitude of molecular signatures, capturing the intricate interplay of genes and their regulation. The high dimensionality of transcriptomics data not only reflects the richness of the biological systems they describe, but also poses significant computational and analytical challenges. Effective methods for data preprocessing, feature selection, and dimensionality reduction are essential to reveal meaningful patterns amidst the noise inherent in transcriptomics data.
Assessing the dimensionality of a data set is the first step for dimensionality reduction and as such is a critical, yet often neglected, step in data analysis.
The scope of this project is to assess the dimensionality of a large compendium of transcriptomics data sets in the context of Principal Component Analysis, using a variety of available techniques, either statistical or computational, which may fail in the high-dimensional context. This topic has seldom been addressed in the literature, so little is known about it. Moreover, it is not clear how sub-optimal dimensionality assessment affects the biological interpretation of the data, which is the ultimate goal of the analysis.
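As an illustration of what a computational dimensionality assessment can look like, the sketch below applies permutation-based parallel analysis to a synthetic samples-by-genes matrix. This is only one of many possible techniques and is not necessarily among those the project will use. The function names (pca_eigenvalues, parallel_analysis), the parameter defaults, and the simulated data with three latent factors are assumptions made for the example.

```python
import numpy as np


def pca_eigenvalues(X):
    """Eigenvalues of the sample covariance of X (samples x genes), descending."""
    Xc = X - X.mean(axis=0)                      # column-centre the data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    return s ** 2 / (X.shape[0] - 1)


def parallel_analysis(X, n_perm=50, quantile=0.95, seed=0):
    """Estimate the number of components by permutation-based parallel analysis.

    Each column is permuted independently, which destroys correlations between
    variables while keeping each variable's distribution. Leading components are
    retained as long as their eigenvalue exceeds the chosen quantile of the
    eigenvalues obtained from the permuted data.
    """
    rng = np.random.default_rng(seed)
    observed = pca_eigenvalues(X)
    null = np.empty((n_perm, observed.size))
    for i in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        null[i] = pca_eigenvalues(Xp)
    threshold = np.quantile(null, quantile, axis=0)
    above = observed > threshold
    # Count the leading run of components above the threshold.
    n_comp = above.size if above.all() else int(np.argmin(above))
    return n_comp, observed, threshold


# Synthetic example (assumed, not real transcriptomics data):
# 60 samples, 200 "genes", 3 strong latent factors plus unit-variance noise.
rng = np.random.default_rng(1)
n, p, k = 60, 200, 3
scores = rng.normal(size=(n, k)) * np.array([6.0, 5.0, 4.0])
loadings = rng.normal(size=(k, p))
X = scores @ loadings + rng.normal(size=(n, p))

n_comp, observed, threshold = parallel_analysis(X)
print("estimated number of components:", n_comp)
```

With far more genes than samples, as is typical for transcriptomics, the null eigenvalues obtained from permuted data can themselves be inflated, which is one way such a method may misjudge dimensionality in the high-dimensional setting.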

References

Michael Lenz, Franz-Josef Müller, Martin Zenke, and Andreas Schuppert. Principal components analysis and the reported low intrinsic dimensionality of gene expression microarray data. Scientific Reports, 6(1):25696, 2016.

Edoardo Saccenti and José Camacho. Determining the number of components in principal components analysis: A comparison of statistical, cross-validation and approximated methods. Chemometrics and Intelligent Laboratory Systems, 149:99–116, 2015.


Used skills Processing and analysis of large transcriptomics data sets; exploratory data analysis of omics data through Principal Component Analysis and related methods; understanding of statistical and algorithmic techniques for dimensionality assessment of high-dimensional biological data; advanced programming skills in R and Matlab
Requirements Basic statistics (univariate and multivariate), basic molecular biology, basic understanding of omics data (transcriptomics). Machine learning at the level of the Molecular Systems Biology course (SSB-30306). The ability to program in R and Matlab is an advantage.