Features Selection in Statistical Classification of High Dimensional Image Derived Maize (Zea Mays L.) Phenomic Data
Abstract
Phenotyping has advanced with the application of high-throughput phenotyping techniques such as
automated imaging. This has led to the derivation of large quantities of high-dimensional phenotypic data that could not
have been obtained using manual phenotyping in a single run. Hence, there is a need for parallel development of statistical
techniques that can appropriately handle such large and/or high-dimensional data sets. Moreover, there is a need for
statistical criteria for selecting the image-derived phenotypic features that serve as the best
predictors in modelling plant growth. Information on such criteria is limited. The objective of this study is to apply
feature importance, feature selection with Shapley values and LASSO regression techniques to find the subset of
features with the highest predictive power for subsequent use in modelling maize plant growth using high-dimensional
image-derived phenotypic data. The study compared the predictive power of these feature selection
methods by fitting an XGBoost model using the best features from each selection method. The image-derived
phenomic data was obtained from the Leibniz Institute of Plant Genetics and Crop Plant Research, Gatersleben,
Germany. Data analysis was performed using R statistical software. Missing values in the data were imputed using
the k-nearest neighbours technique. Feature selection was performed using feature importance, Shapley values and
LASSO regression. Shapley values selected 25 phenotypic features, feature importance selected 31 features
and LASSO regression selected 12 features. Of the three techniques, the feature importance criterion emerged as the
best feature selection technique, followed by Shapley values and LASSO regression, respectively. The study
demonstrated the potential of feature importance as a selection technique for reducing the number of input variables in
high-dimensional growth data sets.
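
To illustrate how a LASSO penalty can act as a feature selector, the following is a minimal sketch rather than the study's implementation. It is written in Python instead of the R software used in the study, it runs on synthetic data rather than the maize phenomic data, and the function names (soft_threshold, standardize_columns, lasso_select) and the penalty value lam are hypothetical choices made for this example. The sketch fits a LASSO model by coordinate descent on standardised features and keeps the features whose coefficients are not shrunk exactly to zero.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardize_columns(X):
    """Return the columns of X, each centred and scaled to unit variance."""
    n = len(X)
    columns = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        if sd == 0.0:
            sd = 1.0  # constant column: it stays all zeros and is never selected
        columns.append([(v - mean) / sd for v in col])
    return columns


def lasso_select(X, y, lam, max_iter=500, tol=1e-8):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by coordinate descent.

    Returns the indices of features with non-zero coefficients and the
    coefficients themselves (on the standardised feature scale).
    """
    n = len(y)
    cols = standardize_columns(X)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residuals of the intercept-only model
    beta = [0.0] * len(cols)
    for _ in range(max_iter):
        max_change = 0.0
        for j, col in enumerate(cols):
            # Covariance of feature j with the partial residual that
            # excludes feature j's own current contribution.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new_beta = soft_threshold(rho, lam)
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new_beta
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    return selected, beta


# Synthetic stand-in data (an assumption for illustration only):
# 10 features, of which only features 0 and 3 drive the response.
random.seed(0)
n_samples, n_features = 300, 10
X = [[random.gauss(0.0, 1.0) for _ in range(n_features)]
     for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[3] + random.gauss(0.0, 0.5) for row in X]

selected, beta = lasso_select(X, y, lam=0.5)
print("selected feature indices:", selected)
```

In practice the penalty would normally be tuned, for example by cross-validation, and the retained features would then be passed to the downstream model; a larger penalty retains fewer features.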