Fast way to find cause and effect

18 Jun 2014. A new statistical methodology has been developed to identify the factors contributing to experimental outcomes in complex situations. The method is useful in medical studies, genetic research and financial analysis.

Uncovering the relationship between an outcome and its contributing factors is one of the most important tasks in scientific research. In most real-life problems, however, the contributing factors are not known beforehand and are often buried among many non-contributing variables. When the number of possible factors is very large, the task becomes exceedingly difficult. Such problems nevertheless arise in many important areas of science, including genetic research, medical studies and financial analysis. Professor CHEN Zehua from the Department of Statistics and Applied Probability in NUS, and his collaborator Dr LUO Shan from Shanghai Jiao Tong University, have developed a simple yet efficient method to correctly select the contributing factors (called features in statistics) and solve these tasks.

This new methodology, named the sequential Lasso (SLasso) method, has an edge over existing methods in both selection accuracy and computational effort. The Lasso (least absolute shrinkage and selection operator) itself has been known for some time. The innovation here is to solve a sequence of partially penalised least squares problems, in which features selected in earlier steps are not penalised, using the extended Bayesian information criterion (EBIC) as the stopping rule. It was demonstrated that SLasso correctly selects all the relevant features before any irrelevant feature is selected. The EBIC decreases until it attains its minimum at the model consisting of exactly the relevant features, and then begins to increase. The SLasso method has been tested on microarray data for mapping disease genes, and was found to outperform other methods (see Figure). Currently, the page and volume numbers of the journal article are not yet available.
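The selection loop described above can be illustrated with a simplified sketch. This is not the authors' implementation: at each step it admits the remaining feature most correlated with the current residual, refits the selected features by unpenalised least squares (mimicking the fact that earlier selections are not penalised), and stops when the EBIC no longer decreases. The function names, the toy data, and the choice of EBIC parameter `gamma = 1` are all illustrative assumptions.

```python
import numpy as np
from math import lgamma, log

def ebic(rss, n, k, p, gamma=1.0):
    """EBIC = n*log(RSS/n) + k*log(n) + 2*gamma*log(C(p, k))."""
    log_binom = lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
    return n * log(rss / n) + k * log(n) + 2 * gamma * log_binom

def sequential_select(X, y, gamma=1.0):
    """Greedy sketch of sequential selection with an EBIC stopping rule."""
    n, p = X.shape
    selected = []
    best_score = ebic(float(y @ y), n, 0, p, gamma)
    residual = y.copy()
    while len(selected) < min(n - 1, p):
        remaining = [j for j in range(p) if j not in selected]
        # candidate most correlated with the current residual enters next
        scores = [abs(X[:, j] @ residual) for j in remaining]
        j_star = remaining[int(np.argmax(scores))]
        trial = selected + [j_star]
        # unpenalised refit of all features chosen so far
        beta, *_ = np.linalg.lstsq(X[:, trial], y, rcond=None)
        rss = float(np.sum((y - X[:, trial] @ beta) ** 2))
        score = ebic(rss, n, len(trial), p, gamma)
        if score >= best_score:
            break  # EBIC stopped decreasing: keep the previous model
        best_score, selected = score, trial
        residual = y - X[:, trial] @ beta
    return selected

# toy sparse regression: only features 0 and 1 are relevant
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(200)
print(sorted(sequential_select(X, y)))  # the two relevant features
```

On this toy data the EBIC drops sharply as the two relevant features enter, then rises once an irrelevant feature is tried, which triggers the stopping rule, just as the article describes.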



Image shows that feature selection is crucial in the analysis of sparse high-dimensional regression (SHR) models such as the one above. [Image source: CHEN Zehua]


Luo S, Chen Z. "Sequential Lasso cum EBIC for feature selection with ultra-high dimensional feature space". Journal of the American Statistical Association (2014).