Taming Big Data

16 July 2014. NUS statisticians have developed new statistical approaches and methodologies to deal with complex big data.

The term “Big Data” has recently been coined to describe data sets so large and complex that they are impossible to handle using traditional data tools and applications. Fundamentally new statistical or mathematical ideas are therefore needed to capture, curate, store, search, share, analyze and visualize such data. Big data is likely to play an increasingly important role in investigating important scientific questions, from materials to biomedical sciences to cosmology, and in enabling sophisticated manufacturing and the growth of small cities. The growth of big data presents challenges as well as opportunities in many research areas, such as neuroeconomics and finance. However, the massive sample size, high dimensionality and complex dependence of big data create unique computational and statistical challenges that conventional statistical and analytical methods cannot handle. Thus, new techniques are needed to allow investigators to interrogate big data in as informative, accurate and flexible a manner as possible.

Professor CHEN Ying’s team from the Department of Statistics at NUS has been developing new statistical approaches and methodologies to deal with big data. The team conceptualizes big data as curves, surfaces and other more complex geometric objects that evolve over time. They found that, viewed this way, big data can be estimated, modeled and forecast with much improved efficiency and accuracy. They have applied the method to forecast the daily electricity price curves in California (see Figure 1). The new method reduces forecast errors by 15%, outperforming several other popular models. The team is also applying its methods to the analysis of functional magnetic resonance imaging (fMRI) data (see Figure 2).
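To make the idea of treating data as curves more concrete, here is a minimal sketch in Python. It is not the team's method, which the article does not specify. It assumes a simple stand-in: each day's 24 hourly values form one curve, the curves are summarized by a few principal components, and each component score is forecast one day ahead with a first-order autoregression. The data are synthetic, and the function names `make_synthetic_curves` and `forecast_next_curve` are hypothetical.

```python
import numpy as np


def make_synthetic_curves(n_days=200, n_hours=24, seed=0):
    """Synthetic daily curves (NOT the California data): a fixed daily
    shape whose amplitude follows an AR(1) process, plus noise."""
    rng = np.random.default_rng(seed)
    hours = np.arange(n_hours)
    shape = np.sin(2 * np.pi * hours / n_hours)
    amplitude = np.zeros(n_days)
    for t in range(1, n_days):
        amplitude[t] = 0.8 * amplitude[t - 1] + rng.normal(scale=0.5)
    noise = rng.normal(scale=0.1, size=(n_days, n_hours))
    return 3.0 + amplitude[:, None] * shape[None, :] + noise


def forecast_next_curve(curves, n_components=2):
    """One-step-ahead forecast of the next daily curve.

    Each row of `curves` is one day, treated as a single curve.
    Assumed approach: principal components plus AR(1) on the scores.
    """
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # Principal component directions of the centered curves via SVD.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]        # shape (n_components, n_hours)
    scores = centered @ basis.T      # shape (n_days, n_components)
    # Least-squares AR(1) coefficient (no intercept) for each score series.
    prev, nxt = scores[:-1], scores[1:]
    phi = (prev * nxt).sum(axis=0) / (prev ** 2).sum(axis=0)
    next_scores = phi * scores[-1]
    # Rebuild the forecast curve from the forecast scores.
    return mean_curve + next_scores @ basis


curves = make_synthetic_curves()
forecast = forecast_next_curve(curves[:-1])  # hold out the last day
print(forecast.shape)
```

The point of the sketch is the change of viewpoint: instead of forecasting 24 separate hourly series, a whole day is forecast as one object described by a handful of scores.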


Figure 1. Application of Big Data analysis to California electricity market prices. Left: Log-prices 24 hours a day, 7 days a week, from 5 July 1999 to 11 June 2000. Right: Smoothed log-price curves of the California electricity market. (Image credit: CHEN Ying)


Figure 2. Application of Big Data analysis to fMRI images: (a) Estimated function with largest values in Parietal Cortex; (b) with largest values in ventrolateral prefrontal cortex (VLPFC); (c) with largest values in middle orbito-frontal cortex (mOFC) and inferior orbito-frontal cortex (lOFC); (d) with largest values in Parietal Cortex; (e) with largest values in anterior insula (aINS). (Data source: fMRI data from CRC649, RPID tasks, 17 subjects, healthy young Germans, right-handed, 91×109×91 voxels, 27×3 trials and 1360 scans (2.5 seconds per scan), sfb649.wiwi.hu-berlin.de)