case western reserve university

PENG LIU
PhD Student
Department of Statistics

 

RESEARCH

Research Interests

Statistical Data Mining, Machine Learning, Feature Selection, Mixture Modeling;
High-dimensional Data Analysis (image analysis, array data analysis);
Computationally Intensive Statistical Methods, Bagging and Boosting, Resampling/Subsampling.

Dissertation Research: New Methods for Large Data Analysis
Part I: Adaptive Estimation for Mixture Parameters

Mixture models are often used to analyze data that arise from heterogeneous populations, and they have applications in many scientific disciplines. When data arrive in large quantities or sequentially over time, the standard EM algorithm either cannot be applied to the whole data set to obtain parameter estimates of the mixture model or is not computationally efficient. New algorithms, called Partial EM and Bayesian Partial EM, are proposed to address the problem of estimating mixture parameters for large data and online data. The idea of Partial EM is to first obtain the MLE by running EM on the first part of the data, and then to combine that parameter estimate (without revisiting the first part of the data) with the information from the second part of the data to get an updated estimate. Coupling the new estimators with the recent D-test for homogeneity of finite mixture distributions, whose test statistic has a closed-form expression in terms of the parameter estimators alone, could yield a data mining procedure with a promising application in intrusion detection.
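
As a rough illustration of the two-stage idea described above, the following Python sketch fits a one-dimensional Gaussian mixture by ordinary EM on a first block of data and then updates the estimate from a second block without touching the first block again. The combination rule used here is an assumption made for illustration and is not taken from the dissertation: the old estimate is converted into pseudo sufficient statistics weighted by the first block's size. The function names (`em`, `partial_update`, `simulate_mixture`) and the simulated data are likewise hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(data, w, mu, var):
    """Posterior component probabilities (responsibilities) for each point."""
    resp = []
    for x in data:
        dens = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(len(w))]
        total = sum(dens)
        resp.append([d / total for d in dens])
    return resp


def sufficient_stats(data, resp, k):
    """Per-component sums of r, r*x and r*x^2."""
    n = [sum(r[j] for r in resp) for j in range(k)]
    s = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]
    q = [sum(r[j] * x * x for r, x in zip(resp, data)) for j in range(k)]
    return n, s, q


def m_step(n, s, q):
    """Weights, means and variances implied by the sufficient statistics."""
    total = sum(n)
    w = [nj / total for nj in n]
    mu = [sj / nj for sj, nj in zip(s, n)]
    var = [max(qj / nj - m * m, 1e-6) for qj, nj, m in zip(q, n, mu)]
    return w, mu, var


def em(data, w, mu, var, iters=50):
    """Standard EM for a 1-D Gaussian mixture, run on one block of data."""
    for _ in range(iters):
        resp = e_step(data, w, mu, var)
        w, mu, var = m_step(*sufficient_stats(data, resp, len(w)))
    return w, mu, var


def partial_update(old_est, n_old, new_data, iters=10):
    """ASSUMED update rule: summarize the old block by its estimate only."""
    w_old, mu_old, var_old = old_est
    k = len(w_old)
    # Pseudo sufficient statistics that n_old points would have produced
    # if they matched the old estimate exactly.
    n0 = [n_old * wj for wj in w_old]
    s0 = [nj * m for nj, m in zip(n0, mu_old)]
    q0 = [nj * (v + m * m) for nj, m, v in zip(n0, mu_old, var_old)]
    w, mu, var = old_est
    for _ in range(iters):
        resp = e_step(new_data, w, mu, var)
        n1, s1, q1 = sufficient_stats(new_data, resp, k)
        w, mu, var = m_step(
            [a + b for a, b in zip(n0, n1)],
            [a + b for a, b in zip(s0, s1)],
            [a + b for a, b in zip(q0, q1)],
        )
    return w, mu, var


def simulate_mixture(n, seed):
    """Hypothetical test data: 0.7 * N(0, 1) + 0.3 * N(4, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(4.0, 1.0) if rng.random() < 0.3 else rng.gauss(0.0, 1.0)
            for _ in range(n)]


data = simulate_mixture(2000, seed=0)
first, second = data[:1000], data[1000:]
start = ([0.5, 0.5], [min(first), max(first)], [1.0, 1.0])
first_est = em(first, *start)
updated_est = partial_update(first_est, len(first), second)
print("weights, means, variances:", updated_est)
```

Only the three parameter vectors and the count `n_old` are carried over from the first block, which is what makes a scheme of this kind attractive for data that cannot be stored or that arrive sequentially.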

Part II: SPCA - A New Feature Selection Procedure

Principal component analysis (PCA) is one of the most widely used data processing and dimension reduction techniques. However, in data mining, the number of variables (or features) in a data set may be too large for all of them to be loaded into a computer program at once for analysis. Even if they can be loaded all at once, too many nuisance features may mask the important features in the data. We focus on a subsampling procedure for selecting important features under the PCA framework. The main idea is to perform PCA on subsamples of features from the original, large data set and then combine the results using a ranking function to obtain the whole picture. The ranking function is designed to "summarize" the performance of each individual feature in PCA and to evaluate its global importance. We showed that the subsampling PCA (SPCA) procedure can recover the principal components of the data by identifying the important variables. It can therefore serve not only as a dimension reduction technique but also as a feature selection procedure. We are currently investigating the application of our SPCA procedure to gene expression data analysis.
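
The subsampling idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SPCA procedure itself: the ranking function is not specified in this summary, so the sketch uses a hypothetical one (the eigenvalue-weighted squared loading of each feature on the leading principal component, averaged over the subsamples in which the feature appears). The leading component is found by power iteration on the correlation matrix so that the example needs only the standard library; the function names and the simulated data are assumptions as well.

```python
import math
import random


def standardize(rows, cols):
    """Selected columns, each centered and scaled to unit sample variance."""
    n = len(rows)
    out = []
    for j in cols:
        col = [r[j] for r in rows]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / sd for v in col])
    return out


def correlation_matrix(cols):
    n, m = len(cols[0]), len(cols)
    return [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
             for j in range(m)] for i in range(m)]


def leading_component(mat, iters=100):
    """Leading eigenvalue and eigenvector by power iteration."""
    m = len(mat)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(mat[i][j] * v[j] for j in range(m)) for i in range(m))
    return eigval, v


def spca_scores(rows, subset_size, n_subsets, seed=0):
    """PCA on random feature subsets, combined by an ASSUMED ranking function."""
    rng = random.Random(seed)
    p = len(rows[0])
    total, count = [0.0] * p, [0] * p
    for _ in range(n_subsets):
        cols = rng.sample(range(p), subset_size)
        eigval, vec = leading_component(correlation_matrix(standardize(rows, cols)))
        for j, loading in zip(cols, vec):
            total[j] += eigval * loading ** 2  # hypothetical per-feature credit
            count[j] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


def simulate(n, p, important, seed=1):
    """Hypothetical data: important features share one latent factor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        rows.append([z + rng.gauss(0.0, 0.5) if j in important
                     else rng.gauss(0.0, 1.0) for j in range(p)])
    return rows


important = {2, 7, 11, 15, 18}
rows = simulate(n=200, p=20, important=important)
scores = spca_scores(rows, subset_size=8, n_subsets=100)
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("top-ranked features:", ranked[:5])
```

Each PCA here sees only 8 of the 20 features, which mirrors the setting where the full feature set cannot be analyzed at once; the per-feature scores are what tie the separate analyses back together.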

Other Research Work

Analyzed network traffic data with application to intrusion detection methods.

Analyzed medical gel imaging data, which involved image registration, segmentation, nonparametric modeling, and multiple testing.