Research Interests
Statistical Data Mining, Machine Learning, Feature Selection, Mixture Modeling;
High-dimensional Data Analysis (image analysis, array data analysis);
Computationally Intensive Statistical Methods, Bagging and Boosting, Resampling/Subsampling.
Dissertation Research: New Methods for Large Data Analysis
Part I: Adaptive Estimation for Mixture Parameters
Mixture models are often used to analyze data that arise from heterogeneous populations and
they have applications in many scientific disciplines. When data arrive in large quantities, or
sequentially over time, the standard EM algorithm either cannot be applied
to the whole data set to obtain parameter estimates of the mixture model, or
is not computationally efficient. New algorithms, called
Partial EM and Bayesian Partial EM, are proposed to address the problem of estimating mixture parameters for large
data sets and online data. The idea of Partial EM is to
first obtain the MLE by running EM on the first part of the data, and
then combine that parameter estimate (without revisiting the first part of the data) with the information
from the second part of the data to get an updated estimate. By coupling the new
estimators with the recent D-test for homogeneity of
finite mixture distributions, whose test statistic has a
closed-form expression in terms of the parameter estimators alone, a data
mining procedure could be developed, with a promising application in
intrusion detection.
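The two-stage idea described above can be illustrated with a minimal sketch. It is not the Partial EM algorithm itself, whose details are not given here. It assumes a univariate Gaussian mixture and a simple update rule in which the first-block estimate is converted back into expected sufficient statistics and pooled with the second block. All function names (e_step, suff_stats, partial_update, and so on) are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(x, w, mu, var):
    # Responsibilities of each component for each observation, shape (n, K).
    dens = w * normal_pdf(x[:, None], mu, var)
    return dens / dens.sum(axis=1, keepdims=True)

def suff_stats(x, r):
    # Expected counts, first moments and second moments per component.
    return r.sum(axis=0), r.T @ x, r.T @ (x ** 2)

def params_from_stats(s0, s1, s2):
    w = s0 / s0.sum()
    mu = s1 / s0
    var = s2 / s0 - mu ** 2
    return w, mu, var

def em(x, w, mu, var, n_iter=200):
    # Standard EM on a single block of data.
    for _ in range(n_iter):
        r = e_step(x, w, mu, var)
        w, mu, var = params_from_stats(*suff_stats(x, r))
    return w, mu, var

def stats_from_params(n, w, mu, var):
    # Assumption: summarize an earlier block of size n by the expected
    # sufficient statistics implied by its parameter estimate.
    s0 = n * w
    return s0, s0 * mu, s0 * (var + mu ** 2)

def partial_update(x_new, n_old, w, mu, var, n_iter=200):
    # Update the estimate using only the old parameters and the new block;
    # the old raw data are never touched.
    old = stats_from_params(n_old, w, mu, var)
    for _ in range(n_iter):
        r = e_step(x_new, w, mu, var)
        new = suff_stats(x_new, r)
        w, mu, var = params_from_stats(*(a + b for a, b in zip(old, new)))
    return w, mu, var
```

In use, one would call em on the first block with some starting values, then pass the resulting estimate and the first block's size to partial_update together with the second block. The pooling rule is only one plausible way to carry information forward without storing the first block.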
Part II: SPCA - A New Feature Selection Procedure
Principal component analysis
(PCA) is one of the most widely used data processing and dimension
reduction techniques. However, in data mining, the number of
variables (or features) in a data set may be too large to be loaded
all together into a computer program for analyses. Even if they can
be loaded all at once, too many nuisance features may mask important
features in the data. We focus on a subsampling procedure for
selecting important features under the PCA framework. The main idea
is to perform PCA on subsamples of features from the original, large
data set and then combine the results based on a ranking function to
obtain the whole picture. The ranking function is designed to
"summarize" the performance of each of the individual features in
PCA and evaluate their global importance. We showed that the subsampling
PCA (SPCA) procedure can recover the principal components of
the data by identifying the important variables. Therefore, it can
serve not only as a dimension reduction technique but also as a feature
selection procedure. We are currently investigating the application of our SPCA procedure to gene expression data analysis.
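The subsample-then-rank idea can be sketched as follows. This is an illustration under stated assumptions, not the SPCA procedure itself: the ranking function used here (squared loadings on the leading components, weighted by eigenvalue and averaged over the subsamples in which a feature appears) is a hypothetical stand-in, and the name subsampled_pca_scores and its parameters are invented for the example.

```python
import numpy as np

def subsampled_pca_scores(X, subset_size, n_rounds, n_components=1, seed=0):
    """Score each feature by its average contribution to the leading
    principal components over random feature subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # center each feature
    score_sum = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(n_rounds):
        # Draw a random subset of features small enough to handle at once.
        idx = rng.choice(p, size=subset_size, replace=False)
        cov = np.cov(Xc[:, idx], rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        top = np.argsort(eigvals)[::-1][:n_components]
        # Hypothetical ranking function: eigenvalue-weighted squared loadings.
        contrib = (eigvecs[:, top] ** 2) @ eigvals[top]
        score_sum[idx] += contrib
        counts[idx] += 1
    # Average over the rounds in which each feature was actually sampled.
    return score_sum / np.maximum(counts, 1)
```

Features with the highest scores would then be retained, and an ordinary PCA could be run on that reduced set. The subset size and number of rounds trade off computational cost against how often each feature is evaluated.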
Other Research Work
Analyzed network traffic data with application to intrusion detection methods.
Analyzed medical gel imaging data, involving image registration, segmentation, nonparametric modeling, and multiple testing.