January 21, 3:00-4:00pm, 796 COE,
Jason Ding,
Department of Computer Science,
Georgia State University
Imbalanced Data Learning and Diversified Ensemble Classifiers
Abstract:
Imbalanced data learning is one of the most important problems in machine learning and data mining, and it has attracted continuous attention in both academia and industry over the last decade. In this talk, I will introduce the binary version of the imbalanced data learning problem and present an effective ensemble learning framework. First, a formal definition of the imbalanced binary classification problem is introduced, and several real-world examples are provided to show its significance. Then, we thoroughly investigate current research trends in handling the imbalanced learning problem to provide a comprehensive overview of representative studies in this area. After discussing the advantages and weaknesses of existing learning methods, we propose a new effective ensemble framework: Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL). Our strategy combines three popular learning techniques: a) ensemble learning; b) artificial example generation; c) diversity construction by oppositional data re-labeling. As a meta-learner, DECIDL can use general supervised learning algorithms, such as support vector machines, decision trees, and neural networks, as the base learner to build effective ensemble committees. We compare the DECIDL framework with several existing ensemble imbalanced-learning frameworks, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost, on our newly developed benchmark data pool consisting of 30 highly skewed data sets. Extensive experiments with various base learners suggest that our DECIDL framework is comparable with other ensemble methods.
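A minimal sketch of the general strategy, assuming scikit-learn as the base-learner library. The artificial-example generation and oppositional re-labeling steps of DECIDL itself are only summarized in the abstract, so this illustrates a plain diversified committee built by under-sampling, not the DECIDL algorithm:

    # Hedged sketch: a diversified committee for imbalanced binary data.
    # NOT the DECIDL algorithm itself; it illustrates the ensemble strategy
    # of building many base learners on rebalanced samples.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_diversified_ensemble(X, y, n_members=15, seed=None):
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        members = []
        for _ in range(n_members):
            # Under-sample the majority class to the minority size, a common
            # way to obtain diverse committee members on skewed data.
            neg_sub = rng.choice(neg, size=len(pos), replace=False)
            idx = np.concatenate([pos, neg_sub])
            members.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return members

    def predict_majority(members, X):
        votes = np.mean([m.predict(X) for m in members], axis=0)
        return (votes >= 0.5).astype(int)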
December 2, 2:00-3:00pm, 796 COE,
Professor Hulin Wu,
Department of Biostatistics and Computational Biology,
University of Rochester School of Medicine and Dentistry
Two Cultures of Statistical Research: Statistical Inference for Empirical Models vs. Mechanism Models
Abstract:
Traditional statistical inference is usually based on the assumption of empirical models for the data, such as linear, nonlinear, nonparametric, and semiparametric models for continuous data, generalized linear models for binary or discrete data, and proportional hazards regression models for survival data.
Another class of statistical inference is based purely on algorithmic models, such as neural
nets and decision trees, that treat the real-world problem as a black box. Statistical research in
this mainstream culture tries to perform inference while minimizing the use of knowledge about the mechanism behind the data.
However, knowledge of the system under study and the data-generation mechanism, which can be described by mathematical models,
in particular dynamic models such as differential equations, is often available or partially available in the real world. Statistical inference and research for mechanism-based models are very sparse, but badly needed. Thus, a new culture of statistical research for mechanism models needs to be established. I will illustrate and outline the statistical research on mechanism-based differential equation models by our group and others, and its importance. Statistical methods and theory for differential equation models are illustrated with experimental data from infectious disease research, including HIV and influenza.
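As a worked illustration of inference for a mechanism model, here is a minimal sketch assuming SciPy and a toy viral-clearance equation dV/dt = -cV; the actual HIV and influenza models discussed in the talk are richer, and the data below are hypothetical:

    # Hedged sketch: least-squares fitting of a toy mechanism model
    # dV/dt = -c*V to noisy measurements, illustrating the workflow of
    # statistical inference for a differential-equation model.
    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    t_obs = np.array([0., 1., 2., 4., 7.])
    v_obs = np.array([5.0, 3.6, 2.7, 1.4, 0.5])   # hypothetical viral loads

    def solve_model(c, v0):
        sol = solve_ivp(lambda t, v: -c * v, (t_obs[0], t_obs[-1]),
                        [v0], t_eval=t_obs)
        return sol.y[0]

    def residuals(theta):
        c, v0 = theta
        return solve_model(c, v0) - v_obs

    fit = least_squares(residuals, x0=[0.3, 5.0])
    print("estimated clearance rate c:", fit.x[0])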
November 19, 2:00-3:00pm, 796 COE,
Professor Mengling Liu,
Division of Biostatistics, School of Medicine, New York University
Cox Regression Model with Time-Varying Coefficients in Nested Case-Control Studies
Abstract:
The nested case-control (NCC) design is a cost-effective sampling method to study the relationship between a
disease and its risk factors in epidemiologic studies. NCC data are commonly
analyzed using Thomas' partial likelihood approach under Cox's proportional hazards model with
constant covariate effects. In this talk, I will present an extension, the Cox regression model with time-varying coefficients,
to NCC studies, together with an estimation approach based on a kernel-weighted Thomas' partial likelihood.
Both simulation studies and an application to the NCC study of breast cancer in the New York University
Women's Health Study are used to illustrate the usefulness of the proposed methods. Furthermore,
I will discuss another extension, the Cox regression with nonlinear covariate effects, and issues regarding different
techniques to handle these two different models in NCC studies.
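A schematic form of the kernel-weighted partial likelihood, with notation assumed rather than taken from the paper: localizing Thomas' partial likelihood around a time t with a kernel K_h gives

    \[
      \ell_t(\beta) \;=\; \sum_{i:\,\delta_i = 1} K_h(t_i - t)
      \left\{ \beta^\top Z_i(t_i)
        - \log \sum_{j \in \tilde R(t_i)} \exp\!\big(\beta^\top Z_j(t_i)\big) \right\},
    \]

where the sum is over observed failure times, \tilde R(t_i) is the case plus its sampled controls, and \hat\beta(t) maximizes \ell_t(\beta) at each t.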
November 12, 3:00-4:00pm, 796 COE,
Professor Yi Zhao,
Department of Marketing,
Georgia State University
Consumer Learning in a Turbulent Market Environment: Modeling Consumer
Choice Dynamics in the Wake of a Product-Harm Crisis
Abstract:
This paper empirically studies consumer choice behavior in the wake of a product-harm crisis. A product-harm crisis creates consumer uncertainty about product quality. In this paper, the authors develop a model that explicitly incorporates the impact of such uncertainty on consumer behavior. The authors assume that consumers are uncertain about the mean product quality level and learn about product quality through the signals contained in use experience and the product-harm crisis. The authors also assume that consumers are uncertain about the precision of the signals in conveying product quality and update their perception of the precision of such signals over time upon their arrival. To study the possible impact of a product-harm crisis on consumers'
sensitivities to price, quality, and risk, the authors also allow these model parameters to differ before, during, and after the product-harm crisis. The model is estimated by Bayesian methods on a scanner panel dataset that includes consumer purchase history before, during, and after a product-harm crisis that hit the peanut butter division of Kraft Foods Australia in June 1996. The proposed model fits the data better than the standard consumer learning model in marketing, which assumes that consumers are uncertain about the product quality level but that the precision of information in conveying product quality is known to consumers. This study also provides substantive insights into consumers' behavioral choice responses to a product-harm crisis. Finally, the authors conduct counterfactual experiments based on the estimation results and provide insights to managers on crisis management.
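The learning mechanism described above can be sketched with conjugate Normal-Gamma updating, in which both the mean quality and the signal precision are uncertain. This is a stripped-down stand-in for the authors' full choice model, with hypothetical prior values:

    # Hedged sketch: conjugate Normal-Gamma updating of beliefs about mean
    # quality when the signal precision is also unknown -- the learning
    # mechanism the abstract describes, without the full choice model.
    def update_beliefs(m, k, a, b, signal):
        """Posterior of (quality mean, signal precision) after one signal.
        Prior: quality | precision ~ N(m, 1/(k*precision)),
               precision ~ Gamma(a, b)."""
        m_new = (k * m + signal) / (k + 1)
        k_new = k + 1
        a_new = a + 0.5
        b_new = b + 0.5 * k * (signal - m) ** 2 / (k + 1)
        return m_new, k_new, a_new, b_new

    beliefs = (0.0, 1.0, 2.0, 2.0)      # diffuse hypothetical prior
    for s in [0.4, -1.2, 0.1]:          # use-experience / crisis signals
        beliefs = update_beliefs(*beliefs, s)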
November 5, 2:00-3:00pm, 796 COE,
Professor Wei Wu,
Department of Statistics,
Florida State University
Towards Summary Statistics in the Function Space of
Neural Spike Trains
Abstract:
Statistical inference is essential in analyzing neural
spike trains in computational neuroscience. Current approaches have
followed a general inference paradigm where a parametric probability
model is often used to characterize the temporal evolution of the
underlying stochastic processes. To capture the overall variability and
distribution in the space of the spike trains directly, we focus on a
data-driven approach where statistics are defined and computed in the
function space in which individual spike trains are viewed as points. To
this end, we first develop a parametrized family of metrics that
takes into account different warpings in the time domain and generalizes
several currently used spike train distances. These new metrics are
essentially penalized L^p norms, involving appropriate functions of
spike trains, with penalties associated with time-warping. In
particular, when p = 2, we present an efficient recursive algorithm,
termed Matching-Minimization algorithm, to compute the sample mean of a
set of spike trains with arbitrary numbers of spikes. The proposed
metrics, as well as the mean spike train idea, are demonstrated using
simulations as well as an experimental recording from the motor cortex.
It is found that all these methods achieve desirable performance and the
results support the success of this novel framework.
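A simplified sketch of the function-space view, assuming NumPy: spike trains are smoothed into functions and compared with an L^p distance. The elastic time-warping and its penalty, central to the talk's metrics, are omitted here for brevity:

    # Hedged sketch: an L^p distance between two spike trains after Gaussian
    # smoothing. The metrics in the talk additionally optimize over time
    # warpings with a penalty; that elastic step is omitted here.
    import numpy as np

    def smooth(spikes, grid, sigma=0.01):
        # superpose a Gaussian bump at each spike time
        return sum(np.exp(-(grid - s) ** 2 / (2 * sigma ** 2)) for s in spikes)

    def lp_distance(train1, train2, p=2, T=1.0, n=2000):
        grid = np.linspace(0.0, T, n)
        f, g = smooth(train1, grid), smooth(train2, grid)
        return np.trapz(np.abs(f - g) ** p, grid) ** (1.0 / p)

    d = lp_distance([0.10, 0.35, 0.80], [0.12, 0.40, 0.82])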
October 29, 3:00-4:00pm, 796 COE,
Professor Jing Wang,
Department of Mathematics, Statistics,
and Computer Science,
University of Illinois at Chicago
On Determination of Linear Components in Additive Models
Abstract:
Additive models have been widely used in nonparametric
regression, mainly due to their ability to avoid the problem
of the "curse of dimensionality". When some of the additive components
are linear, the model can be further simplified and higher convergence
rates can
be achieved for the estimation of these linear components. In this
paper, we propose a testing procedure for the determination of linear
components in
nonparametric additive models. We adopt the penalized spline approach
for modelling the nonparametric functions, and the test is a
Chi-square-type test based on finite-order penalized spline estimators. The limiting
behavior of the test statistic is investigated. To obtain the critical
values for finite sample problems, we use resampling techniques to
establish a bootstrap test. The performance of the proposed tests is
studied through
simulation experiments and a real-data example.
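A rough sketch of the idea, assuming NumPy and a truncated-power spline basis: compare a linear fit with a penalized-spline fit for one component and form a Chi-square-type statistic. This is a simplified stand-in, not the paper's exact test, and critical values would come from the bootstrap as described:

    # Hedged sketch: does a penalized-spline fit improve on a linear fit?
    import numpy as np

    def spline_basis(x, n_knots=8):
        # cubic truncated-power basis: 1, x, x^2, x^3, (x - k)_+^3
        knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
        cols = [np.ones_like(x), x, x ** 2, x ** 3]
        cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
        return np.column_stack(cols)

    def linearity_stat(x, y, lam=1.0):
        B = spline_basis(x)
        D = np.eye(B.shape[1])
        D[:2, :2] = 0                        # do not penalize 1 and x
        beta = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
        rss_spline = np.sum((y - B @ beta) ** 2)
        X = np.column_stack([np.ones_like(x), x])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        rss_lin = np.sum((y - X @ b) ** 2)
        # large values suggest the component is not linear; calibrate by bootstrap
        return (rss_lin - rss_spline) / (rss_spline / len(y))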
October 22, 3:00-4:00pm, 796 COE,
Professor Rusty Tchernis,
Department of Economics, Georgia State University
On the Estimation of Selection Models when Participation is Endogenous and
Misclassified
Abstract:
This paper presents a Bayesian analysis of the endogenous treatment model
with misclassified treatment participation. Our estimation procedure utilizes a
combination of data augmentation, Gibbs
sampling, and Metropolis-Hastings to obtain estimates of the
misclassification probabilities and the treatment effect. Simulations demonstrate that the proposed Bayesian
estimator accurately recovers the treatment effect in the presence of misclassification and
endogeneity.
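One piece of such a sampler can be sketched concretely: once the latent true participation has been imputed in the data-augmentation step, the misclassification probabilities have conjugate Beta updates. A minimal sketch, assuming NumPy and Beta(1, 1) priors:

    # Hedged sketch: one Gibbs step for the misclassification probabilities.
    # Given latent true participation t_star (imputed via data augmentation)
    # and observed participation t_obs, the false-positive and false-negative
    # rates have conjugate Beta updates under Beta priors.
    import numpy as np

    def sample_misclass_probs(t_obs, t_star, a0=1.0, b0=1.0, seed=None):
        rng = np.random.default_rng(seed)
        fp = rng.beta(a0 + np.sum((t_obs == 1) & (t_star == 0)),
                      b0 + np.sum((t_obs == 0) & (t_star == 0)))
        fn = rng.beta(a0 + np.sum((t_obs == 0) & (t_star == 1)),
                      b0 + np.sum((t_obs == 1) & (t_star == 1)))
        return fp, fn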
October 15, 3:00-4:00pm, 796 COE,
Professor Nelson Chen,
Department of Biostatistics and Bioinformatics & Biostatistics Shared Core of Winship Cancer Institute, Emory University
A Novel Toxicity Scoring System Treating Toxicity Response as a
Quasi-Continuous Variable in Phase I Clinical Trials
Abstract:
In most current Phase I designs, including the standard 3+3 design, the Continual
Reassessment Method (CRM), and Escalation With Overdose Control (EWOC),
the toxicity response of a patient is treated coarsely as a binary indicator (yes
vs. no) of dose-limiting toxicity (DLT), although a patient usually has multiple
toxicities and much useful toxicity information is discarded. For the
first time in the literature, we establish a novel toxicity scoring system
to treat toxicity response as a quasi-continuous variable and utilize all
toxicities of patients. Our toxicity scoring system consists of generally
accepted and objective components (a logistic function, grade and type of
toxicity, and whether the toxicity is DLT) so that it is relatively
objective. Our system can successfully transform current Phase I designs
treating toxicity response as a binary indicator of DLT to new designs
treating toxicity response as a quasi-continuous variable by replacing the
binary indicator of DLT and the Target Toxicity Level (TTL) of current
designs with a Normalized Equivalent Toxicity Score (NETS) and a Target NETS
(TNETS), respectively. The transformed designs improve the accuracy of the
Maximum Tolerated Dose (MTD) estimate and the efficiency of the trial. As an example, we
couple our system with EWOC to develop a new design called Escalation With
Overdose Control using Normalized Equivalent Toxicity Score (EWOC-NETS).
Simulation studies and an application to real trial data demonstrate that
EWOC-NETS can treat toxicity response as a quasi-continuous variable, fully
utilize all toxicity information, and improve the accuracy of the MTD estimate and the
efficiency of Phase I trials. User-friendly software implementing EWOC-NETS is under
development and will be made available.
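Since the abstract names the ingredients of the score (a logistic function, the grade and type of each toxicity, and the DLT flag) but not the exact formula, the following is a purely hypothetical scoring function in the spirit of NETS, with illustrative weights:

    # Hedged sketch: a HYPOTHETICAL toxicity score in the spirit of NETS.
    # The weights and the logistic shift below are illustrative only.
    import math

    def nets_score(toxicities, dlt_weight=2.0):
        """toxicities: list of (grade 1-5, type_weight, is_dlt) tuples."""
        raw = sum(grade * w * (dlt_weight if is_dlt else 1.0)
                  for grade, w, is_dlt in toxicities)
        return 1.0 / (1.0 + math.exp(-(raw - 5.0)))  # squash to (0, 1)

    # one grade-3 DLT plus one grade-2 non-DLT toxicity
    score = nets_score([(3, 1.0, True), (2, 0.5, False)])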
October 8, 2:00-3:00pm, 796 COE,
Professor Wenbin Lu,
Department of Statistics, North Carolina State University
Variable Selection for Linear Transformation Models
Abstract:
Semi-parametric linear transformation models have received much attention due to their high flexibility in modeling survival data.
However, the problem of variable selection for linear transformation models has been less studied,
partially because a convenient loss function is not readily available under this context.
In this talk, we propose a simple yet powerful approach to achieve both sparse and consistent estimation
for linear transformation models. The main idea is to derive a profiled score from the martingale-based estimating equations
of Chen et al. (2001), construct a loss function based on the profiled score and its variance, and then minimize the loss subject
to some shrinkage penalty. Under regularity conditions, we have shown that the resulting estimator is consistent for
both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and
can achieve higher efficiency than that yielded by the estimating equations. For computation, we suggest a one-step
approximation algorithm that takes advantage of LARS and builds the entire solution path efficiently.
Performance of the new procedure is illustrated through numerous simulations and real data applications.
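The one-step idea can be sketched as a least-squares approximation: turn an initial estimator and its covariance into pseudo-data via a Cholesky factor, then run LARS/lasso on the resulting least-squares problem. A minimal sketch, assuming scikit-learn, with beta_hat and V standing for the output of the martingale estimating equations:

    # Hedged sketch of the one-step idea: approximate the loss
    # (b - beta_hat)' V^{-1} (b - beta_hat) by pseudo-data built from a
    # Cholesky factor of V^{-1}, then apply LARS/lasso.
    import numpy as np
    from sklearn.linear_model import LassoLars

    def one_step_selection(beta_hat, V, alpha=0.05):
        L = np.linalg.cholesky(np.linalg.inv(V))   # V^{-1} = L @ L.T
        X_pseudo = L.T                             # ||L.T @ (b - beta_hat)||^2
        y_pseudo = L.T @ beta_hat
        model = LassoLars(alpha=alpha, fit_intercept=False)
        model.fit(X_pseudo, y_pseudo)
        return model.coef_                         # sparse estimate of beta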
September 24, 1:30-3:00pm, 796 COE,
Professor Jun Han,
Department of Mathematics and Statistics, Georgia State University
Distribution-free Estimators of Variance Components for Multivariate Linear Mixed Model
Abstract:
Non-iterative, distribution-free, unbiased estimators of variance components, including the minimum norm quadratic
unbiased estimator and the method-of-moments estimator, are derived for the multivariate linear mixed model.
A general cluster-wise covariance and a same-member-only response-wise covariance are assumed.
Some properties of the proposed estimators such as unbiasedness and existence are discussed, and related
computational issues are addressed. A simulation study is conducted to compare the proposed estimators with Gaussian
(restricted) maximum likelihood estimators in terms of bias and mean squared error. An application to gene expression family
data is presented to illustrate the proposed estimators.
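The simplest univariate instance of such distribution-free estimation is the classical ANOVA method-of-moments estimator for a balanced one-way random-effects model; the talk treats the multivariate generalization. A minimal sketch in NumPy:

    # Hedged sketch: method-of-moments (ANOVA) variance-component estimates
    # for a balanced one-way random-effects model, the simplest univariate
    # instance of the distribution-free idea.
    import numpy as np

    def anova_variance_components(y):
        """y: (clusters, members) array, balanced design."""
        k, n = y.shape
        cluster_means = y.mean(axis=1)
        msw = ((y - cluster_means[:, None]) ** 2).sum() / (k * (n - 1))
        msb = n * ((cluster_means - y.mean()) ** 2).sum() / (k - 1)
        return {"within": msw, "between": max((msb - msw) / n, 0.0)}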
September 17, 2:00-3:00pm, 654 COE,
Professor Ying Guo,
Department of Biostatistics and Bioinformatics, Emory University
A general probabilistic model for group independent component analysis and its estimation methods
Abstract:
Independent component analysis (ICA) has become an important tool for
analyzing data from functional magnetic resonance imaging (fMRI) studies. ICA has been successfully applied to
single-subject fMRI data. The extension of ICA to group inferences in neuroimaging studies, however, is challenging due to the
unavailability of a pre-specified group design matrix and the uncertainty in between-subjects variability in fMRI data.
We present a general probabilistic ICA (PICA) model that can accommodate varying group structures of multi-subject spatio-temporal processes.
An advantage of the proposed model is that it can flexibly model various types of group structures in different underlying neural source
signals and under different experimental conditions in fMRI studies. A maximum likelihood method is used for estimating this general group ICA model.
We propose two EM algorithms to obtain the ML estimates. The first method is an exact EM algorithm which provides an exact E-step and an explicit noniterative M-step.
The second is a variational approximation EM algorithm, which is computationally more efficient than the exact EM.
We conduct simulation studies to evaluate the performance of the proposed methods. An fMRI data example is used to illustrate the application of the proposed methods.
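For contrast with the model-based approach, a common baseline is temporal-concatenation group ICA, sketched below with scikit-learn's FastICA. This is not the probabilistic model or the EM algorithms of the talk, only the basic group extension of single-subject ICA:

    # Hedged sketch: temporal-concatenation group ICA with FastICA, a common
    # baseline for multi-subject fMRI-like data (not the PICA model or the
    # EM algorithms presented in the talk).
    import numpy as np
    from sklearn.decomposition import FastICA

    def group_ica(subject_data, n_components=5):
        """subject_data: list of (timepoints_i, voxels) arrays."""
        stacked = np.vstack(subject_data)          # concatenate over time
        ica = FastICA(n_components=n_components, random_state=0)
        time_courses = ica.fit_transform(stacked)  # independent time courses
        spatial_maps = ica.mixing_.T               # (components, voxels) loadings
        return spatial_maps, time_courses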
September 3, 2:00-3:00pm, 796 COE,
Professor Ruiyan Luo,
Department of Mathematics and Statistics, Georgia State University
Bayesian Hierarchical Models in Proteomics Studies
Abstract:
Data produced by complex biological processes are often not amenable to simple statistical methods.
Bayesian approaches provide a natural framework to untangle such problems through incorporation of
our understanding of biological processes and the data generation process. In this talk, I will describe
the development of Bayesian hierarchical models in addressing the following two Proteomics problems.
iTRAQ data. iTRAQ (isobaric Tags for Relative and Absolute Quantitation) is a technique
that allows simultaneous quantitation of proteins in multiple samples. However, ignoring the common
nonrandom missingness will lead to biased estimation of protein expression levels. To reduce such bias,
we construct a Bayesian hierarchical model-based method and model the nonrandom missingness of
peptide data with a logistic regression, which relates the missingness probability for a peptide with
the expression level of the protein that produces this peptide. We assume that the measured peptide
intensities are affected by both protein expression levels and peptide specific effects. The values of these
two effects across experiments are modeled as random effects. Simulation results suggest that such
estimates have smaller bias than those estimated from ANOVA models or fold changes.
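A small simulation makes the nonrandom-missingness mechanism concrete: intensity combines a protein effect and a peptide effect, and the probability that a peptide is missing is a logistic function of the protein's expression level. All parameter values below are illustrative:

    # Hedged sketch of the generative view in the abstract: peptide intensity
    # combines a protein effect and a peptide effect, and missingness follows
    # a logistic function of protein expression. Illustrative parameters only.
    import numpy as np

    rng = np.random.default_rng(0)
    protein_level = rng.normal(0, 1, size=200)
    peptide_effect = rng.normal(0, 0.5, size=200)
    intensity = protein_level + peptide_effect + rng.normal(0, 0.3, size=200)
    p_missing = 1 / (1 + np.exp(2.0 * protein_level))  # low level -> missing
    observed = np.where(rng.random(200) < p_missing, np.nan, intensity)
    # naive averages over observed peptides are biased upward; the Bayesian
    # hierarchical model corrects for this nonrandom missingness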
Pathway inference. Simultaneous measurements of multiple protein activities at the single-cell
level provide much richer information about signaling networks. With measurements of protein activities
under different experimental conditions, we propose a Bayesian hierarchical modeling framework for
signaling network reconstruction. We model the existence of an association between two proteins both
at the overall level across all experiments and at each individual experimental level, from which we infer
the pairs of proteins that are associated and their causal relations. This approach can effectively pool
information from different interventional experiments. Simulation results demonstrate the superiority
of the hierarchical approach.