Research Projects:
Discovery of Error-Tolerant Patterns from Binary and Real-valued Data
Because traditional frequent pattern mining algorithms work only with
binary (Boolean) attributes, real-valued attributes must first be
transformed into binary attributes, which often results in loss of
information. We developed a novel Error-Tolerant Frequent Itemset (ETFI)
algorithm for binary as well as real-valued data, which can sequentially
discover all ETFIs in a bottom-up fashion from any real-valued data set.
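To illustrate the information loss mentioned above, here is a minimal sketch of threshold-based binarization. The `binarize` helper, the threshold value, and the sample rows are assumptions made for illustration; they are not taken from the ETFI work.

```python
def binarize(row, threshold=0.5):
    """Map each real value to 1 if it exceeds the threshold, else 0.

    The threshold of 0.5 is an arbitrary choice for this example.
    """
    return [1 if value > threshold else 0 for value in row]


# Two hypothetical rows with very different magnitudes.
weak = [0.6, 0.55, 0.7]
strong = [9.8, 7.2, 8.5]

# After binarization the two rows are indistinguishable, so any
# pattern miner working on the binary data cannot tell them apart.
print(binarize(weak))
print(binarize(strong))
print(binarize(weak) == binarize(strong))
```

Both rows collapse to the same binary vector, which is the kind of lost detail that motivates mining the real-valued data directly.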
Quantitative Evaluation of Error-Tolerant Pattern Mining Algorithms
Traditional association
mining algorithms use a strict
definition of support that requires every item in a frequent itemset
to occur in each supporting transaction. In real-life data sets, this
limits the recovery of frequent patterns as they are fragmented due to
random noise and other errors in the data. We implemented a suite of
algorithms to discover approximate frequent itemsets in the presence
of noise.
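The difference between strict support and a relaxed, error-tolerant notion of support can be sketched as follows. This is only one simple relaxation (allowing a fixed number of missing items per transaction); the function names and the `max_missing` parameter are hypothetical and do not correspond to any specific algorithm in the suite.

```python
def strict_support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def tolerant_support(transactions, itemset, max_missing=1):
    """Fraction of transactions missing at most `max_missing` items.

    `max_missing` is an illustrative tolerance parameter.
    """
    itemset = set(itemset)
    hits = sum(1 for t in transactions if len(itemset - set(t)) <= max_missing)
    return hits / len(transactions)


# Toy data: the pattern {a, b, c} is fragmented by dropped items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},        # "c" missing
    {"a", "c"},        # "b" missing
    {"b", "c", "d"},   # "a" missing
    {"d"},
]
pattern = {"a", "b", "c"}

print(strict_support(transactions, pattern))    # 0.2
print(tolerant_support(transactions, pattern))  # 0.8
```

Under the strict definition only one of the five transactions supports the pattern, while tolerating a single missing item recovers four of them.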
Pattern Mining based Integrative Biomarker Discovery Using Gene Expression and Protein-protein Interaction Network
Most complex biological problems, such as biomarker discovery and protein
function prediction, require more information than any individual
biological data source provides. For example, biomarkers or functional
modules identified using information from both gene expression and
protein-protein interaction data are more reliable and biologically
plausible than those obtained from individual data sources. This is
because both gene expression data and protein interaction data are noisy,
and therefore a set of genes that are co-expressed and also physically
interact with each other is more likely to be significant and biologically
relevant. We developed an association mining based framework to
perform an integrated analysis of microarray gene-expression data and
protein-protein interaction data in order to efficiently discover
active sub-network based biomarkers.
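The idea of requiring both co-expression and physical interaction can be sketched with a simple check on a candidate gene set. This is not the association mining framework itself: the Pearson-correlation criterion, the `min_corr` threshold, the connectivity test, and the gene names are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def is_coexpressed(genes, expression, min_corr=0.8):
    """True if every pair of genes correlates at least `min_corr` (assumed threshold)."""
    return all(
        pearson(expression[g], expression[h]) >= min_corr
        for g, h in combinations(sorted(genes), 2)
    )


def is_connected(genes, ppi_edges):
    """True if the genes form one connected component of the interaction graph."""
    genes = set(genes)
    adjacency = {g: set() for g in genes}
    for u, v in ppi_edges:
        if u in genes and v in genes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(genes))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == genes


def is_candidate_module(genes, expression, ppi_edges):
    """Keep a gene set only if it passes both data sources."""
    return is_coexpressed(genes, expression) and is_connected(genes, ppi_edges)


# Hypothetical expression profiles and interactions.
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.5],
    "G3": [4.0, 3.0, 2.0, 1.0],
}
ppi_edges = [("G1", "G2"), ("G2", "G3")]

print(is_candidate_module({"G1", "G2"}, expression, ppi_edges))
print(is_candidate_module({"G1", "G2", "G3"}, expression, ppi_edges))
```

In this toy data, G1 and G2 pass both tests, while adding the anti-correlated G3 fails the co-expression test even though the three genes are connected in the interaction graph.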
Discovery of Quality Markers of Colonoscopy for Detection of Colorectal Cancer: Characterization of Missed Neoplasia (In Collaboration with Mayo Clinic Rochester)
Colonoscopy is the most accepted screening
method for the detection of
colorectal cancer or its precursor lesions, colorectal polyps. Although
colonoscopy has contributed to a decline in the number of colorectal
cancer-related deaths, not all cancers or large polyps are detected at the
time of
colonoscopy. Hence, it is important to develop methods to understand
and
quantify the cause of these omissions. In the past, researchers have
studied
the effect of various factors (mostly one at a time) in predicting
adenomas
during colonoscopy. For example, recently it was shown that the
endoscopist performing
the procedure is a more powerful predictor than age or gender of the
patient,
which were earlier considered to be the most powerful predictors, in
detecting
adenomas during colonoscopy. The goal of our data mining effort is to
systematically identify all the factors (not just the known ones) and, more
importantly, combinations of them that may result in missed adenomas
and polyps
during colonoscopy.
Discovery and Characteristics of Patients with Idiosyncratic Drug-Induced Liver Injury (In Collaboration with Mayo Clinic Rochester)
The most
common reason for new drugs not passing the final clinical
trials for FDA approval is liver injury. Idiosyncratic drug-induced
liver injury, as its name implies, cannot at present be predicted. Thus
identifying which patients are more likely to develop an idiosyncratic
drug
reaction would be a major scientific achievement. It is hypothesized
that there
are identifiable clinical, environmental, and genetic differences
between those
patients who have had such a reaction and those who have not. The goal
is to
develop data mining based techniques to identify patients with
drug-induced
injury and then determine whether these patients have one or more
inherited or
acquired genome-based susceptibility factors that result in
drug-induced liver
injury.
Data Mining for Prediction of Degree of Liver Fibrosis (In Collaboration with Mayo Clinic Rochester)
Liver cirrhosis is a common
lethal disease that is most often caused by
alcoholism and viral hepatitis. Life expectancy is greatly influenced
by the
degree of liver fibrosis, which in most fibrosis classification systems is
scored from
F0 (no fibrosis) to F4 (liver cirrhosis). Currently, the most accurate
way of
measuring fibrosis is by liver biopsy. However, due to its invasive
nature,
liver biopsy is not performed frequently, and for this reason,
physicians rely
on less accurate laboratory tests, with their inherent deficiencies.
Therefore,
the goal of this work is to develop and apply data mining based
techniques to
first identify the right features (combination of commonly available
laboratory
tests, obtained over time during routine patient care), and then use
them to
predict liver cirrhosis and hepatocellular cancer at an early stage. It
is
believed that such techniques hold the promise of empowering physicians
to
improve diagnostic processes without the need for invasive procedures.
Unsupervised Techniques for Finding Overlapping Co-clusters in the Data
Overlapping co-clustering is a clustering problem of considerable use in
several real-life domains, such as gene expression analysis, document
clustering, and movie recommender systems, where overlapping co-clusters
are desired. Most current approaches deal with either the co-clustering
aspect or the overlapping aspect of this problem. We explored two
unsupervised learning approaches, a frequent pattern mining based
approach and an alternate minimization based approach, to generate
overlapping co-clusters in a given data matrix. Our primary focus is to
apply these approaches to gene expression data. We performed experiments
on both synthetic and real gene-expression datasets to show the
correctness of the algorithms and the applicability of the proposed
approaches in the domain of microarray data analysis.