DIMACS Working Group on Data Mining and Epidemiology

Second Meeting, March 18-19, 2004
DIMACS Center, CoRE Building, Rutgers University

Organizers:
James Abello, DIMACS, abello@dimacs.rutgers.edu
Graham Cormode, DIMACS, graham@dimacs.rutgers.edu
Kenton Morgan, University of Liverpool, k.l.morgan@liverpool.ac.uk
David Ozonoff, Boston University, dozonoff@bu.edu

Presented under the auspices of the Special Focus on Computational and Mathematical Epidemiology.


James Abello, DIMACS, Rutgers University

Web page: http://www.mgvis.com

Title: Graph Theoretical Methods in Epidemiology

We will discuss our experiences in using graph theoretical methods for the analysis of SEER Cancer Data. The four main issues that we have been wrestling with are: efficient identification of essential data attributes, data-driven partition methods, patient clustering, and visual data navigation. The lattice of bicliques, associated with a graph theoretical interpretation of the data records, plays a central role in our investigations. This lattice is identical to the concept lattice. One of our aims is to identify the effect that certain basic semantic operations on the data have on the concept lattice. These operations allow us to focus on selected regions of the concept lattice without generating the full lattice, which is a crucial finding. We will describe the status of our current prototype for SEER Cancer Data exploration.

Portions of this work have been done in cooperation with Frank V. Ham, Lance Miller, Dave Millman, and Alex Pogel.
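To make the central object concrete: a formal concept of a binary patient-attribute relation is a maximal biclique, i.e., a pair (extent, intent) closed under the derivation operators. The following sketch, with an entirely hypothetical toy context (the attribute names are invented, not actual SEER fields), enumerates all concepts by brute force; real SEER data would require the focused, partial-generation methods the abstract describes:

```python
from itertools import combinations

# Toy binary context: patient -> set of attributes (hypothetical fields).
context = {
    "p1": {"male", "stage_II"},
    "p2": {"male", "stage_II", "surgery"},
    "p3": {"female", "stage_II", "surgery"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes common to all objects in objs."""
    if not objs:
        return set(attributes)
    return set.intersection(*(context[o] for o in objs))

def concepts():
    """Enumerate all formal concepts (extent, intent) by brute force.
    Each concept is a maximal biclique of the object-attribute relation;
    a concept is determined by its extent, so we deduplicate on it."""
    seen, result = set(), []
    for r in range(len(attributes) + 1):
        for combo in combinations(sorted(attributes), r):
            e = extent(set(combo))
            i = intent(e)
            if frozenset(e) not in seen:
                seen.add(frozenset(e))
                result.append((e, i))
    return result

for e, i in concepts():
    print(sorted(e), sorted(i))
```

The brute-force loop is exponential in the number of attributes, which is exactly why the semantic operations mentioned above, which avoid generating the full lattice, matter in practice.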

Michael Cook, Merck Research Laboratories, Epidemiology Department

Title: Case-Control Surveillance Methods

The aim of this presentation is to describe the general principles, methods, and limitations of using case-control surveillance to detect previously unrecognized adverse events associated with medication use. Pharmacovigilance activities traditionally detect signals by analyzing large cross-tabulations of spontaneously reported drug-event combinations in regulatory databases. Unfortunately, variation in physician reporting of specific adverse events over time for different drugs limits the usefulness of these data for signal detection. Large multipurpose relational databases with date-stamped historical information on drug use, health outcomes, and relevant confounding variables represent an alternative resource for signal detection. Cohort studies comparing the incidence rate of an adverse event among individuals exposed to a drug relative to all unexposed individuals in the general population are computationally demanding for hypothesis-screening purposes, in view of the large population size needed to study rare events (typically in the millions), the large number of drug-disease combinations possible in a database, and the time-dependence of most drug exposures. Case-control sampling methods represent a computationally efficient means to estimate the incidence rate ratio with a minimum amount of data retrieval. A case-control study of childhood neuroblastoma and maternal medication use during pregnancy will be presented to illustrate how case-control surveillance can be implemented using only the macro language, SQL, and the statistical procedure for performing conditional logistic regression in SAS.
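As a rough illustration of why case-control sampling is computationally attractive, the sketch below (entirely simulated data, in Python rather than SAS, with made-up rates) estimates a rate ratio from a few sampled controls instead of the full cohort; under the rare-disease assumption the exposure odds ratio approximates the incidence rate ratio:

```python
import random

random.seed(0)

# Hypothetical cohort: each person has an exposure flag and case status.
cohort = [{"exposed": random.random() < 0.2} for _ in range(100000)]
for person in cohort:
    # In this simulation, exposure triples a rare baseline event rate.
    rate = 0.003 if person["exposed"] else 0.001
    person["case"] = random.random() < rate

cases = [p for p in cohort if p["case"]]
noncases = [p for p in cohort if not p["case"]]

# Case-control sampling: draw 4 controls per case rather than tabulating
# the whole cohort, then estimate the rate ratio by the exposure odds ratio.
controls = random.sample(noncases, 4 * len(cases))

def odds_ratio(cases, controls):
    a = sum(p["exposed"] for p in cases)     # exposed cases
    b = len(cases) - a                       # unexposed cases
    c = sum(p["exposed"] for p in controls)  # exposed controls
    d = len(controls) - c                    # unexposed controls
    return (a * d) / (b * c)

# With rare outcomes, the estimate typically lands near the simulated
# rate ratio of 3, using only a fraction of the data.
print(round(odds_ratio(cases, controls), 2))
```

In a real implementation the controls would be matched (e.g., on age and calendar time) and the odds ratio estimated by conditional logistic regression, as the abstract describes.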

A. Lawrence Gould and Peter K. Honig, Merck Research Laboratories

Title: Perspectives on Automated Methods for Pharmacovigilance Signal Detection

Proportional reporting ratios (PRRs) commonly used to identify drug-event associations reflecting possible toxicity in spontaneous reporting system (SRS) databases will have low precision if there are few reports. Recently described Bayesian methods provide a statistically well-founded way to incorporate this uncertainty into evaluations of the likelihood that a particular finding represents a potential signal or just noise. For many reasons, reports from SRS databases generally cannot be used to establish toxicity relationships; at best they identify potential toxicity issues that must be confirmed by detailed clinical and epidemiological follow-up. The statistical procedures make the identification process more effective and efficient, but they do not establish causality. This presentation illustrates a reasonably predictive and robust approach to the evaluation of data in SRS databases that has potential utility for evaluating drug safety post-marketing. At least for the example considered here, potential toxicities that could be identified on the basis of several years of experience usually could be identified on the basis of data received early in the life of the drug. In addition, the relationships that seemed apparent from reports received in the first few years also persisted as more information was received. Quantitative pharmacovigilance methods may be useful for the public health purpose of finding new apparent associations between drugs and combinations of adverse events as well as for ascertaining the adverse-event reporting profile corresponding to a single drug or a class of drugs. Even a small number of event reports may signal a potential association between a drug and the event if the event is rarely reported with other drugs. The key to having useful SRS databases is having dictionaries of names of events and drugs that contain as few synonyms as possible. Progress is being made in this regard, but much remains to be done.
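For concreteness, the PRR mentioned above can be computed from the standard 2x2 table of report counts; the counts in this sketch are invented, and the Bayesian methods described in the talk go further by shrinking such ratios toward the null when counts are small:

```python
# 2x2 table of spontaneous reports for a given (drug, event) pair:
#                     event of interest   all other events
# drug of interest           a                   b
# all other drugs            c                   d
def prr(a, b, c, d):
    """Proportional reporting ratio: the proportion of the drug's reports
    that mention the event, relative to the same proportion among
    reports for all other drugs."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: 20 of 1,000 reports for the drug mention the event,
# versus 100 of 100,000 reports for all other drugs.
print(round(prr(20, 980, 100, 99900), 2))  # 20.0
```

The low-precision problem the abstract raises is visible here: with a=2 instead of a=20, the point estimate is still large but rests on only two reports, which is exactly the situation where Bayesian shrinkage helps separate signal from noise.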

Lynette Hirschman, The MITRE Corporation

Title: Capture and Use of Free Text Information for Tracking Disease Outbreaks

This talk will provide a starting point for discussion of the application of text mining to tracking disease outbreaks. The talk reviews the types of texts that contain relevant information, from triage reports to medical records to journal articles and news wire reports. These texts contain different types of information, and the processing applied to them will also differ: for example, short triage reports can be useful for binning into prodromes, while medical records require information extraction technology to access specific details such as patient history or symptomology. The tools to support text mining have shown promising results, and there are systems in daily use that perform some of these functions. To date, text mining applications still require significant overhead, including careful selection of the task, location of sufficient training data (input/output pairs that cover the range of likely data that will be presented to the system), and customization of the system to the specific application. However, the technology has matured to the point where it should be possible for interdisciplinary teams to exploit text mining to access new text-based resources for tracking disease outbreaks.

Tomasz Imielinski, Division of Information and Computer Science, Rutgers University

Web page: http://www.cs.rutgers.edu/~imielins/

Title: Association Rule Mining of Biological Data Sets

Association rule mining (ARM) is emerging as one of the key data mining techniques applied to data sets in bioinformatics: genetic linkage data for complex diseases, gene expression microarray data, etc. Such data sets are often called horizontal, due to the relatively large ratio of the number of attributes (columns) to the number of records (rows). This calls for different algorithms, which are more data driven than attribute driven. But even with more efficient algorithms, for ARM to be useful to biologists it must overcome a far more critical obstacle: the number of generated rules is enormous, and their size (number of attributes in a rule) is often too large for the end user (scientist, biologist) to comprehend. Consequently, the picture presented by ARM to a biologist is often confusing and overwhelming: too many rules, which "all look the same". It almost does not matter how fast association rules can be generated when they can never be fully analyzed due to the sheer volume of the output. More importantly, a large fraction of the generated rules are produced by random chance or by interdependencies with other rules. Thus, a key question is how to find subsets of rules which are nonrandom and independent. Another is how to represent them in a form which is clear and useful to the end user, the scientist. This is a far more important problem than efficient rule generation. We will review the current state of the art in applications of ARM in genetics and discuss how to make the results of ARM more useful to the end user. We will also describe some of our own painful experiences in this process.
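One common first filter for rules produced by random chance is lift, the ratio of a rule's observed support to the support expected if antecedent and consequent were independent; a lift near 1 flags a rule explainable by chance alone. The toy records, item names, and thresholds below are illustrative only:

```python
# Toy binary records (hypothetical marker/phenotype flags; "d" = disease).
records = [
    {"m1", "m2", "d"},
    {"m1", "m2", "d"},
    {"m1", "d"},
    {"m2"},
    {"m3"},
    {"m3", "d"},
]
n = len(records)

def support(itemset):
    """Fraction of records containing every item in itemset."""
    return sum(itemset <= r for r in records) / n

def rules(min_support=0.3, min_lift=1.2):
    """Single-antecedent rules x -> d, filtered by support and lift.
    Lift = P(x, d) / (P(x) * P(d)); values near 1 suggest independence."""
    items = set().union(*records) - {"d"}
    out = []
    for x in sorted(items):
        s_xd = support({x, "d"})
        if s_xd < min_support:
            continue
        lift = s_xd / (support({x}) * support({"d"}))
        if lift >= min_lift:
            out.append((x, round(lift, 2)))
    return out

print(rules())  # [('m1', 1.5)]
```

Note how the filter behaves on this toy data: m2 co-occurs with d often enough to pass the support threshold, but its lift is 1, so it is discarded as chance; only m1 survives. Handling interdependencies among surviving rules, as the abstract stresses, requires more than this pairwise measure.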

David Madigan, Rutgers University, Department of Statistics

Web page: http://www.stat.rutgers.edu/~madigan/

Title: Data Mining Overview

Data Mining is a dynamic and fast growing field at the interface of Statistics and Computer Science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet commerce. The analysis of data on this scale presents exciting new computational and statistical challenges. This tutorial will provide an overview of current research in data mining with detailed descriptions of a couple of specific algorithms.

Ilya Muchnik and Jixin Li, Rutgers University

Title: Epidemiological Factors of Survival Time for Cancer Patients Discovered via SVM Learning Classification Method: An Experimental Study on SEER Data

In the talk, we will discuss a novel application of a learning classification method to estimating the significance of the influence of different variables on the basic target factors under consideration. From the perspective of estimating the significance of an epidemiological factor, our idea was to build a classifier able to distinguish cancer patients by the comparative length of survival time, based on available epidemiological and medical factors. We assumed that if this classifier could be constructed with a high accuracy level, one would be able to estimate the marginal influence of an individual factor by the partial derivative of the discriminant function with respect to that factor. This is the standard approach, called "sensitivity analysis" in control theory or "elasticity analysis" in mathematical economics. In our case, where most epidemiological factors are represented by Boolean variables, we applied combinatorial analogs of derivatives. We defined the coefficient describing the marginal influence of an epidemiological factor as the minimal difference in recognition accuracy between the whole population presented in the training data and a subset of the data on which the considered factor has a constant value. In other words, we evaluate the significance of a factor by the relative difference in accuracy between two cases: when the factor is free to vary and when it is fixed at a constant value.

Since this idea is effective only if the classifier built on the entire data set works better than random guessing, it is necessary to choose a learning classification method that is well known for its power and that at the same time provides default values for its "free" parameters, so that it can be applied without professional assistance. This criterion is extremely important if one wishes to treat the approach as a general method that can be applied to cancers other than prostate cancer and to other types of chronic disease. After evaluating all these requirements, we found that SVM methods are well suited to the task.

An experimental study was conducted on SEER data records for all prostate cancer patients registered during the period 1973-2000. Out of a total of about 335,000 patients, we selected about 52,000 for whom more complete information on certain variables is available.
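The combinatorial analogue of a partial derivative can be sketched in a few lines. Everything below is hypothetical: the records, the factor names, and the stand-in rule that replaces the trained SVM; the flip-based measure shown is one simple variant of such a sensitivity coefficient, not the authors' exact definition:

```python
# Hypothetical binary patient records and a stand-in classifier
# (in place of the trained SVM): predicts long survival iff f1 is set.
records = [
    {"f1": 1, "f2": 0}, {"f1": 1, "f2": 1},
    {"f1": 0, "f2": 0}, {"f1": 0, "f2": 1},
]

def classify(x):
    return x["f1"]

def influence(factor):
    """Combinatorial analogue of a partial derivative: the fraction of
    records whose prediction changes when the factor is flipped while
    all other factors are held fixed."""
    changed = 0
    for x in records:
        flipped = dict(x, **{factor: 1 - x[factor]})
        changed += classify(flipped) != classify(x)
    return changed / len(records)

print(influence("f1"), influence("f2"))  # 1.0 0.0
```

With a real discriminant function the influence would rarely be exactly 0 or 1; the accuracy-difference coefficient described in the abstract plays the same role while also accounting for how well the classifier performs on the restricted subpopulation.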

Kenton Morgan, Faculty of Veterinary Science, University of Liverpool, UK

Web page: http://pcwww.liv.ac.uk/vets/research/epidemiology/epidemiology.htm

Title: Observational Data Sets in Veterinary Epidemiology: The Challenges for New Data Mining Techniques

It is widely recognized that many diseases which are of importance to human and animal health are multi-factorial. The veterinary practitioner faced with the clinical control of these diseases often uses interventions that are based on the art rather than the science of medicine, e.g. interventions that have apparently worked in other situations or which, in the practitioner's experience, seem most likely to work in this case. Every intervention has a cost, be it chemotherapeutic agents, changes in grassland management, or housing conditions. There is a need for rational intervention strategies, based on some objective measure of the importance of different risk factors in the causal web of disease. These will enable practitioners to introduce interventions which result in the greatest reduction in risk for the least cost.

Non-hypothesis driven, observational epidemiological studies are a common starting point in identifying risk factors for a disease, measuring their effect and generating hypotheses of causation. The aim of these studies is to be inclusive and data on a large number of variables are often collected.

Analysis of these data sets, to produce information that is of practical use in preventive medicine, presents a number of methodological problems. These will be highlighted during this presentation using examples from a number of different observational studies carried out by our group.

Dave Ozonoff, Boston University, School of Public Health and Alex Pogel, New Mexico State University, PNSL

Web page: www.bumc.bu.edu/Departments/PageMain.asp?Page=1217&DepartmentID=97

Title: The Generalized Contingency Table, its Concept Lattice and Connections with 2x2 Tables

In this talk we present a new concept in epidemiology, the generalized contingency table, and discuss some of its mathematical foundations, derived from lattice theory (Ganter and Wille, 1999). As its name suggests, it is a generalization of the conventional contingency table, or cross-classification. Development of this idea leads to some new concepts in epidemiology and to their application in new methods of data analysis.

Greg Ridgeway, The Rand Corporation, Santa Monica, CA

Web page: http://www.rand.org/methodology/stat/

Title: Retooling Propensity Score Techniques with Machine Learning for Evaluating Solutions to the Los Angeles Drug Abuse Epidemic

In drug policy research we often find ourselves basing policy decisions on observational data, unable to afford or unable to implement an experiment to test the effect of a change in policy. Several strategies have been proposed to infer cause from observational data and among them propensity score methods have shown great promise. In this talk I develop a propensity scoring technique based on importance sampling. Borrowing from the machine learning toolbox, I use a flexible "boosted" logistic regression model to estimate the propensity score and the causal effect of the treatment on the treated. I apply the method to an evaluation of a Los Angeles drug treatment program.

Ingo Ruczinski, Johns Hopkins University, Department of Biostatistics

Web page: http://www.biostat.jhsph.edu/~iruczins/

Title: Finding Interactions and Assessing Variable Importance in SNP association studies

Exploring interactions between single nucleotide polymorphisms (SNPs) and their associations with disease is a common problem in genetic epidemiology. Particular challenges are the vast search space and the fact that the number of SNPs is often larger than the number of genotyped patients. Previously, we introduced Logic Regression as an adaptive regression methodology that can be used to explore such interactions, constructing predictors as Boolean combinations of binary covariates. We compare this methodology to other approaches such as CART, MARS, and Random Forests, and show how some MCMC-based procedures can be used to generate measures of variable importance.

Joint work with Charles Kooperberg - Fred Hutchinson Cancer Research Center.
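A Logic Regression predictor is, at bottom, a Boolean expression tree over binary covariates. The sketch below uses hypothetical SNP names and a hand-built tree rather than one fitted by the adaptive search; it only shows how such a predictor is evaluated, not how it is learned:

```python
# A logic-regression-style predictor is a Boolean expression over binary
# SNP indicators; this toy tree encodes (snp1 AND NOT snp2) OR snp3.
tree = ("or",
        ("and", ("var", "snp1"), ("not", ("var", "snp2"))),
        ("var", "snp3"))

def evaluate(node, x):
    """Recursively evaluate a Boolean expression tree on a genotype x
    (a dict mapping SNP indicator names to 0/1)."""
    op = node[0]
    if op == "var":
        return bool(x[node[1]])
    if op == "not":
        return not evaluate(node[1], x)
    if op == "and":
        return evaluate(node[1], x) and evaluate(node[2], x)
    if op == "or":
        return evaluate(node[1], x) or evaluate(node[2], x)
    raise ValueError(f"unknown operator {op!r}")

print(evaluate(tree, {"snp1": 1, "snp2": 0, "snp3": 0}))  # True
```

The adaptive search the abstract refers to proposes moves on such trees (grow, prune, swap an operator or a leaf) and scores each candidate within a regression model, which is also what makes MCMC-based variable-importance measures natural in this framework.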

Dona Schneider, Rutgers University, Environmental and Occupational Health Sciences Institute

Web page: www.eohsi.rutgers.edu/divisions/envpolicy.shtml

Title: Descriptive Epidemiology for Data Miners

Epidemiologists and data miners are trained in different paradigms. This overview of descriptive epidemiology explains how epidemiologists approach data, defines epidemiologic jargon, and points out pitfalls in the most commonly used U.S. health data sets.

William Shannon, Washington University School of Medicine

Web page: http://ilya.wustl.edu/~shannon

Title: Biostatistical Challenges in Molecular Epidemiology

Epidemiology is the study of the distribution and size of disease problems in human populations, in particular to identify etiological factors in the pathogenesis of diseases and to provide the data essential for the management, evaluation and planning of services for the prevention, control and treatment of disease (Everitt). Molecular epidemiology uses molecular biology to identify the etiological factors, and is a growing and important area of biomedical research.

Molecular epidemiology presents new challenges to data analysts. Modern molecular biology can measure tens of thousands of molecular variables rapidly and cheaply (e.g., gene chips measure the activity of tens of thousands of genes, genotyping is routinely done at hundreds or thousands of markers, and proteomics has the potential of characterizing the entire protein content of tissues). The limiting step in molecular epidemiology is the often small number of human subjects these measurements are made on (the large P, small N problem), not the ability to measure the factors. In this talk I address three statistical problems faced when analyzing molecular epidemiology data. The first is the proper identification of patient subgroups within which statistical tests of genotype-phenotype association should be applied. The second is the testing of clinical covariates against a large number of molecular variables. The third is the selection of important molecular factors related to disease. These problems can be defined in the language of classical statistics (i.e., population stratification, over-determined systems, and variable selection, respectively). However, I believe classical statistics will more often than not fail with molecular epidemiology, and I will argue that new ways of thinking about statistics will be needed for these data.

Bill Smith, USDA Forest Service

Title: The Exploration of Spatial Data Mining (and Mind Mining) to Model the Risk of Emerald Ash Borer (EAB) (Agrilus planipennis) and Its Likely Spread from Current Areas of Infestation

The emerald ash borer is an exotic wood-boring insect that attacks ash trees (Fraxinus spp.). The insect is indigenous to Asia and known to occur in China, Korea, Japan, Mongolia, the Russian Far East, and Taiwan. It eventually kills healthy ash trees after it bores beneath their bark and disrupts their vascular tissues. EAB has been found in the Detroit area of Michigan and the neighboring area of Windsor, Ontario, Canada. It has infested thousands of square miles, and an estimated 30 million ash trees are currently at risk. Current estimated losses are approximately $11.6 million in the landscape and wood products industries and $2 million in lost nursery stock sales, in addition to the loss of aesthetic, recreational, and habitat-providing values. It has been introduced into two areas in Ohio by the movement of wood stock for the implement industry, and to an area outside of Washington, DC, in nursery stock. Introduction into additional areas is likely through the movement of nursery stock, green lumber, firewood, and composted and uncomposted chips of the genus Fraxinus.

A current risk map has been developed based on the spatial intersection of (1) urban areas, (2) points of entry, (3) ash forests in proximity to urban areas, and (4) current areas of infestation. The current rule-based ordinal model will be revised based on (spatial) statistical procedures. Some of the factors that add complexity to the task are: the database is a combination of points (points of entry, infestation locations, nursery locations, mill locations), polylines (transportation routes), and polygons (host distributions); the data are censored, i.e., disclosure of the locations of nurseries that produce nursery stock, and of buyers, is voluntary; and not all infestations are known, and those that are carry unknown levels of certainty, e.g., the area around Detroit has been intensely sampled, while in other areas discovery has been serendipitous.

Carla Thomas, University of California at Davis and Leonard Coop and Hang-Kwang Luh, Oregon State University

Web page: http://pnwpest.org/coopl

Title: Pattern Analysis and Data Mining Efforts of the National Plant Diagnostic Network for Early Detection of Infectious Crop Disease and Pest Outbreaks

The National Plant Diagnostic Network coordinates data collection and analysis of records from plant diagnostic laboratories throughout the United States. These records can be compared to other diverse datasets, such as weather, soil, imagery, ecozone, land use, and transportation, to detect anomalies and provide early-warning syndromics in GIS and non-GIS platforms. Spatial scale differences need to be managed effectively to extract meaningful information during the data mining. A variety of statistical strategies for data mining can be used to identify factors significant to epidemic detection and development. The analyses also identify the relationships between these factors, as well as potential risk assessment models and expected outcomes. These analyses include, but are not limited to, principal component and cluster analysis. When possible, generic equations describing key system-limiting factors can greatly increase the speed of solution development. The same strategies of principal component data mining and generic modeling have previously been used with success in crop integrated pest management studies. Several case studies will be discussed, along with applications of Bayesian and spatial analysis.

Document last modified on March 11, 2004.