
Karsten Steinhaeuser
News & Events
May 2012 - The CRA has a new section called CS URGE with resources for undergraduates interested in doing research and applying for graduate school.
Oct 2012 - The Conference on Intelligent Data Understanding (CIDU) will be held at NCAR in Boulder, CO (new location!) on Oct 24-26; submissions are due on June 4.
Nov 2012 - The Discovery Informatics Symposium will be held as part of the AAAI Fall Symposium Series in Arlington, VA on Nov 2-4; submissions are due on June 5.
Contact
| phone | 612.626.7502 |
| fax | 612.625.0572 |
| e-mail |  |
| web | www.umn.edu/~ksteinha/ |
| snail mail | University of Minnesota, 4-192 Keller Hall, 200 Union St SE, Minneapolis, MN 55455 |
CV
Research Interests
Data mining and machine learning, specifically the construction and analysis of graphs/networks; large-scale data analysis, including parallel and distributed algorithms; applications to climate and earth sciences, ecology, biology, sustainability, medicine/healthcare, and social networks.
Bio
Karsten Steinhaeuser is a Research Associate in the Department of Computer Science and Engineering at the University of Minnesota. His primary responsibilities currently include two major research efforts: an NSF Expeditions in Computing on Understanding Climate Change: A Data Driven Approach and the GOPHER project, which is an R&D partner in the Planetary Skin Institute.
His research interests are centered around data mining and machine learning, in particular the construction and analysis of complex networks, with applications in diverse domains including (but not limited to) climate, ecology, and social networks. He is actively involved in shaping an emerging research area called climate informatics, which lies at the intersection of computer science and climate sciences, and his interests are more generally in interdisciplinary research and scientific problems relating to climate change and sustainability. He co-organizes the IEEE ICDM Workshop on Knowledge Discovery from Climate Data and the International Workshop on Climate Informatics, among others, and is engaged in numerous other professional service and community building activities.
Publications
Book Chapters
Pursuit of preventive healthcare relies on fundamental knowledge of the complex relationships between diseases and individuals. We take a step towards understanding these connections by employing a network-based approach to explore a large medical database. Here we report on two distinct tasks. First, we characterize networks of diseases in terms of their physical properties and emergent behavior over time. Our analysis reveals important insights with implications for modeling and prediction. Second, we immediately apply this knowledge to construct patient networks and build a predictive model to assess disease risk for individuals based on medical history. We evaluate the ability of our model to identify conditions a person is likely to develop in the future and study the benefits of demographic data partitioning. We discuss strengths and limitations of our method as well as the data itself to provide direction for future work.
K. Steinhaeuser and N. V. Chawla (2008).
Community Detection in a Large Real-World Social Network.
Social Computing, Behavioral Modeling, and Prediction, H. Liu, J.J. Salerno, M.J. Young (Eds.), Springer, 168-175.
Keywords: social network analysis, telecommunications data, community detection, node attributes
© Springer Science + Business Media, LLC 2008
Identifying meaningful community structure in social networks is a hard problem, and extreme network size or sparseness of the network compound the difficulty of the task. With a proliferation of real-world network datasets there has been an increasing demand for algorithms that work effectively and efficiently. Existing methods are limited by their computational requirements and rely heavily on the network topology, which fails in scale-free networks. Yet, in addition to the network connectivity, many datasets also include attributes of individual nodes, but current methods are unable to incorporate this data. Cognizant of these requirements we propose a simple approach that steers away from complex algorithms, focusing instead on the edge weights; more specifically, we leverage the node attributes to compute better weights. Our experimental results on a real-world social network show that a simple thresholding method with edge weights based on node attributes is sufficient to identify a very strong community structure.
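As a rough illustration of the thresholding idea described in this abstract, the sketch below weights each edge by the similarity of its endpoints' attributes, drops edges below a threshold, and reads communities off the remaining connected components. The abstract does not say which similarity measure or grouping step the paper uses; the Jaccard similarity, the connected-components step, and all function names here are assumptions made for illustration only.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two attribute collections (assumed weight function)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_communities(edges, attrs, threshold):
    """Keep edges whose attribute-based weight is >= threshold, then
    return the connected components of the pruned graph as communities."""
    adj = defaultdict(set)
    for u, v in edges:
        if jaccard(attrs[u], attrs[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)

    seen, communities = set(), []
    for start in attrs:
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        communities.append(component)
    return communities
```

Nodes left without any surviving edge come back as singleton communities, which makes the effect of the threshold easy to inspect.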
Refereed Journal Articles
Human populations are profoundly affected by water stress, or the lack of sufficient per capita available freshwater. Water stress can result from overuse of available freshwater resources or from a reduction in the amount of available water due to decreases in rainfall and stored water supplies. Analyzing the interrelationship between human populations and water availability is complicated by the uncertainties associated with climate change projections and population projections. We present a simple methodology developed to integrate disparate climate and population data sources and develop first-order per capita water availability projections at the global scale. Simulations from the coupled land-ocean-atmosphere Community Climate System Model version 3 (CCSM3) forced with a range of hypothetical greenhouse gas emissions scenarios are used to project grid-based changes in precipitation minus evapotranspiration as proxies for changes in runoff, or fresh water supply. Population growth changes, according to Intergovernmental Panel on Climate Change (IPCC) storylines, are used as proxies for changes in fresh water demand by 2025, 2050 and 2100. These freshwater supply and demand projections are then combined to yield estimates of per capita water availability aggregated by watershed and political unit. Results suggest that important insights might be extracted from the use of the process developed here, notably including the identification of the globe's most vulnerable regions in need of more detailed analysis and the relative importance of population growth versus climate change in altering future freshwater supplies. However, these are only exemplary insights and, as such, could be considered hypotheses that should be rigorously tested with multiple climate models, multiple observational climate datasets, and more comprehensive population change storylines.
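The aggregation step described in this abstract can be sketched as simple arithmetic: precipitation minus evapotranspiration stands in for freshwater supply in each grid cell, and the summed supply of a watershed or political unit is divided by its population. The units, the tuple layout, and the function name below are assumptions chosen for illustration; the abstract does not specify them.

```python
def per_capita_water(cells):
    """Per capita water availability (m^3 per person per year) for one unit.

    cells: iterable of (precip_m_per_yr, evapotranspiration_m_per_yr,
                        area_m2, population) tuples, one per grid cell
           (hypothetical layout).
    """
    supply_m3 = 0.0
    population = 0
    for precip, evap, area, people in cells:
        # P - ET is used as a proxy for runoff, scaled by cell area.
        supply_m3 += (precip - evap) * area
        population += people
    if population <= 0:
        raise ValueError("population must be positive")
    return supply_m3 / population
```

Comparing the result for present-day inputs against the result for projected precipitation, evapotranspiration, and population would give a first-order change in availability for that unit.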
A systematic characterization of multivariate dependence at multiple spatio-temporal scales is critical to understanding climate system dynamics and improving predictive ability from models and data. However, dependence structures in climate are complex due to nonlinear dynamical generating processes, long-range spatial and long-memory temporal relationships, as well as low-frequency variability. Here we utilize complex networks to explore dependence in climate data. Specifically, networks constructed from reanalysis-based atmospheric variables over oceans and partitioned with community detection methods demonstrate the potential to capture regional and global dependence structures within and among climate variables. Proximity-based dependence as well as long-range spatial relationships are examined along with their evolution over time, yielding new insights on ocean meteorology. The tools are implicitly validated by confirming conceptual understanding about aggregate correlations and teleconnections. Our results also suggest a close similarity of observed dependence patterns in relative humidity and horizontal wind speed over oceans. In addition, updraft velocity, which relates to convective activity over the oceans, exhibits short spatiotemporal decorrelation scales but long-range dependence over time. The multivariate and multi-scale dependence patterns broadly persist over multiple time windows. Our findings motivate further investigations of dependence structures among observations, reanalysis and model-simulated data to enhance process understanding, assess model reliability and improve regional climate predictions.
The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of "climate networks" have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
Analyses of climate model simulations and observations reveal that extreme cold events are likely to persist across each land-continent even under 21st-century warming scenarios. The grid-based intensity, duration and frequency of cold extreme events are calculated annually through three indices: the coldest annual consecutive three-day average of daily maximum temperature, the annual maximum of consecutive frost days, and the total number of frost days. The indices are compared over 2091-2100 versus 1991-2000 using nine global climate models forced with a moderate greenhouse-gas emissions scenario. The credibility of model-simulated cold extremes is evaluated through both bias scores relative to reanalysis data in the past and multi-model agreement in the future. The number of times the value of each annual index in 2091-2100 exceeds the decadal average of the corresponding index in 1991-2000 is counted. The results indicate that intensity and duration of grid-based cold extremes, when viewed as a global total, will often be as severe as current typical conditions in many regions, but the corresponding frequency does not show this persistence. While the models agree on the projected persistence of cold extremes in terms of global counts, regionally, inter-model variability and disparity in model performance tend to dominate. Our findings suggest that, despite a general warming trend, regional preparedness for extreme cold events cannot be compromised even towards the end of the century.
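The three annual indices named in this abstract can be computed for a single grid cell roughly as follows. This is a minimal sketch: the abstract does not define a frost day, so the code assumes the common convention of a daily minimum temperature below 0 degrees Celsius, and the function and parameter names are hypothetical.

```python
def cold_indices(tmax, tmin, frost_threshold=0.0):
    """Return (intensity, duration, frequency) for one year of daily data.

    tmax, tmin: daily maximum and minimum temperatures in degrees Celsius.
    frost_threshold: a day counts as a frost day when tmin is below this
                     value (assumed definition).
    """
    if len(tmax) != len(tmin) or len(tmax) < 3:
        raise ValueError("need two equally long series of at least 3 days")

    # Intensity: coldest consecutive three-day average of daily maximum temperature.
    coldest_3day = min(sum(tmax[i:i + 3]) / 3 for i in range(len(tmax) - 2))

    # Frequency: total number of frost days.
    frost = [t < frost_threshold for t in tmin]
    total_frost_days = sum(frost)

    # Duration: longest run of consecutive frost days.
    longest_run = run = 0
    for is_frost in frost:
        run = run + 1 if is_frost else 0
        longest_run = max(longest_run, run)

    return coldest_3day, longest_run, total_frost_days
```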
Climate change is a pressing focus of research, social and economic concern, and political attention. According to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC), increased frequency of extreme events will only intensify the occurrence of natural hazards, affecting global population, health, and economies. It is of keen interest to identify regions of similar climatological behavior to discover spatial relationships in climate variables, including long-range teleconnections. To that end, we consider a complex networks-based representation of climate data. Cross correlation is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
We compare and evaluate different metrics for community structure in networks. In this context we also discuss a simple approach to community detection, and show that it performs as well as other methods, but at lower computational complexity.
A. R. Ganguly, K. Steinhaeuser, D. J. Erickson III, M. L. Branstetter, E. Parish, N. Singh, J. B. Drake and L. Buja (2009).
Higher trends but larger uncertainty and geographic variability in 21st century temperature and heat waves.
Proceedings of the National Academy of Sciences USA, 106(37), 15555-15559.
Keywords: climate change, extremes, uncertainty, regional analysis
Generating credible climate change and extremes projections remains a high-priority challenge, especially since recent observed emissions are above the worst-case scenario. Bias and uncertainty analyses of ensemble simulations from a global earth systems model show increased warming and more intense heat waves combined with greater uncertainty and large regional variability in the 21st century. Global warming trends are statistically validated across ensembles and investigated at regional scales. Observed heat wave intensities in the current decade are larger than worst-case projections. Model projections are relatively insensitive to initial conditions, while uncertainty bounds obtained by comparison with recent observations are wider than ensemble ranges. Increased trends in temperature and heat waves, concurrent with larger uncertainty and variability, suggest greater urgency and complexity of adaptation or mitigation decisions.
Refereed Conference and Workshop Publications
Keywords: spatio-temporal data mining, teleconnections, climate data, dipoles
© ACM 2012
Dipoles represent long distance connections between the pressure anomalies of two distant regions that are negatively correlated with each other. Such dipoles have proven important for understanding and explaining the variability in climate in many regions of the world, e.g., the El Niño climate phenomenon is known to be responsible for precipitation and temperature anomalies worldwide. Systematic approaches for dipole detection generate a large number of candidate dipoles, but there exists no method to evaluate the significance of the candidate teleconnections. Statistical significance testing is an important mechanism that helps in assessing the relevance of the patterns generated to determine whether they are interesting or spurious, i.e., generated by random chance. In this paper, we present a novel method for testing the statistical significance of a class of spatio-temporal patterns called teleconnections or dipoles. One of the most important challenges in addressing significance testing in a spatio-temporal context is how to address the spatial and temporal dependencies that show up as high autocorrelation. We present a novel approach that uses the wild bootstrap to capture the spatio-temporal dependencies, in the special use case of teleconnections in climate data. Our approach to find the statistical significance takes into account the autocorrelation, the seasonality and the trend in the time series over a period of time. This framework is applicable to other problems in spatio-temporal data mining to assess the significance of the patterns.
Keywords: sparse regression, group lasso, climate data, multivariate predictive modeling
© SIAM 2012
The design of statistical predictive models for climate data gives rise to some unique challenges due to the high dimensionality and spatio-temporal nature of the datasets, which dictate that models should exhibit parsimony in variable selection. Recently, a class of methods that promote structured sparsity in the model has been developed, which is suitable for this task. In this paper, we prove theoretical statistical consistency of estimators with tree-structured norm regularizers. We consider one particular model, the Sparse Group Lasso (SGL), to construct predictors of land climate using ocean climate variables. Our experimental results demonstrate that the SGL model provides better predictive performance than the current state-of-the-art, remains climatologically interpretable, and is robust in its variable selection.
Various clustering methods have been applied to climate, ecological, and other environmental datasets, for example to define climate zones, automate land-use classification, and similar tasks. Measuring the "goodness" of such clusters is generally application-dependent and highly subjective, often requiring domain expertise and/or validation with field data (which can be costly or even impossible to acquire). Here we focus on one particular task: the extraction of ocean climate indices from observed climatological data. In this case, it is possible to quantify the relative performance of different methods. Specifically, we propose to extract indices with complex networks constructed from climate data, which have been shown to effectively capture the dynamical behavior of the global climate system, and compare their predictive power to candidate indices obtained using other popular clustering methods. Our results demonstrate that network-based clusters are statistically significantly better predictors of land climate than any other clustering method, which could lead to a deeper understanding of climate processes and complement physics-based climate models.
A. Pelan, K. Steinhaeuser, N. V. Chawla, D. A. de Alwis Pitts and A. R. Ganguly (2011). Empirical Comparison of Correlation Measures and Pruning Levels in Complex Networks Representing the Global Climate System.
IEEE Symposium Series on Computational Intelligence and Data Mining (CIDM), Paris, France.
Keywords: complex networks, climate data, correlation measures, network properties
© IEEE 2011
Climate change is an issue of growing economic, social, and political concern. Continued rise in the average temperatures of the Earth could lead to drastic climate change or an increased frequency of extreme events, which would negatively affect agriculture, population, and global health. One way of studying the dynamics of the Earth's changing climate is by attempting to identify regions that exhibit similar climatic behavior in terms of long-term variability. Climate networks have emerged as a strong analytics framework for both descriptive analysis and predictive modeling of the emergent phenomena. Previously, the networks were constructed using only one measure of similarity, namely the (linear) Pearson cross correlation, and were then clustered using a community detection algorithm. However, nonlinear dependencies are known to exist in climate, which begs the question whether more complex correlation measures are able to capture any such relationships. In this paper, we present a systematic study of different univariate measures of similarity and compare how each affects both the network structure as well as the predictive power of the clusters.
K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). Complex Networks in Climate Science: Progress, Opportunities and Challenges.
NASA Conference on Intelligent Data Understanding (CIDU), Mountain View, CA.
Keywords: complex networks, climate data, network properties, community detection, open questions
© NASA 2010
Networks have been used to describe and model a wide range of complex systems, both natural as well as man-made. One particularly interesting application in the earth sciences is the use of complex networks to represent and study the global climate system. In this paper, we motivate this general approach, explain the basic methodology, report on the state of the art (including our contributions), and outline open questions and opportunities for future research.
While data mining aims to identify hidden knowledge from massive and high dimensional datasets, the importance of dependence structure among time, space, and between different variables is less emphasized. Analogous to the use of probability density functions in modeling individual variables, it is now possible to characterize the complete dependence space mathematically through the application of copulas. By adopting copulas, the multivariate joint probability distribution can be constructed without constraint to specific types of marginal distributions. Some common assumptions, like normality and independence between variables, can also be relaxed. This study provides a fundamental introduction and illustration of dependence structure, aimed at the potential applicability of copulas in general data mining. The case study in hydro-climatic anomaly detection shows that the frequency of multivariate anomalies is affected by the dependence level between variables. The appropriate multivariate thresholds can be determined through a copula-based approach.
To discover patterns in historical data, climate scientists have applied various clustering methods with the goal of identifying regions that share some common climatological behavior. However, past approaches are limited by the fact that they either consider only a single time period (snapshot) of multivariate data, or they consider only a single variable by using the time series data as a multi-dimensional feature vector. In both cases, potentially useful information may be lost. Moreover, clusters in high-dimensional data space can be difficult to interpret, prompting the need for a more effective data representation. We address both of these issues by employing a complex network (graph) to represent climate data, a more intuitive model that can be used for analysis while also having a direct mapping to the physical world for interpretation. A cross correlation function is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
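The edge-weighting step described in this abstract might look roughly like the sketch below, which scores each pair of time series by its Pearson correlation over a small range of lags and keeps only strongly correlated pairs as edges. The abstract does not give the lag range, the way lags are combined, or the pruning rule; taking the lag with the largest absolute correlation and applying a fixed cutoff are assumptions, as are all names used here. The community detection step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cross_correlation_weight(x, y, max_lag):
    """Correlation with the largest magnitude over lags -max_lag..max_lag."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:len(x) + lag], y[-lag:]
        r = pearson(a, b)
        if abs(r) > abs(best):
            best = r
    return best


def build_network(series, max_lag, min_weight):
    """Return {(u, v): weight} for pairs whose |weight| is >= min_weight.

    series: dict mapping a location name to its time series (hypothetical layout).
    """
    names = sorted(series)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            w = cross_correlation_weight(series[u], series[v], max_lag)
            if abs(w) >= min_weight:
                edges[(u, v)] = w
    return edges
```

Building one such network per time window and comparing the resulting edge sets is one simple way to look at structural change between consecutive periods.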
C. Moretti†, K. Steinhaeuser†, D. Thain and N. V. Chawla (2008). Scaling Up Classifiers to Cloud Computers.
IEEE International Conference on Data Mining (ICDM), Pisa, Italy. † Equal Contribution
Keywords: distributed data mining, cloud computing, large datasets, scalability analysis
© IEEE 2008
As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
Knowledge discovery from temporal, spatial and spatiotemporal data is critical for climate change science and climate impacts. Climate statistics is a mature area. However, recent growth in observations and model outputs, combined with the increased availability of geographical data, presents new opportunities for data miners. This paper maps climate requirements to solutions available in temporal, spatial and spatiotemporal data mining. The challenges result from long-range, long-memory and possibly nonlinear dependence, nonlinear dynamical behavior, presence of thresholds, importance of extreme events or extreme regional stresses caused by global climate change, uncertainty quantification, and the interaction of climate change with the natural and built environments. This paper makes a case for the development of novel algorithms to address these issues, discusses the recent literature, and proposes new directions. An illustrative case study presented here suggests that even relatively simple data mining approaches can provide new scientific insights with high societal impacts.
K. Steinhaeuser and N. V. Chawla (2008). Is Modularity the Answer to Evaluating Community Structure in Networks?
International Conference on Network Science (NetSci), Norwich, UK.
Keywords: complex networks, community detection, evaluation metrics, modularity, rand index
A significant increase in the ability to collect and store diverse information over the past decade has led to an outright data explosion, providing larger and richer datasets than ever before. This proliferation in dataset size is accompanied by the dilemma of successfully analyzing this data to discover patterns of interest. Extreme dataset sizes place unprecedented demands on high-performance computing infrastructures, and a gap has developed between the available real-world datasets and our ability to process them; data volumes are quickly approaching Tera and Petabytes. This rate of increase also defies the subsampling paradigm, as even a subsample of data runs well into Gigabytes. To counter this challenge, we exploit advances in multi-threaded processor technology. We explore massive thread-level parallelism -- provided by the Cray MTA-2 -- as a platform for scalable data mining. We conjecture that such an architecture is well suited for the application of machine learning to large datasets. To this end, we present a thorough complexity analysis and experimental evaluation of a popular decision tree algorithm implemented using fine-grain parallelism, including a comparison to two more conventional architectures. We use diverse datasets with sizes varying in both dimensions (number of records and attributes). Our results lead us to the conclusion that a massively parallel architecture is an appropriate platform for the implementation of highly scalable learning algorithms.
K. Steinhaeuser, N. V. Chawla and P. M. Kogge (2006). Exploiting Thread-Level Parallelism to Build Decision Trees.
ECML/PKDD Workshop on Parallel Data Mining (PDM), Berlin, Germany.
Keywords: high-performance data mining, large datasets, cray mta-2
© Springer 2006
Classification is an important data mining task, and decision trees have emerged as a popular classifier due to their simplicity and relatively low computational complexity. However, as datasets get extremely large, the time required to build a decision tree still becomes intractable. Hence, there is an increasing need for more efficient tree-building algorithms. One approach to this problem involves using a parallel mode of computation. Prior work has successfully used processor-level parallelism to partition the data and computation. We propose to use Cray’s Multi-Threaded Architecture (MTA) and extend the idea by employing thread-level parallelism to reduce the execution time of the tree building process. Decision tree building is well-suited for such low-level parallelism as it requires a large number of independent computations. In this paper, we present the analysis and parallel implementation of the ID3 algorithm, along with experimental results.
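To make the "large number of independent computations" concrete, the sketch below shows the attribute-selection step of ID3, which scores every candidate attribute by information gain. Each attribute's score depends only on the data, not on the other scores, so the evaluations could in principle run concurrently. This is a plain sequential sketch with hypothetical function names and data layout, not the paper's MTA implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on one attribute.

    rows: list of dicts mapping attribute name to a categorical value.
    """
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest information gain.
    Each gain is computed independently of the others."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))
```

ID3 then partitions the rows by the chosen attribute and repeats the same selection on each partition until the labels in a partition are pure or no attributes remain.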
The management of wireless sensor networks in the presence of multiple constraints is an open problem in systems research. Existing methods perform well when optimized for a single parameter (such as energy, delay, network bandwidth). However, we might want to establish trade-offs on the fly, and optimize the information flow/exchange. This position paper shall serve as a preliminary proof-of-concept that techniques and algorithms from the machine learning and data mining domains can be applied to network data to learn relevant information about the routing behavior of individual nodes and the overall state of the network. We describe two simple examples which demonstrate the application of existing algorithms and analyze the results to illustrate their usefulness.
Technical Reports
K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate.
University of Notre Dame Technical Report TR-2010-07, Notre Dame, IN.
Keywords: complex networks, climate data, network analysis, community detection, multivariate predictive modeling
The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of "climate networks" have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
The US DOD exercise requires twelve maps, one for each month, of Arctic sea ice extent and thickness in the 2030s. The projections generated here use simulations from the Community Sea Ice Model component of CCSM3 forced with the A2 scenario. Specifically, the following variables are used here: sea ice concentration (sic) and sea ice thickness (sit).
The analysis presented here uses twelve CCSM3 model grid cells overlaying the Dominican Republic and the immediately surrounding areas. The outputs from CCSM3 were interpolated using commercial GIS. The results show average temperature and precipitation differences in 2025, 2050, and 2100 compared to current (2000) values. The numbers correspond to decadal averages, where the decades are around the corresponding years. The bounding coordinates of the analysis area used in this study are 72.4 to 66.7 degrees West and 16.8 to 21.0 degrees North. The center points of each grid cell were used for interpolation and analysis.
Last modified: May 10, 2012
The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.