Project Overview

Data Mining for the Discovery of Ocean Climate Indices

Ocean climate indices (OCIs), which are time series that summarize the behavior of selected areas of the Earth’s oceans, are important tools for predicting the effect of the oceans on land climate. In this work we describe the use of data mining to discover OCIs. In particular, we apply a shared nearest neighbor (SNN) clustering algorithm to the pressure and temperature time series associated with points on the ocean, yielding clusters that represent ocean regions with relatively homogeneous behavior. The centroids of these clusters are time series that summarize the behavior of these ocean areas and thus represent potential OCIs. To evaluate cluster centroids for their usefulness as potential OCIs, we must determine which cluster centroids significantly influence the behavior of well-defined land areas. For this task, we use a variety of approaches that analyze the correlation between potential OCIs and the time series (e.g., of temperature or precipitation) that describe the behavior of land points. Based on these approaches, we have identified some cluster centroids that are almost identical to well-known OCIs, e.g., the Southern Oscillation Index (SOI) and the North Atlantic Oscillation (NAO). We also introduce two strategies for validating potential OCIs that do not correspond to well-known (and probably “stronger”) OCIs: focusing on the correlation between “extreme” events on the ocean and land, and looking for more persistent patterns of correlation.
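The shared nearest neighbor idea mentioned above can be illustrated with a small Python sketch. It covers only the similarity step, not the full SNN clustering algorithm used in this work. The function name snn_similarity, the neighborhood size k, and the use of linear correlation to rank neighbors are assumptions made for the example.

```python
import numpy as np

def snn_similarity(series, k):
    """Shared nearest neighbor similarity between the rows of `series`.

    series: array of shape (n_points, n_months), one time series per ocean point.
    Returns an (n_points, n_points) integer matrix. Entry (i, j) is the number
    of neighbors shared by the k-nearest-neighbor lists of i and j, or 0 when
    i and j do not appear in each other's lists.
    """
    corr = np.corrcoef(series)               # pairwise linear correlation
    np.fill_diagonal(corr, -np.inf)          # a point is not its own neighbor
    knn = np.argsort(-corr, axis=1)[:, :k]   # k most correlated points per row

    n = series.shape[0]
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], knn] = True    # member[i, j]: j is in i's list

    counts = member.astype(int)
    shared = counts @ counts.T               # size of the overlap of two lists
    mutual = member & member.T               # both points list each other
    return np.where(mutual, shared, 0)
```

Two points then count as similar only when they agree on who their neighbors are, which is what makes this measure less sensitive to regions of differing density than a raw correlation threshold.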

 

Discovery of Patterns in the Global Climate System

We present preliminary work on using data mining techniques to find interesting spatio-temporal patterns in Earth Science data. The data consists of time series measurements for various Earth Science variables (e.g., soil moisture, temperature, and precipitation), along with additional data from existing ecosystem models (e.g., Net Primary Production). The ecological patterns of interest include associations, clusters, predictive models, and trends. We first discuss some of the challenges involved in preprocessing and analyzing the data. Earth Science data has strong seasonal components that need to be removed prior to pattern analysis, as Earth scientists are primarily interested in patterns that represent deviations from normal seasonal variation, such as anomalous climate events (e.g., El Niño) or trends (e.g., global warming). We compare several alternatives (including singular value decomposition (SVD), discrete Fourier transform (DFT), “monthly” Z score, and moving average) with respect to their effectiveness in removing seasonality. After preprocessing, we apply clustering and different kinds of association analysis to the data to discover spatio-temporal relationships among ecological variables in various parts of the Earth. Our current technique for finding associations extracts sets of events from the time series data and then applies existing algorithms traditionally used for market-basket data. We use K-means clustering to divide the land and ocean areas of the Earth into disjoint regions in an automatic, but meaningful, way that enables the direct or indirect discovery of interesting patterns.
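To make the "monthly" Z score concrete, here is a minimal Python sketch. It assumes the term means standardizing each value by the mean and standard deviation of its own calendar month across all years. The function name monthly_z_score and the requirement that the series cover whole years are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def monthly_z_score(x):
    """Remove seasonality from a monthly time series.

    x: 1-D sequence of monthly values whose length is a multiple of 12.
    Each value is replaced by (value - mean of its calendar month)
    divided by the standard deviation of that calendar month.
    """
    by_year = np.asarray(x, dtype=float).reshape(-1, 12)  # rows: years, columns: months
    mean = by_year.mean(axis=0)
    std = by_year.std(axis=0)
    std[std == 0] = 1.0            # constant months: avoid division by zero
    return ((by_year - mean) / std).ravel()
```

A purely seasonal signal maps to all zeros under this transform, so what remains in real data are the deviations from the typical value for each month.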

NASA News: "Data Mining Reveals a New History of Natural Disasters"

Feature: "NASA Finds Trees and Insect Outbreaks Affect Carbon Dioxide Levels"

 

Finding Spatio-Temporal Patterns in Earth Science Data

We present preliminary work on using data mining techniques to find interesting spatio-temporal patterns in Earth Science data. The data consists of time series measurements for various Earth Science and climate variables (e.g., soil moisture, temperature, and precipitation), along with additional data from existing ecosystem models (e.g., Net Primary Production). The ecological patterns of interest include associations, clusters, predictive models, and trends. We discuss some of the challenges involved in preprocessing and analyzing the data, and also consider techniques for handling some of the spatio-temporal issues. Earth Science data has strong seasonal components that need to be removed prior to pattern analysis, as Earth scientists are primarily interested in patterns that represent deviations from normal seasonal variation, such as anomalous climate events (e.g., El Niño) or trends (e.g., global warming). We compare several alternatives (including singular value decomposition (SVD), discrete Fourier transform (DFT), "monthly" Z score, and moving average) with respect to their effectiveness in removing seasonality. We describe the different kinds of association analysis that can be performed on such data. Our current technique for finding associations transforms the time series into transactions and then applies existing algorithms traditionally used for market-basket data. Some of the transformations lead to dense columns in the transaction matrices, causing exponential growth in the computing requirements. Furthermore, no single interestingness measure accurately reflects the quality of the derived patterns. Indeed, we argue that existing approaches for mining association rules and sequential patterns may not be able to capture all the interesting patterns because of the spatio-temporal nature of this data.
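One way to picture the step of transforming time series into transactions is the following Python sketch. It is an illustration only. The thresholds hi and lo, the item naming scheme (for example "precip-HI"), and the function names are assumptions, and the specific transformations used in the paper may differ.

```python
from collections import Counter
from itertools import combinations

def to_transactions(variables, hi=1.0, lo=-1.0):
    """Turn deseasonalized time series into market-basket transactions.

    variables: dict mapping a variable name to a sequence of anomaly values,
    all of the same length. Each time step becomes one transaction holding
    an item for every variable that is unusually high or low at that step.
    """
    n_steps = len(next(iter(variables.values())))
    transactions = [set() for _ in range(n_steps)]
    for name, values in variables.items():
        for t, value in enumerate(values):
            if value >= hi:
                transactions[t].add(f"{name}-HI")
            elif value <= lo:
                transactions[t].add(f"{name}-LO")
    return transactions

def pair_support(transactions):
    """Fraction of transactions that contain each pair of items."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return {pair: c / len(transactions) for pair, c in counts.items()}
```

With transactions in this form, standard frequent-itemset algorithms can be applied directly. Under the stated assumptions, a loose threshold would place an item in most transactions, which is one way a dense column could arise.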

 

Clustering Earth Science Data: Goals, Issues and Results

We report on recent work applying data mining to the task of finding interesting patterns in earth science data derived from global observing satellites, terrestrial observations, and ecosystem models. Patterns are "interesting" if ecosystem scientists can use them to better understand and predict changes in the global carbon cycle and climate system. The initial goal of the work reported here (which is only part of the overall project) is to use clustering to divide the land and ocean areas of the Earth into disjoint regions in an automatic, but meaningful, way that enables the direct or indirect discovery of interesting patterns. Finding "meaningful" clusters requires an approach that is aware of various issues related to the spatial and temporal nature of earth science data: the "proper" measure of similarity between time series, the removal of seasonality from the data to allow detection of non-seasonal patterns, and the presence of spatial and temporal autocorrelation (i.e., measured values that are close in time and space tend to be highly correlated, or similar). While we have techniques to handle some of these spatio-temporal issues (e.g., removing seasonality) and some issues are not a problem (e.g., spatial autocorrelation actually helps our clustering), other issues require more study (e.g., temporal autocorrelation and its effect on time series similarity). Nonetheless, by using K-means as our clustering algorithm and linear correlation as our measure of similarity between time series, we have been able to find some interesting ecosystem patterns, including some that are well known to earth scientists and some that require further investigation.
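The combination of K-means with linear correlation as the similarity measure can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work. The function names, the deterministic farthest-first choice of initial centroids, and the iteration limit are all assumptions made for the example.

```python
import numpy as np

def normalize_rows(series):
    """Center each time series and scale it to unit length."""
    centered = series - series.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # leave constant series as all zeros
    return centered / norms

def correlation_kmeans(series, k, n_iter=50):
    """Cluster time series, assigning each to its most correlated centroid."""
    z = normalize_rows(np.asarray(series, dtype=float))

    # Assumed initialization: start from row 0, then repeatedly add the row
    # that is least correlated with the centroids chosen so far.
    chosen = [0]
    while len(chosen) < k:
        best_sim = (z @ z[chosen].T).max(axis=1)
        chosen.append(int(best_sim.argmin()))
    centroids = z[chosen].copy()

    labels = None
    for _ in range(n_iter):
        # For normalized rows, the dot product equals the linear correlation.
        sim = z @ normalize_rows(centroids).T
        new_labels = sim.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = z[labels == c]
            if len(members):       # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

Each returned centroid is itself a time series, so it can be compared directly with other time series, for example by correlation.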