Research Interests

University of Munich Database Group

Currently, I am working with the database group at the University of Munich under the supervision of Prof. H.-P. Kriegel. My main interests are in the area of knowledge discovery in databases (KDD) and data mining.
The subject of my PhD thesis will be "Quality Driven Data Mining".

Knowledge Discovery in Databases

"Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Knowledge Discovery in Databases (KDD) is an interdisciplinary field that brings together techniques from various areas to address the problems of analyzing huge data sets and extracting knowledge from them. The birth of KDD was spurred by the rapid growth of almost all types of traditional (relational) and spatial databases, and by the advent of commercial data warehouses containing terabytes of data that established companies have accumulated over the last decades. These mountains of data contain information from such diverse sources as credit card transactions, telephone calls, space observatories, human genome research, supermarket purchase transactions (market basket data) and web clickstreams.
Classical techniques from the areas of statistics and on-line analytical processing (OLAP) were not designed to cope with today's large databases and the new demands on the power of the analysis method. To meet these requirements, new methods have been developed in the area of KDD.
KDD systems have been successfully applied by a number of companies.
  • GTE Data Services International uses a customer relationship management system built by Compaq to analyze detailed call records in real time.
  • National Semiconductor analyzed its web site usage statistics to find out how many clicks it took a user to get to a specific page on the site and, based on this, redesigned the web site and reduced the average number of clicks from seven to two.
Many more examples of real-world applications exist.

Data Mining

"Data Mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data."

The core of the knowledge discovery process is the data mining step. Hence, most work has focused on data mining methods. Among the most important data mining methods are clustering and its dual, outlier detection.

Hierarchical Clustering

"Clustering is the process of grouping the data records into meaningful subclasses (clusters), in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters."

  • Problem Description: Cluster analysis (also called segmentation) is a primary method for database mining. It can either be used as a stand-alone tool to get insight into and explain the distribution of a data set, e.g. to focus further analysis and data processing, or for pre-processing for other algorithms, which then operate on the detected clusters.
    Existing clustering algorithms can be classified into hierarchical and partitioning clustering algorithms. Partitioning algorithms construct a flat (single level) partition of a database into a set of k clusters. Hierarchical algorithms decompose a database into several levels of nested partitionings, generally represented by a tree that iteratively splits the data set into smaller subsets. In such a hierarchy, each node of the tree represents a cluster.
    Numerous applications for clustering exist. Segmenting a database is useful for identifying customer groups based on purchasing patterns. This information in turn can be used for Customer Relationship Management (CRM), e.g. to improve customer retention, select customers for direct mail campaigns or lower customer attrition rates. Clustering is also helpful for categorizing WWW documents, grouping genes and proteins that have similar functions, and detecting seismic faults by grouping the entries in an earthquake catalog. All these examples have in common that the better the quality of the clustering algorithm, the higher the benefits realized.
  • Project Overview: The basic idea of the hierarchical clustering method OPTICS is to order the data objects according to their "closeness". The theoretical basis and the practical applications of this idea are at the core of this project.
  • Further/Related Information: Data Mining at the LMU | Database Group at the LMU | University of Munich
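
To make the difference between a hierarchical and a partitioning result concrete, here is a minimal Python sketch. It is not OPTICS and not the method of this project: it assumes plain single-linkage agglomerative merging on one-dimensional points, and the function names single_linkage_levels and flat_partition are made up for illustration. It only shows how a hierarchy consists of nested partitionings, and how a flat partition into k clusters can be read off one level of it.

```python
from itertools import combinations


def single_linkage_levels(points):
    """Return the nested partitions produced by single-linkage merging.

    levels[0] puts every point in its own cluster; each later level merges
    the two closest clusters of the previous level, until one cluster is left.
    (Illustrative assumption: 1-D points and single linkage, not OPTICS.)
    """
    clusters = [[p] for p in sorted(points)]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Single linkage: distance between two clusters is the smallest
        # distance between any pair of their members.
        def gap(pair):
            a, b = pair
            return min(abs(x - y) for x in clusters[a] for y in clusters[b])

        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        clusters.sort()
        levels.append([list(c) for c in clusters])
    return levels


def flat_partition(levels, k):
    """Cut the hierarchy at the level that has exactly k clusters."""
    return next(level for level in levels if len(level) == k)
```

For the points 1.0, 1.2, 5.0, 5.1 and 9.0, cutting the hierarchy at k = 3 gives the clusters {1.0, 1.2}, {5.0, 5.1} and {9.0}; every coarser level is obtained by merging clusters of the level below it.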

Outlier Detection

"An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism."

  • Problem Description: Most KDD methods focus on finding patterns applicable to many objects. However, for applications such as detecting criminal activities of various kinds (e.g. in electronic commerce), rare events, deviations from the majority, or exceptional cases may be more interesting and useful than the common cases.
    From this definition it may seem that outliers are actually a nuisance, i.e. that they interfere with and obstruct the data mining process. For some applications this is true, and in this case, outliers should be identified and eliminated as quickly as possible. For other applications, outliers contain the most useful information. Examples include the above-mentioned detection of criminal activities such as credit card fraud, as well as pharmaceutical research and intrusion detection.
    Thus, outlier detection can be beneficially applied either as a pre-processing step, in which the data set is cleaned by removing outliers, or as the data mining step, if the outliers are expected to contain useful information for the given application. So far, only very few approaches are directly concerned with outlier detection. Clustering and outlier detection are closely related. From the viewpoint of a clustering algorithm, outliers are objects not located in clusters of a data set, usually called noise.
  • Project Overview: Existing work in outlier detection regards being an outlier as a binary property. For many scenarios, however, it is more meaningful to assign to each object a degree of being an outlier. Given a degree of outlierness for every object, the objects can be ranked according to this degree, giving the data mining analyst a sequence in which to analyze the outliers. This degree of outlierness is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. Developing a formal basis for this basic idea and applying it to practical problems is the focus of this research project.
  • Further/Related Information: Data Mining at the LMU | Database Group at the LMU | University of Munich
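
The idea of a degree of outlierness can be sketched in a few lines of Python. This page does not give the formulas, so the sketch below assumes the usual published definition of LOF (k-distance, reachability distance and local reachability density); the function name local_outlier_factors and the parameter name k are illustrative, and the code assumes Euclidean distance, Python 3.8 or later, and no duplicate points.

```python
import math


def local_outlier_factors(points, k):
    """Return one LOF score per point; scores well above 1 suggest outliers.

    Assumes points are coordinate tuples without duplicates and that
    1 <= k < len(points). Brute force, so only suitable for small inputs.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance = []   # distance from each point to its k-th nearest neighbor
    neighbors = []    # indices of all points within that k-distance
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_distance.append(kd)
        # Points tied at the k-distance all belong to the neighborhood.
        neighbors.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j.
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / mean_reach)

    # LOF: average density of the neighbors relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]
```

Sorting the objects by descending score then yields the ranking described above: for four points on a unit square plus one distant point, the square points score about 1 and the distant point scores far higher.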

Stanford University Database Group

  • STRIP-project
    • Project Overview: The goals of the STRIP project are to build a real-time relational main-memory database capable of over 1000 transactions per second, to support seamless data sharing with conventional relational databases, and to study the practical issues related to real-time concurrency control and value-function scheduling.
    • My Contribution: My work is on value-function-based scheduling. Each task in the system is assigned a priority by means of a value function. This value function maps a time to a value and models the value the scheduler realizes for finishing the task at this time. In this context, the scheduler tries to gain as high a total value as possible. I studied how different schedulers react to different value functions.
    • Further/Related Information: STRIP homepage | Stanford DB Group | Stanford University
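
Value-function scheduling can be illustrated with a small Python simulation. This is not STRIP code: the Task class, the two value-function shapes (step_value, linear_decay) and the two scheduling policies are hypothetical, and the sketch assumes a single processor running tasks back to back without preemption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    duration: float
    value_at: Callable[[float], float]  # completion time -> value realized


def step_value(value, deadline):
    """Full value up to the deadline, nothing afterwards (assumed shape)."""
    return lambda t: value if t <= deadline else 0.0


def linear_decay(value, deadline, rate):
    """Full value up to the deadline, then linear decay to zero (assumed shape)."""
    return lambda t: value if t <= deadline else max(0.0, value - rate * (t - deadline))


def run_schedule(tasks, pick):
    """Run all tasks one after another; return the total value realized."""
    now, total, pending = 0.0, 0.0, list(tasks)
    while pending:
        task = pick(pending, now)
        pending.remove(task)
        now += task.duration
        total += task.value_at(now)
    return total


def shortest_first(pending, now):
    """Policy that ignores value functions and runs the shortest task next."""
    return min(pending, key=lambda t: t.duration)


def greedy_value(pending, now):
    """Policy that runs the task realizing the most value if started now."""
    return max(pending, key=lambda t: t.value_at(now + t.duration))
```

With a task A (duration 2, value 10 until time 2) and a task B (duration 1, value 3 until time 3), both using step_value, shortest_first realizes a total value of 3 while greedy_value realizes 13. Swapping A's value function for linear_decay with rate 1 narrows the gap to 12 versus 13, which is the kind of sensitivity to the value function that the study looked at.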

  • TSIMMIS-project
    • Project Overview: As an acronym, TSIMMIS stands for "The Stanford-IBM Manager of Multiple Information Sources." In addition, TSIMMIS is a Yiddish word for a stew with "heterogeneous" fruits and vegetables integrated into a surprisingly tasty whole. The goal of the TSIMMIS Project is to develop tools that facilitate the rapid integration of heterogeneous information sources that may include both structured and semistructured data. TSIMMIS has components that: translate queries and information (source wrappers); extract data from World Wide Web sites; combine information from several sources (mediator); allow browsing of data sources over the Web. The TSIMMIS project is funded by DARPA.
    • My Contribution: My job was to implement a number of library functions that facilitate the rapid development of new source wrappers. As every data source on the web has its own format, each needs its own dedicated wrapper, so the ability to quickly develop new wrappers (in order to add new data sources) is crucial.
    • Further/Related Information: TSIMMIS homepage | Stanford DB Group | Stanford University
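
The division of labor between source wrappers and a mediator can be sketched as follows. This is not the TSIMMIS library: the two source formats, the function names csv_wrapper, keyvalue_wrapper and mediator, and the use of plain dictionaries as the common data model are all assumptions made for illustration.

```python
import csv
import io


def csv_wrapper(text):
    """Wrap a CSV source: the header row names the fields of each record."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def keyvalue_wrapper(text):
    """Wrap a source of 'key: value' lines with blank lines between records."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        records.append(record)
    return records


def mediator(sources, **conditions):
    """Query every wrapped source and combine the matching records.

    sources is a list of (wrapper, raw_text) pairs; conditions are
    field=value equality filters applied to the common record format.
    """
    results = []
    for wrapper, raw in sources:
        for record in wrapper(raw):
            if all(record.get(k) == v for k, v in conditions.items()):
                results.append(record)
    return results
```

Adding a new data source then only requires writing one more small wrapper function; the mediator and the queries against it stay unchanged.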

Technical University of Munich

  • Concurrent Object-Oriented Programming
    • Project Overview: Designing and implementing parallel or distributed programs is one of the most complex and error-prone tasks a programmer faces. The object-oriented paradigm, which exhibits the two very important facilities of encapsulation and message passing, seems to be ideally suited for alleviating the complexity of parallel programming. Not surprisingly, many of the languages specifically designed for the development of parallel or distributed applications employ object-oriented techniques.
    • My Contribution: In my diploma thesis, I categorize the most relevant and influential parallel and distributed object-oriented programming languages. Their respective advantages and drawbacks are discussed in detail, and solutions to two of the main problems in this area are proposed.
    • Further/Related Information: Technical University of Munich
