Supervised Projects
Advanced Project:
Designing and Evaluating a Relational Database for a Publication-Search-Engine in
Intranets.
- Problem Description: Like all research groups, the database
group at the University of Munich has published quite a number of papers
in recent years. This information is an important part of our presence
on the web and receives many page hits every day (from the intranet by
group members as well as from the internet). Currently, this information
resides in a flat file, with a rather primitive query interface and no
update/insert functionality (apart from simple text editors).
- Solution: In this project, we design, implement and evaluate
a relational database for the publications. The easy-to-use web interface
consists of a powerful query engine and a convenient update/insert interface.
Technically, we use a 3-tier architecture: a MySQL database engine running
under Linux contains the publication data, and the Apache web server hosts
the static HTML pages and the CGI-based Java programs, which communicate with
the database engine through JDBC.
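As a rough illustration of what such a relational design could look like, the sketch below models publications, authors and their many-to-many relationship, and runs one typical query. Everything in it is an assumption made for illustration: the table and column names, the helper functions `insert_publication` and `titles_by_author`, and the sample rows are invented, and Python's built-in `sqlite3` stands in for the MySQL/JDBC stack so that the example runs without a server.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publication (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        year  INTEGER NOT NULL
    );
    CREATE TABLE author (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- Many-to-many link: a paper has several authors, an author several papers.
    CREATE TABLE authorship (
        publication_id INTEGER NOT NULL REFERENCES publication(id),
        author_id      INTEGER NOT NULL REFERENCES author(id),
        position       INTEGER NOT NULL,
        PRIMARY KEY (publication_id, author_id)
    );
""")


def insert_publication(title, year, authors):
    """Insert one paper and link it to its authors, creating authors on demand."""
    cur = conn.execute(
        "INSERT INTO publication (title, year) VALUES (?, ?)", (title, year)
    )
    pub_id = cur.lastrowid
    for position, name in enumerate(authors, start=1):
        # "INSERT OR IGNORE" is SQLite syntax (MySQL spells it "INSERT IGNORE").
        conn.execute("INSERT OR IGNORE INTO author (name) VALUES (?)", (name,))
        (author_id,) = conn.execute(
            "SELECT id FROM author WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO authorship VALUES (?, ?, ?)", (pub_id, author_id, position)
        )
    conn.commit()
    return pub_id


def titles_by_author(name):
    """Return the titles of all papers by the given author, newest first."""
    rows = conn.execute(
        """SELECT p.title FROM publication p
           JOIN authorship a ON a.publication_id = p.id
           JOIN author au ON au.id = a.author_id
           WHERE au.name = ?
           ORDER BY p.year DESC, p.title""",
        (name,),
    )
    return [title for (title,) in rows]


# Made-up sample rows, not real publications.
insert_publication("Example Paper A", 1999, ["Author One", "Author Two"])
insert_publication("Example Paper B", 2000, ["Author Two"])
```

Compared with a flat file, the separate `authorship` table is what makes author queries and updates cheap: an author's name is stored once, and each paper only refers to it.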
Advanced Project:
Speeding up Hierarchical Clustering using Data Compression.
- Problem Description: Clustering algorithms are an important
data mining method. They identify groups in the data, such that the
data objects in a group are as similar to each other as possible while
the data objects from different groups are as different as possible
(data segmentation). While in flat clustering algorithms the groups
are all on one level, in hierarchical methods the groups can be nested
inside each other (e.g. a large group of customers buying fiction books
that contains two smaller groups, one of customers buying mostly
science fiction and one of customers buying mostly detective stories).
Hierarchical methods provide results of high quality, but these
algorithms usually have a runtime that grows at least quadratically
with the number of objects to segment, making them impractical for
very large databases.
- Solution: In order to facilitate the application of hierarchical
clustering methods to very large databases, we investigate combining the
clustering algorithm with a data compression algorithm. This improves the
runtime dramatically, but incurs a loss in accuracy (quality) of the result.
We therefore extend the clustering method to make maximal use of the
information contained in the compressed data items, thus keeping the loss
of quality as small as possible.
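The following is a minimal sketch of the general idea, under stated assumptions. The text does not say which compression algorithm is used, so a simple grid summary (point count plus centroid per cell) stands in for it, and a naive single-link method stands in for the hierarchical clustering algorithm. The names `compress`, `link_distance` and `single_link` and the toy data are illustrative only.

```python
from itertools import combinations
from math import dist


def compress(points, cell):
    """Summarize the points of each grid cell as (count, centroid).

    Assumption: this grid summary is a stand-in for the real compression step.
    """
    cells = {}
    for p in points:
        key = tuple(int(c // cell) for c in p)
        cells.setdefault(key, []).append(p)
    summaries = []
    for members in cells.values():
        n = len(members)
        centroid = tuple(sum(c) / n for c in zip(*members))
        summaries.append((n, centroid))
    return summaries


def link_distance(items, c1, c2):
    """Single-link distance: the smallest distance between members of two clusters."""
    return min(dist(items[i], items[j]) for i in c1 for j in c2)


def single_link(items, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: link_distance(items, clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters


# Toy data: two well-separated blobs of 100 points each.
blob_a = [(x * 0.1, y * 0.1) for x in range(10) for y in range(10)]
blob_b = [(10 + x * 0.1, 10 + y * 0.1) for x in range(10) for y in range(10)]

summaries = compress(blob_a + blob_b, cell=0.5)
centroids = [c for _, c in summaries]
clusters = single_link(centroids, k=2)

# The stored counts recover how many original objects each cluster represents.
sizes = sorted(sum(summaries[i][0] for i in cluster) for cluster in clusters)
```

In this toy setting the 200 points form 19,900 distinct pairs, while the 8 summaries form only 28. Keeping the count in each summary is one simple way of using the information in the compressed items, since it lets the result be reported in terms of the original objects.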
Advanced Project:
Data Compression in Java.
- Problem Description: This project builds on the results of the
project "Speeding up Hierarchical Clustering using Data Compression"
discussed above. The data compression method we used in this project
is part of a large software distribution implemented in C/C++. Our
data mining software is mostly implemented in Java, causing lots of
data conversion and code problems.
- Solution: The data compression algorithm is extracted from
the large system, ported to Java and integrated into our existing
system.
Advanced Project:
Next Generation Sampling: Recovering Lost Information.
- Problem Description: available soon...
- Solution:
Advanced Project:
Evolving Optimal Cluster Descriptions Using Genetic Algorithms.
- Problem Description: Clustering algorithms are an important
data mining method. They identify groups in the data, such that the
data objects in a group are as similar to each other as possible while
the data objects from different groups are as different as possible
(data segmentation). The results of many clustering algorithms are sets of
objects belonging to the same group. However, for the human data analyst
such a set of objects is hard to analyze further, so concise and easy
to understand descriptions of the clusters are needed. This is a hard
problem, as the clusters can often be of arbitrary shape, e.g. contain
holes.
- Solution: In the project we apply the search strategy
pioneered by genetic algorithms to compute descriptions of clusters,
given the sets of objects belonging to the clusters. Genetic algorithms
mimic the natural processes of cross-over, mutation etc. They start with
a large number of "individuals" (cluster descriptions) and assign each
one a "fitness" value representing how well the description fits the
clusters. Then, the best individuals are allowed to propagate and create
offspring for further generations. After a number of generations, the
best individual describes the clusters very well.
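The loop described above can be sketched as follows. This is a toy example under stated assumptions, not the project's method: a cluster description is assumed to be a single axis-parallel rectangle, the fitness function (cluster points covered minus foreign points covered) is invented for illustration, and the data, population size and mutation rate are arbitrary. A single rectangle cannot describe a cluster with holes, so a realistic description language would have to be richer.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up toy data: the cluster fills the square [2, 6] x [2, 6] on a 10 x 10 grid.
cluster = [(x, y) for x in range(2, 7) for y in range(2, 7)]
others = [(x, y) for x in range(10) for y in range(10) if (x, y) not in cluster]


def covers(box, p):
    """True if point p lies inside the rectangle given by two opposite corners."""
    x1, y1, x2, y2 = box
    return min(x1, x2) <= p[0] <= max(x1, x2) and min(y1, y2) <= p[1] <= max(y1, y2)


def fitness(box):
    """Cluster points covered minus foreign points covered (higher is better)."""
    return sum(covers(box, p) for p in cluster) - sum(covers(box, p) for p in others)


def crossover(a, b):
    """The child takes each coordinate from one of its two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(box, rate=0.3):
    """Shift each coordinate by a small random amount with probability `rate`."""
    return tuple(c + rng.uniform(-1, 1) if rng.random() < rate else c for c in box)


def evolve(pop_size=40, generations=60):
    """Run the genetic search; return the best box and the best fitness per generation."""
    population = [tuple(rng.uniform(0, 9) for _ in range(4)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 4]  # the fittest quarter survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history


best, history = evolve()
```

Because the fittest individuals are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next; the highest possible value here is 25, reached by a rectangle that covers exactly the cluster.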
Diploma Thesis:
Incremental Hierarchical Clustering.
- Problem Description: Many databases used for knowledge discovery
are dynamic, i.e. new objects are added and old ones deleted. An example
is web log data: every time a user accesses a web page, a new page hit is
recorded and added to the database. As these databases usually get very
large, running a clustering algorithm on such a database becomes more and
more expensive.
- Solution: The idea of incremental algorithms is to re-use the
information from the last run of the algorithm, and simply update this
with the new information added in the meantime. In this project, our
hierarchical clustering algorithm OPTICS is extended to make such an
incremental version possible.
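The principle of updating instead of recomputing can be shown on a much simpler structure than OPTICS. The sketch below is not the incremental OPTICS algorithm; it only assumes a hypothetical per-cluster summary (object count and coordinate sums) and shows that insertions and deletions can be applied to the stored summary without touching the other objects again. The class name `ClusterSummary` and the sample points are invented.

```python
class ClusterSummary:
    """Count and per-dimension coordinate sums: enough to recover the centroid."""

    def __init__(self, dims):
        self.n = 0
        self.sums = [0.0] * dims

    def insert(self, point):
        """Account for a newly added object in constant time per dimension."""
        self.n += 1
        self.sums = [s + c for s, c in zip(self.sums, point)]

    def delete(self, point):
        """Remove a previously inserted object from the summary."""
        self.n -= 1
        self.sums = [s - c for s, c in zip(self.sums, point)]

    def centroid(self):
        return tuple(s / self.n for s in self.sums)


def centroid_from_scratch(points):
    """Reference computation that reads every object again."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


# State after the "last run" on three made-up objects.
summary = ClusterSummary(dims=2)
for p in [(0, 0), (2, 2), (4, 0)]:
    summary.insert(p)

# Changes since then: one object added, one deleted.
summary.insert((4, 4))
summary.delete((0, 0))

current_points = [(2, 2), (4, 0), (4, 4)]
```

After the two updates, `summary.centroid()` agrees with `centroid_from_scratch(current_points)`, although only the two changed objects were processed. An incremental hierarchical algorithm has to achieve the same effect for a far richer result than a centroid.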