
Project Summary

The growth of networked and distributed systems in several application domains has been explosive in the past few years, and it has changed the way we reason about distributed systems in many ways. One issue of decisive importance is the following: what model should be used for large-scale interactive or mission-critical applications?

The traditional approach when large-scale, unpredictable and unreliable infrastructures are involved (e.g. the Internet) has been to use asynchronous models. For a large number of applications this approach has served well, since uncertainty about the provision of service was tolerated.

However, many of the emerging services have interactivity or mission-criticality requirements, which are best translated into requirements for fault tolerance and real-time behavior. Examples can be found in the telecommunications area, where new services such as toll-free numbers, call forwarding, or call redirection based on specific information such as locality must be highly available and responsive to fulfill user demands.

These requirements are expressed as timeliness specifications, which in essence call for synchronous system models. Under these models there are known bounds for essential timing variables, such as processing speed or communication delay. However, correct operation under fully synchronous models is very difficult to achieve (if possible at all) in the large-scale infrastructures we are aiming at, since they have poor baseline timeliness properties.

This project aims to investigate, in the first place, the steps needed to define a new model suitable for mission-critical applications. The crucial aspect is timing fault tolerance in the context of real-time systems. We intend to formalize assumptions about system timeliness, and then develop what we call a Timing Failure Detector, in order to perfectly detect all violations of timeliness. There are several ways to treat the problem afterwards, but we plan to study the use of replication to mask timing faults. Previously known failure detectors were of the crash type only; ours also detect timing failures, and in that sense are more accurate. Besides, replication has not been used previously in the context of timing faults, and as such this approach is innovative. Our research will therefore concentrate on the definition of a set of basic services (a Timing Failure Detection Service, TFDS, and a Replica Management Service, RMS) which aim at providing the functionality just described by means of a suitable programming interface, with provisions for timeliness specifications.
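To make the idea of timing failure detection concrete, the following is a minimal sketch in Python, not the project's TFDS interface. It assumes a single local clock whose readings are passed in explicitly, and a per-action bound on duration; the names TimedAction, TimingFailureDetector, start, finish and timing_failures are hypothetical and chosen only for illustration. It does not address the distributed aspects (clock synchronization, message delays) that a real service would have to handle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TimedAction:
    """An action with a declared timeliness bound (hypothetical type)."""
    name: str
    start: float                 # start time in seconds, read from a local clock
    bound: float                 # maximum allowed duration in seconds
    end: Optional[float] = None  # completion time, or None while still running


@dataclass
class TimingFailureDetector:
    """Illustrative local detector: flags actions that exceed their bound."""
    actions: Dict[str, TimedAction] = field(default_factory=dict)

    def start(self, name: str, now: float, bound: float) -> None:
        # Record when the action started and how long it is allowed to take.
        self.actions[name] = TimedAction(name, now, bound)

    def finish(self, name: str, now: float) -> None:
        # Record the completion time of a previously started action.
        self.actions[name].end = now

    def timing_failures(self, now: float) -> List[str]:
        # An action has failed if it finished late, or if it is still
        # running and its bound has already expired at time `now`.
        failed = []
        for action in self.actions.values():
            reference = action.end if action.end is not None else now
            if reference - action.start > action.bound:
                failed.append(action.name)
        return sorted(failed)
```

In this sketch, a caller would invoke start when a request is issued to a replica, finish when the reply arrives, and timing_failures periodically; a replica management layer could then ignore or replace the replicas whose actions are reported as late.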

The final target is to build a proof-of-concept prototype application. We devise a scenario that simulates the problems of a specialized telecommunication application, with requirements for fault tolerance and timeliness. A replicated database scheme will be used to achieve tolerance to both crash and timing faults. The problems of replica management (tolerance to timing failures included) should be handled by the previously defined services (TFDS and RMS).

The validation of the prototype is extremely important, since it is a whole model that we are validating, not just a particular application. Consequently, the project adopts the fault injection methodology. A fault injection campaign will address both the validity of the assumptions about the system and the perfection of our timing failure detector, that is, whether it detects all timeliness violations.



For problems or questions regarding this web site, contact [pmartins@di.fc.ul.pt].

Last updated: November 09, 2000