The growth of networked and distributed systems in several
application domains has been explosive in the past few years. This has changed
the way we reason about distributed systems in many ways. One issue of
decisive importance is the following: which model should be used for large-scale
interactive or mission-critical applications?
The traditional approach when large-scale, unpredictable and
unreliable infrastructures are at stake (e.g. the Internet) has been to use
asynchronous models. For a large number of applications this approach has served
well, since uncertainty about the provision of service was tolerated.
However, a large part of the emerging services has
interactivity or mission-criticality requirements, which are best translated
into requirements for fault-tolerance and real-time. Examples of this can be
found in the telecommunications area, where new services such as toll-free
numbers, call forwarding or call redirection based on specific information
such as locality must be highly available and responsive to fulfill user
This behavior is materialized by
specifications, which in essence call for synchronous system models. Under these
models there are known bounds for essential timing variables, such as processing
speed or communication delay. However, correct operation under fully synchronous
models is very difficult to achieve (if at all possible) in the large-scale
infrastructures we are aiming at, since they have poor baseline timeliness
This project aims, first, to investigate the
steps needed to define a new model suitable for mission-critical
applications. The crucial aspect is timing fault-tolerance in the context of
real-time systems. We intend to formalize assumptions about system timeliness,
and then develop what we call a Timing Failure Detector, in order to perfectly
detect all violations of timeliness. There are several ways to treat the problem
afterwards, but we plan to study the use of replication to mask timing faults.
Previously known failure detectors were of the crash type only; our detectors
are more accurate. Besides, replication has not been used previously in the
context of timing faults, and as such, this approach is innovative. Our research
will therefore concentrate on the definition of a set of basic services (a
Timing Failure Detection Service - TFDS, and a Replica Management Service -
RMS) which aim at providing the functionality just described, by means of a
suitable programming interface, with provisions for timeliness specifications.
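
To make the intended division of labour more concrete, the following is a minimal sketch of how a timing failure detector and replication could combine to mask a late replica. It is an illustration under stated assumptions, not the project's TFDS or RMS interface: the names `TimingFailureDetector`, `observe` and `masked_read` are hypothetical, the durations are supplied as already-measured numbers rather than taken from a real clock, and the 100 ms bound is an arbitrary example value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TimingFailureDetector:
    """Hypothetical detector: compares observed durations against a bound."""
    # Each entry is (replica name, observed duration, bound that was violated).
    failures: List[Tuple[str, float, float]] = field(default_factory=list)

    def observe(self, name: str, duration: float, bound: float) -> bool:
        """Return True if the action was timely; record a timing failure otherwise."""
        if duration > bound:
            self.failures.append((name, duration, bound))
            return False
        return True


def masked_read(
    replicas: Dict[str, Tuple[str, float]],
    bound: float,
    detector: TimingFailureDetector,
) -> Optional[str]:
    """Return the value of the first replica that answered within `bound`.

    `replicas` maps a replica name to a (value, observed duration) pair.
    Every replica is reported to the detector, so late ones are recorded
    even when another replica masks the fault. Returns None when all
    replicas are late, i.e. when the timing fault cannot be masked.
    """
    result = None
    for name, (value, duration) in replicas.items():
        timely = detector.observe(name, duration, bound)
        if timely and result is None:
            result = value
    return result


# Simulated observations (seconds): r1 is late, r2 and r3 are timely.
detector = TimingFailureDetector()
replicas = {
    "r1": ("route-A", 0.180),
    "r2": ("route-A", 0.040),
    "r3": ("route-A", 0.055),
}
answer = masked_read(replicas, bound=0.100, detector=detector)
late = [name for name, _, _ in detector.failures]
print(answer, late)
```

In this sketch the caller still obtains a timely answer even though one replica violated its bound, and the violation remains visible in `detector.failures`. A real detector would additionally have to report violations in a timely manner across nodes, which this local, after-the-fact check does not attempt.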
The final target is to build a proof-of-concept prototype
application. We devise a scenario that simulates the problems of a specialized
telecommunication application, with requirements for fault-tolerance and
timeliness. A replicated database scheme will be used to achieve tolerance
to both crash and timing faults. The problems of replica management (tolerance
to timing failures included) should be handled by the services (TFDS and RMS)
The validation of the prototype is extremely important, since
it is a whole model that we are validating, not just a particular application.
Consequently, the project adopts a fault injection methodology. A fault
injection campaign will assess both the validity of the assumptions about the
system and whether our timing failure detector is indeed perfect.
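
As a rough illustration of what such a campaign might measure, the sketch below injects artificial delays into simulated runs and compares the detector's verdict with the known ground truth. Everything here is an assumption made for the example: the function names `detect` and `campaign`, the injection probability, the duration ranges and the seeded pseudo-random generator are hypothetical, and the detector under test is an idealized local check rather than the project's detector.

```python
import random


def detect(duration: float, bound: float) -> bool:
    """Detector under test (idealized stand-in): flag a run that exceeds its bound."""
    return duration > bound


def campaign(runs: int, bound: float, inject_prob: float, seed: int = 0) -> dict:
    """Run a simulated fault injection campaign and count detector errors.

    A missed detection is an injected timing fault that was not flagged;
    a false alarm is a flagged run in which no fault was injected.
    """
    rng = random.Random(seed)  # seeded so the campaign is repeatable
    injected = missed = false_alarms = 0
    for _ in range(runs):
        # Nominal duration always stays below the bound (20% to 80% of it).
        duration = rng.uniform(0.2, 0.8) * bound
        fault = rng.random() < inject_prob
        if fault:
            injected += 1
            # Injected delay of at least one full bound guarantees a violation.
            duration += bound + rng.uniform(0.0, bound)
        flagged = detect(duration, bound)
        if fault and not flagged:
            missed += 1
        if flagged and not fault:
            false_alarms += 1
    return {"runs": runs, "injected": injected,
            "missed": missed, "false_alarms": false_alarms}


report = campaign(runs=1000, bound=0.100, inject_prob=0.1)
print(report)
```

A detector would count as perfect in this simplified setting only if both `missed` and `false_alarms` stay at zero over the whole campaign. Checking the system assumptions themselves would require injecting faults into a real deployment, which this simulation does not cover.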