FEATURES OF DIGITAL COMPUTER RELIABILITY ASSESSMENT

Lecture



In modern automated control systems (ACS) built on digital computers, what matters is not only the failure-free operation of the machines but also programs free of hidden errors. At present, program quality tends to decline and the number of errors in programs tends to grow.

Modern methods of developing and testing programs do not guarantee optimal programs, even though ways of improving them are well known. In practice, a developer can rarely compare several candidate solutions, because a program can often be checked only after its parts have been combined, at which point changes cost considerable time and money. In addition, previously written program blocks are often reused, which further complicates optimization. Not all blocks are programmed with equal care and detail, and consistency of style between blocks is often lost. This is usually discovered too late.

A program error can be defined as a discrepancy between the given program and some “ideal” program. However, if an “ideal” program existed, there would be no problem. Therefore, to apply the mathematical apparatus of reliability theory, one considers program failures instead: events in which the program begins to operate incorrectly or terminates. After a failure occurs, programmers examine the program to locate the errors and correct the program.

Information about errors and their correction is recorded in special notices. An error is considered corrected if it is not detected on retest and an addendum to the error notice has been issued. The time from the issue of an error notice to the issue of its addendum is called a debug cycle.

By complexity, programs can be divided into several types.

Standard programs for calculating elementary functions are at most a few hundred instructions long. These programs are checked very carefully, yet errors are still occasionally found in them, usually for specific values of the argument. Checking standard programs is straightforward.

More complex are translators, which convert algorithms written in a programming language into a sequence of machine instructions. Translators contain 10,000 to 50,000 instructions. A full check of a translator is usually impossible, so errors continue to be detected during operation.

The most complex are real-time control programs running on multiprocessor computers; they contain hundreds of thousands of instructions. A full check of such programs during debugging is impossible, and their functioning can be fully evaluated only in actual use. Program errors usually show up only under certain input signals, which in this case act as the program's operating conditions.

Over the whole set of possible input signal values, program errors can be treated as random.
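The point can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the block `control_step`, the triggering value 137, and the uniform distribution of input signals are assumptions made for illustration, not part of the lecture. The defect itself is deterministic, but because the input signal is random, the failure becomes a random event with a probability that can be estimated.

```python
import random


def control_step(signal: int) -> int:
    """Hypothetical program block with a hidden defect for one input value."""
    if signal == 137:  # assumed defect: only this input triggers the failure
        raise ZeroDivisionError("hidden defect reached")
    return signal * 2


def estimate_failure_probability(runs: int, seed: int = 0) -> float:
    """Feed uniformly random signals in 0..999 and return the share of failed runs."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        try:
            control_step(rng.randrange(1000))
        except ZeroDivisionError:
            failures += 1
    return failures / runs


if __name__ == "__main__":
    # One of 1000 equally likely inputs fails, so the estimate should be near 0.001.
    print(estimate_failure_probability(200_000))
```

Under these assumptions the estimate settles near 1/1000, the share of inputs that reach the defect.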

4.1. Features of program reliability assessment
The operating history of a program is a sequence of alternating intervals: running times T (i), measured from a restoration to the next failure of the program, and recovery times, measured from a failure to the restoration, i.e., to the introduction of corrections into the program.

A similar model is used in assessing the reliability of repairable products, where all the random variables T (i) are assumed to be identically distributed. In that case, the mathematical models of renewal theory apply.
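For the repairable-product case, the alternating process can be sketched in a few lines of Python. The exponential distributions, the mean values, and the function names are assumptions chosen for illustration; the lecture does not specify them. With identically distributed intervals, the long-run share of time spent running approaches mean_up / (mean_up + mean_down), a standard result of renewal theory, and the simulation can be compared against it.

```python
import random


def simulate_availability(mean_up: float, mean_down: float,
                          cycles: int, seed: int = 0) -> float:
    """Simulate alternating running/recovery intervals; return the share of running time."""
    rng = random.Random(seed)
    # Assumption: both interval types are exponentially distributed.
    up = sum(rng.expovariate(1.0 / mean_up) for _ in range(cycles))
    down = sum(rng.expovariate(1.0 / mean_down) for _ in range(cycles))
    return up / (up + down)


def steady_state_availability(mean_up: float, mean_down: float) -> float:
    """Long-run share of running time for identically distributed intervals."""
    return mean_up / (mean_up + mean_down)


if __name__ == "__main__":
    print(simulate_availability(100.0, 5.0, 20_000))
    print(steady_state_availability(100.0, 5.0))
```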

Applying these models directly to assess program reliability is inappropriate because of several features of the random process of program operation.

First, the times between failures T (i) tend to increase over time. As errors are identified and eliminated, the total number remaining in the program decreases, so program failures become increasingly rare.
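One simple way to see this trend is to assume that the failure intensity is proportional to the number of errors still in the program, and that each failure leads to the removal of exactly one error. This is the assumption behind the Jelinski-Moranda model; it is brought in here only as an illustration and is not stated in the lecture. The parameter names `initial_errors` and `per_error_rate` are hypothetical.

```python
def expected_times_between_failures(initial_errors: int,
                                    per_error_rate: float) -> list[float]:
    """Expected T (i) for i = 1..initial_errors under the proportional-intensity assumption."""
    times = []
    for i in range(1, initial_errors + 1):
        remaining = initial_errors - (i - 1)  # errors still present before the i-th failure
        times.append(1.0 / (per_error_rate * remaining))
    return times


if __name__ == "__main__":
    # With 5 initial errors, the expected interval grows as errors are removed.
    for i, t in enumerate(expected_times_between_failures(5, 0.01), start=1):
        print(i, round(t, 2))
```

Under this assumption the expected intervals are strictly increasing, so they cannot be treated as identically distributed.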

The process of improving control programs can be considered as a process of identifying and eliminating hidden defects.

Recovery times also tend to decrease, as programmers steadily accumulate relevant experience. At the same time, the random vectors of running times T and of recovery times Tv can be assumed to be mutually independent.

Second, large control programs are usually unique. For technical products, estimates of reliability indicators are usually calculated from failure and recovery statistics for many products of the same type, whereas for programs only individual forecasting is possible. During the initial period of operation, a large program usually exists in a single copy; only after the vast majority of errors have been identified and eliminated, that is, once a certain level of reliability has been reached, can it in rare cases be replicated. Therefore, a method for evaluating program reliability should provide for a period of accumulating experimental data, followed by extrapolation of the values of the program reliability indicators. Accumulation period

