Signaling node failures

Detection of signaling node failures in a dual-node configuration is application specific; the signaling subsystem does not monitor host or application status. Recovery scenarios are similar to the failed board recovery scenarios.

Primary signaling node failure

When a primary signaling node fails, an application on the backup node must detect the failure and set the backup board into primary operation with hmiPrimary. During the primary node outage, messages arriving on signaling links terminated on the backup node are queued if possible, waiting for the switchover. If the traffic load is too heavy or failure detection and recovery take too long, the links can be placed in a local processor outage state and the queued messages may be lost.
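Because failure detection is application specific, the mechanism is up to the designer. The following is a minimal sketch of one common approach, a heartbeat watchdog on the backup node; the heartbeat scheme, the miss limit, and `promote_to_primary` (a stand-in for the real hmiPrimary call) are all assumptions for illustration, not part of the signaling subsystem.

```python
# Hypothetical failure detector on the backup node. Assumes the
# application exchanges periodic heartbeats with the primary node;
# promote_to_primary() stands in for the real hmiPrimary call.

HEARTBEAT_MISS_LIMIT = 5  # assumed: missed intervals before declaring failure


def promote_to_primary(state):
    """Stand-in for hmiPrimary: put the backup board into primary operation."""
    state["role"] = "primary"


def watchdog_tick(state, beat_seen):
    """Run once per heartbeat interval.

    beat_seen is True when the primary's heartbeat arrived during the
    interval. Returns True once the primary has been declared failed.
    """
    if beat_seen:
        state["missed"] = 0
        return False
    state["missed"] += 1
    if state["missed"] >= HEARTBEAT_MISS_LIMIT and state["role"] == "backup":
        promote_to_primary(state)
    return state["missed"] >= HEARTBEAT_MISS_LIMIT


state = {"role": "backup", "missed": 0}
# Two intervals with heartbeats, then silence from the primary.
events = [True, True] + [False] * 6
results = [watchdog_tick(state, beat) for beat in events]
```

The shorter the interval and miss limit, the less time messages spend queued on the backup's links, at the cost of a higher risk of a false switchover.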

After the failed signaling node is restored, the signaling board in the failed node is reloaded and placed into backup state. The sequence is the same as the recovery of a failed board except that additional synchronization is required between the applications on the primary and backup nodes to convey changes in circuit status that occurred while the failed node was unavailable. This synchronization is application specific.
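Since this synchronization is application specific, its form is not dictated by the signaling subsystem. One possible shape, sketched below under the assumption that each application keeps a per-circuit state table, is for the primary side to replay every circuit whose state changed while the failed node was unavailable; the function names and state values are hypothetical.

```python
# Hypothetical resynchronization step after the failed node returns as
# backup: the primary-side application conveys circuit-status changes
# that occurred during the outage.


def diff_circuit_states(snapshot, current):
    """Return {circuit: new_state} for circuits whose state changed
    (or newly appeared) since the restored node's last known snapshot."""
    return {cic: st for cic, st in current.items() if snapshot.get(cic) != st}


def resync_backup(backup_view, primary_view):
    """Apply the primary's changes to the restored backup's view and
    return the set of changes that were conveyed."""
    changes = diff_circuit_states(backup_view, primary_view)
    backup_view.update(changes)
    return changes


# Backup's view from before its outage vs. the primary's current view.
backup = {1: "idle", 2: "busy", 3: "blocked"}
primary = {1: "busy", 2: "busy", 3: "idle", 4: "idle"}
changed = resync_backup(backup, primary)
```

Sending only the delta rather than the full table keeps the resynchronization traffic proportional to what actually changed during the outage.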

Backup signaling node failure

Recovery of a failed backup signaling node is similar to the recovery of a failed backup board. No disruption of signaling traffic is expected in this case. If there is a total failure (the primary board detects the failure of the backup board), signaling links terminating on the failed board are declared failed until the backup board is restored. Blocking or resetting voice circuits that were terminated on the failed node is up to the application.
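Blocking or resetting the affected voice circuits is left to the application, so the policy below is only one possibility. This sketch assumes the application tracks which node terminates each circuit and chooses to block every circuit on the failed node until it is restored; the data layout and state names are hypothetical.

```python
# Hypothetical application policy for a total backup-node failure:
# block every voice circuit terminated on the failed node.


def circuits_on_node(circuits, node):
    """Return the circuit identifiers terminated on the given node."""
    return [cic for cic, info in circuits.items() if info["node"] == node]


def block_node_circuits(circuits, failed_node):
    """Mark all circuits on the failed node as blocked; return their ids."""
    blocked = circuits_on_node(circuits, failed_node)
    for cic in blocked:
        circuits[cic]["state"] = "blocked"
    return blocked


circuits = {
    1: {"node": "primary", "state": "idle"},
    2: {"node": "backup", "state": "busy"},
    3: {"node": "backup", "state": "idle"},
}
blocked = block_node_circuits(circuits, "backup")
```

An application could equally choose to reset these circuits instead; the point is that the decision, and its reversal when the node is restored, belongs to the application rather than the signaling subsystem.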

If only the backup host processor fails, the signaling links that terminate on the backup board remain operational until the backup node is rebooted.

The application is responsible for the synchronization of circuit states that changed while the backup node was out of service.

If the host fails in a signaling node but the TX board continues working, the TX board could become stranded. While this is acceptable when the backup node fails, some action is necessary when the primary host fails. The stranded primary TX board automatically becomes the backup when it sees its mate board become primary, after receiving no commands from the Health Management service for approximately half a second.