System overview

The Health Management system is a set of hardware and software components that supports SS7 redundancy and the development of distributed, highly available call processing systems that employ SS7 signaling. These systems can detect and recover from signaling link failures, board failures, and node failures without a total service outage. The Health Management architecture facilitates the design of systems whose hardware or software components can be upgraded, or whose call-handling capacity can be increased or decreased.

The core of the architecture is an extended SS7 software capability that allows two TX boards to be paired in a primary and backup arrangement. The boards are connected by a private high-speed Ethernet link over which they exchange heartbeats, signaling messages, and state information.

The TX boards can be spread across two signaling nodes (multiple chassis) or be located in the same signaling node (single chassis). The two boards appear to the rest of the SS7 network as a single signaling point (SP) with a single point code. In a SIGTRAN configuration, the two boards appear as a single point code, but each board has a separate IP address.

The Health Management system can also be used in a non-redundant, single-board configuration, known as a standalone configuration, to monitor and control the board.

SS7 layers

The following table describes how the SS7 layers work in a redundant configuration:

Applicable configurations	SS7 layer	Description
All redundant configurations except SIGTRAN	MTP 2	Active on both boards, allowing all configured signaling links to be active and eliminating the need for provisioning of spare signaling links.
All redundant configurations except SIGTRAN	MTP 3	MTP 3 routing and management functions are operational only on the primary board. Link and route status changes are checkpointed to the backup MTP 3 layer to ensure that it has up-to-date network status information in case of a primary board outage.
All redundant configurations	ISUP	Operates in a primary and backup mode, with all circuit switched connections managed by the active board. Call state information can be checkpointed by the local application to the backup ISUP entity, through extensions to the normal call processing APIs, so that stable calls can be preserved across a signaling board or node outage.
All redundant configurations	TUP	Preserves stable calls with automated checkpoints between the primary and backup tasks. The primary and backup applications must also checkpoint call states to facilitate smooth switchovers.
All redundant configurations	SCCP	Operates along with other SS7 layers in a two-board redundant primary and backup configuration, in addition to the current single-board standalone configuration. The objective of the redundant configuration is to maintain the SCCP service across a failure. The backup SCCP layer can re-synchronize its internal state with the primary SCCP layer in cases where communication with the primary board is lost and then re-established, or when the backup board has been reloaded due to a failure or to routine maintenance.
All redundant configurations	TCAP	Operates in a primary and backup mode. To allow a backup TCAP task to immediately take over service, the primary TCAP task sends checkpoint messages to inform the backup task of changes in various TCAP transactions. Additionally, if the primary and backup tasks become disconnected due to a failure or a reloading of the backup board, the backup task retrieves the current transaction states for all of the transactions on the primary task.

SIGTRAN layers

In a SIGTRAN configuration, the private Ethernet between the mate boards is not used by SCTP or M3UA for data or checkpoint messages. The private Ethernet is required for TXMON heartbeat messages and higher layer checkpointing.

Both the primary and backup boards establish associations with the remote endpoints when the boards start up. The association from the backup board remains in a standby state until the primary board fails or a planned switchover occurs. No data is passed over the association from the backup board until that board becomes the primary.
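The standby behavior described above can be illustrated with a small model. This is a minimal sketch, not the product API: the `Board` class, the `Role` enum, and the `switchover` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


@dataclass
class Board:
    """Hypothetical model of one TX board and its association to a remote endpoint."""
    name: str
    role: Role
    association_up: bool = False
    sent: list = field(default_factory=list)

    def start(self) -> None:
        # Both boards establish their association when they start up.
        self.association_up = True

    def send(self, payload: str) -> bool:
        # Data is passed only by the primary; the backup's association is on standby.
        if not self.association_up or self.role is not Role.PRIMARY:
            return False
        self.sent.append(payload)
        return True


def switchover(old_primary: Board, new_primary: Board) -> None:
    """Swap roles, as on a planned switchover (illustrative only)."""
    old_primary.role = Role.BACKUP
    new_primary.role = Role.PRIMARY


board_a = Board("A", Role.PRIMARY)
board_b = Board("B", Role.BACKUP)
board_a.start()
board_b.start()
board_b.send("ignored")      # returns False: backup association carries no data
switchover(board_a, board_b)
board_b.send("delivered")    # returns True: board B is now the primary
```

In this model the backup's association is already established before the switchover, so the new primary can pass data without first setting up an association.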

Signaling

Operation of the signaling subsystem is under complete control of the local signaling application. The application designates each board as either the primary or backup board after it is downloaded.

During normal operation, applications using SCCP behave normally. They have no checkpointing responsibilities, other than updating a backup host in the dual-chassis arrangement (if necessary). For class 0 connectionless service, best-effort delivery is maintained across switchovers. No state information, other than the accessible or inaccessible status of the remote SP/SSN, is maintained between the primary and backup SCCP layers.

For class 1 connectionless service, SLS values assigned to a sequence are not retained across switchovers. No checkpointing of SLS assignments (SCLI data structures) is required. The backup must, however, avoid re-using frozen segmentation local references (those recently assigned by the primary) for some period after a switchover, so their usage must be synchronized with the backup application.
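One way to picture the frozen-reference rule is an allocator that skips frozen values until a freeze period expires. The sketch below is an assumption-laden illustration: the `SegRefAllocator` class, the freeze duration, and the way the new primary learns which references were recently assigned are all hypothetical, and tracking of references that are currently in use is omitted for brevity.

```python
class SegRefAllocator:
    """Hypothetical allocator for segmentation local references."""

    def __init__(self, max_ref: int, freeze_seconds: float):
        self.max_ref = max_ref
        self.freeze_seconds = freeze_seconds  # assumed freeze period
        self.frozen_until = {}                # reference -> time it may be reused
        self.next_ref = 0

    def freeze(self, refs, now: float) -> None:
        # Called at switchover with references the old primary recently assigned.
        for ref in refs:
            self.frozen_until[ref] = now + self.freeze_seconds

    def allocate(self, now: float) -> int:
        # Visit each reference at most once, skipping any that are still frozen.
        for _ in range(self.max_ref):
            ref = self.next_ref
            self.next_ref = (self.next_ref + 1) % self.max_ref
            if self.frozen_until.get(ref, 0.0) <= now:
                self.frozen_until.pop(ref, None)
                return ref
        raise RuntimeError("all segmentation local references are frozen")


allocator = SegRefAllocator(max_ref=4, freeze_seconds=30.0)
allocator.freeze([0, 1], now=100.0)   # references recently used by the old primary
first = allocator.allocate(now=100.0)  # skips 0 and 1 while they are frozen
```

The clock is passed in explicitly so the behavior can be exercised without waiting for real time to elapse.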

In general, for both classes of connectionless service, messages can be lost on a switchover. Any detection and recovery of lost messages is the responsibility of the application-level protocol running above SCCP.

For both classes of service, segmented messages in the process of being transmitted or received are lost or discarded on a switchover. If the remaining segments of a partially reassembled incoming message that was lost or discarded due to a switchover are received by the (new) primary, they are detected and discarded. If any of these segments has the return option set, it is returned to the sender in an XUDTS message with a return cause of segmentation failed for ITU or error in message transport for ANSI.
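The handling of leftover segments after a switchover can be sketched as follows. This is an illustrative model only: the `Segment` type, the `handle_segment` function, and the use of a bare local reference as the reassembly key are simplifying assumptions, while the two return causes are the ones named in the text above.

```python
from dataclasses import dataclass

# Return causes named in the text above, keyed by protocol variant.
RETURN_CAUSE = {
    "ITU": "segmentation failed",
    "ANSI": "error in message transport",
}


@dataclass
class Segment:
    """Hypothetical view of one incoming segment."""
    local_ref: int       # segmentation local reference
    first: bool          # True for the first segment of a message
    return_option: bool  # True if the sender asked for return on error


def handle_segment(seg: Segment, reassembly: set, variant: str):
    """Return (action, cause) for one segment received by the new primary."""
    if seg.first:
        # A first segment starts a new reassembly on this board.
        reassembly.add(seg.local_ref)
        return ("accept", None)
    if seg.local_ref in reassembly:
        return ("accept", None)
    # Remaining segment of a message whose reassembly state was lost at switchover.
    if seg.return_option:
        return ("xudts", RETURN_CAUSE[variant])
    return ("discard", None)


# After a switchover the new primary has no reassembly state.
state = set()
result = handle_segment(Segment(local_ref=7, first=False, return_option=True), state, "ITU")
```

Here `result` is `("xudts", "segmentation failed")`, because the segment belongs to a message the new primary never started reassembling and the return option is set.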

During normal operation, applications using TCAP behave normally. The primary TCAP task checkpoints transaction information, and the checkpointing behavior is configurable. An application can configure each user SAP to checkpoint all transactions by default, checkpoint only those initiated by the application, or checkpoint no transactions. An application can override the default checkpoint action by checkpointing transactions on an individual basis.

A transaction can be checkpointed at any time during the transaction lifetime. For example, after a begin message is received, the application sends a continue message and specifies that the transaction must be checkpointed. Although the begin message was not checkpointed, the transaction is checkpointed as the continue message is sent. The TCAP task keeps track of which transactions are checkpointed and deletes the checkpoints as the transactions are closed.
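The interaction between the per-SAP default and the per-transaction override can be modeled roughly as below. This is a sketch under stated assumptions: `Policy`, `CheckpointTracker`, and its method names are hypothetical and do not correspond to the actual TCAP API, and the model assumes a checkpointed transaction stays checkpointed until it is closed.

```python
from enum import Enum
from typing import Optional


class Policy(Enum):
    """Default checkpoint action configured per user SAP (hypothetical names)."""
    ALL = "all"
    APP_INITIATED = "app_initiated"
    NONE = "none"


class CheckpointTracker:
    """Tracks which transactions are currently checkpointed."""

    def __init__(self, default: Policy):
        self.default = default
        self.checkpointed = set()

    def on_message(self, trans_id: int, app_initiated: bool,
                   override: Optional[bool] = None) -> bool:
        # An explicit per-transaction request takes precedence over the SAP default.
        if override is None:
            want = (self.default is Policy.ALL
                    or (self.default is Policy.APP_INITIATED and app_initiated))
        else:
            want = override
        if want:
            self.checkpointed.add(trans_id)
        # Assumption: once checkpointed, a transaction remains so until closed.
        return trans_id in self.checkpointed

    def on_close(self, trans_id: int) -> None:
        # The checkpoint is deleted when the transaction closes.
        self.checkpointed.discard(trans_id)


# Mirrors the example above: begin is not checkpointed, continue is.
tracker = CheckpointTracker(Policy.NONE)
after_begin = tracker.on_message(1, app_initiated=False)
after_continue = tracker.on_message(1, app_initiated=False, override=True)
tracker.on_close(1)
```

With a default of `Policy.NONE`, `after_begin` is `False` and `after_continue` is `True`, and the checkpoint is removed once the transaction closes.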

If using ISUP, the application must checkpoint call status changes to the ISUP layer on the backup board, as necessary to preserve stable calls. Upon detection of a failure of the primary signaling board (through the Health Management system) or failure of the primary application or signaling node (through application-specific means), the application directs the backup signaling board to become the primary board and take over signaling operations. When a failed signaling board is restored to service as the backup, the application can re-synchronize it with the primary board by checkpointing the state of each circuit through the call processing extensions.
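The ISUP responsibilities described above (checkpoint call status changes, promote the backup on failure, re-synchronize a restored board) can be sketched from the application's point of view. Everything here is hypothetical: `SignalingBoard`, `Application`, and their methods stand in for the real call processing extensions, and for simplicity this model checkpoints every state change rather than only those needed to preserve stable calls.

```python
class SignalingBoard:
    """Hypothetical stand-in for the ISUP layer on one TX board."""

    def __init__(self, name: str):
        self.name = name
        self.role = "backup"
        self.circuits = {}  # circuit id -> call state known to this board


class Application:
    """Hypothetical application that owns the authoritative call states."""

    def __init__(self, primary: SignalingBoard, backup: SignalingBoard):
        primary.role = "primary"
        backup.role = "backup"
        self.primary = primary
        self.backup = backup
        self.calls = {}

    def call_state_changed(self, circuit: int, state: str) -> None:
        self.calls[circuit] = state
        self.primary.circuits[circuit] = state
        if self.backup is not None:
            # Checkpoint the change to the ISUP layer on the backup board.
            self.backup.circuits[circuit] = state

    def primary_failed(self) -> SignalingBoard:
        # Direct the backup board to become the primary and take over signaling.
        failed = self.primary
        self.backup.role = "primary"
        self.primary, self.backup = self.backup, None
        return failed

    def restore_as_backup(self, board: SignalingBoard) -> None:
        # A restored board starts empty; checkpoint the state of each circuit to it.
        board.role = "backup"
        board.circuits = dict(self.calls)
        self.backup = board


board_a, board_b = SignalingBoard("A"), SignalingBoard("B")
app = Application(primary=board_a, backup=board_b)
app.call_state_changed(1, "answered")
failed_board = app.primary_failed()      # board B takes over
app.call_state_changed(2, "answered")    # handled while no backup is present
app.restore_as_backup(failed_board)      # board A returns, fully re-synchronized
```

After the final step, board A holds the state of both circuits even though circuit 2 changed while it was out of service.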

If using TUP, call states are synchronized automatically between the two TX boards. The primary and backup applications must likewise keep their own call states synchronized with each other.

System requirements

Health Management is supported on all TX boards.