Service Engine Failure Detection

Overview

Failure detection is essential to achieving Service Engine (SE) high availability. Avi Vantage relies on several methods to detect Service Engine failures, as described in the following sections.

Controller-to-SE Failure Detection Method

In all deployments, the Avi Controller sends heartbeat messages to all Service Engines in all groups under its control, once every 10 seconds. If a specific SE does not respond to six consecutive heartbeat messages, the Controller concludes that the SE is DOWN and moves all of that SE's virtual services to new SEs.
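
The consecutive-miss rule above can be sketched as follows. This is only an illustration of the counting logic, not Avi Controller code: the function and constant names are hypothetical, and the sketch assumes that any successful response resets the miss counter.

```python
from typing import Iterable, Optional

HEARTBEAT_INTERVAL_S = 10  # from the text: one heartbeat every 10 seconds
MISS_LIMIT = 6             # from the text: six consecutive missed heartbeats


def heartbeat_declaring_down(responses: Iterable[bool]) -> Optional[int]:
    """Return the 1-based heartbeat number at which the SE is declared DOWN.

    `responses` holds one boolean per heartbeat (True means the SE answered).
    Returns None if the miss limit is never reached.
    Assumption: a single successful response resets the miss counter.
    """
    consecutive_misses = 0
    for number, responded in enumerate(responses, start=1):
        consecutive_misses = 0 if responded else consecutive_misses + 1
        if consecutive_misses >= MISS_LIMIT:
            return number
    return None


# Six straight misses at a 10-second interval span roughly 60 seconds.
APPROX_DETECTION_S = MISS_LIMIT * HEARTBEAT_INTERVAL_S
```

Under these assumptions, an SE that stops responding entirely would be declared DOWN roughly 60 seconds after its last successful heartbeat.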

When vSphere High Availability is enabled and the Controller detects that a vSphere host failure has occurred, the SEs on that host transition to OPER_PARTITIONED or OPER_DOWN before six consecutive heartbeats have been missed:

- SEs (on the failed host) which have operational virtual services will transition to OPER_PARTITIONED state.
- SEs (on the failed host) which do not have any operational virtual services will transition to OPER_DOWN state.
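
The two cases above reduce to a single check on whether the SE has operational virtual services. A minimal sketch, using a hypothetical helper name (the state names are taken from the text):

```python
def state_after_host_failure(operational_virtual_services: int) -> str:
    """Return the SE state after its vSphere host is detected as failed.

    Hypothetical helper for illustration only; it mirrors the two rules
    stated in the text and is not part of any Avi API.
    """
    if operational_virtual_services > 0:
        return "OPER_PARTITIONED"
    return "OPER_DOWN"
```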

SE-to-SE Failure Detection Method

In the above-mentioned Controller-to-SE failure detection method, the Controller detects a Service Engine failure by sending periodic heartbeat messages over the management interface. However, this method will not detect datapath failures for the data interfaces on SEs.

To cover datapath failures as well, the Service Engine datapath heartbeat mechanism was devised, in which the Service Engines send periodic heartbeat messages to each other over the data interfaces.

By default, this communication is set to standard mode. It can also be configured for the aggressive mode, as discussed in the Enabling Aggressive Mode using CLI section.

Service Engine Datapath Communication Modes

Depending on the Service Engine deployment, three modes are available for SE-to-SE inter-process communication, as described below.

1. Custom EtherTypes
This is the default mode applicable when the Service Engines are in the same subnet. The EtherTypes used are:

ETHERTYPE_AVI_IPC 0XA1C0
ETHERTYPE_AVI_MACINMAC 0XA1C1
ETHERTYPE_AVI_MACINMAC_TXONLY 0XA1C2

2. IP Encapsulation
This mode is applicable when the infrastructure does not permit EtherTypes through. Even in this mode, it is assumed that the Service Engines are in the same subnet. This mode is applicable for AWS by default.

Configure IP encapsulation by using the se_ip_encap_ipc X command.

The following example shows how to configure IP encapsulation using the CLI:


    #shell
    Login: admin
    Password: 
    [GB-slough-cam:cd-avi-cntrl1]: > configure serviceengineproperties
    [GB-slough-cam:cd-avi-cntrl1]: seproperties> se_bootup_properties
    [GB-slough-cam:cd-avi-cntrl1]: seproperties:se_bootup_properties> se_ip_encap_ipc 1
    [GB-slough-cam:cd-avi-cntrl1]: seproperties:se_bootup_properties> save
    [GB-slough-cam:cd-avi-cntrl1]: seproperties:> save
    [GB-slough-cam:cd-avi-cntrl1]: > reboot serviceengine <IP 1>
    [GB-slough-cam:cd-avi-cntrl1]: > reboot serviceengine <IP 2>

Note: For a change to se_ip_encap_ipc to take effect, reboot all Service Engines in the Service Engine group.

The IP protocols used in this mode are:

IPPROTO_AVI_IPC 73
IPPROTO_AVI_MACINMAC 97
IPPROTO_AVI_MACINMAC_TX 63

3. IP packets
This mode is applicable when the Service Engines are in different subnets. The IP packet destined to the destination Service Engine’s interface IP is sent to the next hop router. The IP protocols used in this mode are:

IPPROTO_AVI_IPC_L3 75
IPPROTO_AVI_MACINMAC 97
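
The relationship between the three modes can be summarized in a short sketch. The EtherType and IP protocol values are copied from the lists above; the function name, the mode labels, and the `ethertypes_permitted` flag are hypothetical. In a real deployment IP encapsulation is enabled explicitly through se_ip_encap_ipc rather than being detected automatically, so this only illustrates the decision described in the text.

```python
import ipaddress

# Identifiers and values as listed in this section.
ETHERTYPES = {
    "ETHERTYPE_AVI_IPC": 0xA1C0,
    "ETHERTYPE_AVI_MACINMAC": 0xA1C1,
    "ETHERTYPE_AVI_MACINMAC_TXONLY": 0xA1C2,
}
IP_ENCAP_PROTOCOLS = {
    "IPPROTO_AVI_IPC": 73,
    "IPPROTO_AVI_MACINMAC": 97,
    "IPPROTO_AVI_MACINMAC_TX": 63,
}
L3_PROTOCOLS = {
    "IPPROTO_AVI_IPC_L3": 75,
    "IPPROTO_AVI_MACINMAC": 97,
}


def select_mode(local_if: str, peer_if: str, ethertypes_permitted: bool = True) -> str:
    """Pick a communication mode for two SE data interfaces.

    `local_if` and `peer_if` are interface addresses in CIDR form,
    for example "10.0.1.5/24". The returned labels are illustrative.
    """
    local = ipaddress.ip_interface(local_if)
    peer = ipaddress.ip_interface(peer_if)
    if peer.ip not in local.network:
        # Different subnets: packets are routed through the next-hop router.
        return "ip_packets"
    # Same subnet: custom EtherTypes unless the infrastructure blocks them.
    return "custom_ethertypes" if ethertypes_permitted else "ip_encapsulation"
```

For example, `select_mode("10.0.1.5/24", "10.0.2.9/24")` returns `"ip_packets"` because the peer address lies outside the local subnet.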

BGP-Router-to-SE Failure Detection Method

With BGP configured, SE-to-SE failure detection is augmented as described below:

- Bidirectional Forwarding Detection (BFD) detects SE failures and prompts the router not to use the route to the failed SE for flow load balancing.
- Routers also detect SE failures using BGP protocol timers.

Failure Detection Algorithm

Consider an SE group on which a virtual service has been scaled out. Failure detection proceeds in the following sequence:

1. The virtual service's primary SE sends periodic heartbeat messages to all of the virtual service's secondary SEs.
2. If a secondary SE repeatedly fails to respond, the primary SE suspects that this SE may be down.
3. The primary SE sends a notification to the Avi Controller indicating a possible SE failure.
4. The Avi Controller sends a sequence of echo messages to confirm whether the suspected Service Engine is indeed down.
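
The two-stage sequence above can be sketched as a small classifier. This is an illustration under stated assumptions, not the actual implementation: the function names and the returned labels are hypothetical, the default limits are the standard-mode values given later in this section, and the text does not say what happens when the echo check succeeds, so that outcome is simply reported as unconfirmed.

```python
from typing import Iterable


def miss_limit_reached(responses: Iterable[bool], limit: int) -> bool:
    """True if `responses` contains `limit` consecutive misses (False values)."""
    run = 0
    for responded in responses:
        run = 0 if responded else run + 1
        if run >= limit:
            return True
    return False


def classify_secondary(
    hb_responses: Iterable[bool],
    echo_responses: Iterable[bool],
    hb_miss_limit: int = 10,    # standard mode, from this section
    echo_miss_limit: int = 4,   # standard mode, from this section
) -> str:
    """Classify a secondary SE using the two-stage check.

    Stage 1: the primary SE counts consecutive missed datapath heartbeats.
    Stage 2: once the primary suspects a failure, the Controller counts
    consecutive missed echo messages before declaring the SE down.
    """
    if not miss_limit_reached(hb_responses, hb_miss_limit):
        return "UP"
    if miss_limit_reached(echo_responses, echo_miss_limit):
        return "DOWN"
    # Hypothetical label: suspected by the primary, not confirmed by the Controller.
    return "SUSPECT_NOT_CONFIRMED"
```

Splitting detection into two stages means a datapath problem seen only by the primary SE does not by itself cause the Controller to declare the secondary down.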

Based on the time frame and frequency of heartbeat messages sent across the Service Engines, the modes of operation are standard and aggressive. The algorithm for both modes is the same, with a difference in frequency and time frame, as explained below:

The primary SE sends heartbeat messages to the secondary SE at a configurable interval, for example, 100 milliseconds. A string of consecutive failures to respond indicates that the given SE could be down. According to the settings shown in the second column of the table below, the primary SE will suspect a secondary SE to be down if,
- 10 consecutive heartbeat messages fail over a period of 1 sec (standard), or
- 10 consecutive heartbeat messages fail over a period of 1 sec (aggressive). This stage can be made more aggressive using the configuration parameters shown later in this section.
As soon as the primary SE suspects that the secondary SE is down, it notifies the Avi Controller, which then sends echo messages to the suspect. According to the settings shown in the third column of the table below, the Controller will declare the suspect down after,
- 4 consecutive echo messages fail over a period of 8 sec (standard), or
- 2 consecutive echo messages fail over a period of 4 sec (aggressive).

Summing the values in the second and third columns of the table below, the Controller reaches a failure conclusion within 9 seconds under standard settings, and within 5 seconds under aggressive settings.

The time taken to detect Service Engine failure based on SE-DP heartbeat failure is as follows:

Detection Mode	SE-SE HB Messaging	Controller-SE Echo Messages	Total Time for Failure Detection
Standard Mode	HB-Period: 100 ms; 10 consecutive failures	Echo-Period: 2 seconds; 4 consecutive failures	1 + 8 = 9 seconds
Aggressive Mode	HB-Period: 100 ms; 10 consecutive failures	Echo-Period: 2 seconds; 2 consecutive failures	1 + 4 = 5 seconds
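
The totals in the table follow from multiplying each period by its failure count and adding the two stages. A minimal sketch of that arithmetic, with a hypothetical class name and field names (the values are the ones in the table):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionSettings:
    """Hypothetical container for the four values in one table row."""
    hb_period_ms: int
    hb_miss_limit: int
    echo_period_ms: int
    echo_miss_limit: int

    @property
    def suspect_ms(self) -> int:
        # Time for the primary SE to suspect a secondary SE.
        return self.hb_period_ms * self.hb_miss_limit

    @property
    def confirm_ms(self) -> int:
        # Time for the Controller to confirm the failure with echo messages.
        return self.echo_period_ms * self.echo_miss_limit

    @property
    def total_ms(self) -> int:
        return self.suspect_ms + self.confirm_ms


STANDARD = DetectionSettings(100, 10, 2000, 4)
AGGRESSIVE = DetectionSettings(100, 10, 2000, 2)

print(STANDARD.total_ms / 1000)    # 9.0
print(AGGRESSIVE.total_ms / 1000)  # 5.0
```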

Failure detection can be made as aggressive as 2 seconds with the following configuration. However, this is recommended only in bare-metal environments; in virtualized environments it can lead to false positives.

The following serviceengineproperties settings define the aggressive timeout values:


configure serviceengineproperties 
se_runtime_properties
|   dp_aggressive_hb_frequency                    | 100 milliseconds                |
|   dp_aggressive_hb_timeout_count                | 5                              |
se_agent_properties
|   controller_echo_rpc_aggressive_timeout        | 500 milliseconds               |
|   controller_echo_miss_aggressive_limit         | 3                               |
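
The 2-second figure is consistent with the values above if the same two-stage sum is applied. The sketch below assumes that the heartbeat stage takes frequency times timeout count and that each missed echo costs one full RPC timeout; the function name is hypothetical, and the property names and values are taken from the configuration shown.

```python
# Values from the serviceengineproperties listing, in milliseconds or counts.
AGGRESSIVE_TUNING = {
    "dp_aggressive_hb_frequency": 100,
    "dp_aggressive_hb_timeout_count": 5,
    "controller_echo_rpc_aggressive_timeout": 500,
    "controller_echo_miss_aggressive_limit": 3,
}


def tuned_detection_time_ms(props: dict) -> int:
    """Estimate total detection time for the tuned aggressive settings.

    Assumption: total = heartbeat stage + echo stage, as in the
    standard and aggressive rows of the table in this section.
    """
    hb_stage = props["dp_aggressive_hb_frequency"] * props["dp_aggressive_hb_timeout_count"]
    echo_stage = (
        props["controller_echo_rpc_aggressive_timeout"]
        * props["controller_echo_miss_aggressive_limit"]
    )
    return hb_stage + echo_stage


print(tuned_detection_time_ms(AGGRESSIVE_TUNING))  # 2000
```

Under these assumptions the heartbeat stage contributes 500 ms and the echo stage 1500 ms, for a total of 2 seconds.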

Enabling Aggressive Mode using CLI

Starting with Avi Vantage release 17.1.2, Service Engine failure detection can be set to aggressive mode. This setting is available only through the CLI, as shown below.


    [admin:1-Controller-2]: > configure serviceenginegroup AA-SE-Group
    [admin:1-Controller-2]: serviceenginegroup> aggressive_failure_detection
    [admin:1-Controller-2]: serviceenginegroup> save

Verify the settings using the following show command:



    [admin:1-Controller-2]: > show serviceenginegroup AA-SE-Group  | grep aggressive

    | aggressive_failure_detection   | True