Linux NIC Failover

So I am designing this highly available network (99.999% uptime).

The application runs on a bunch of clusters that handle application failover and load balancing. The cluster members are spread over redundant switches, which connect to redundant firewalls, which connect to redundant edge routers, which connect to the colo provider over multiple links for failover.

Still with me there? :)

This network- and application-level failover works. But we wanted to take it up a notch, so we introduced NIC failover on individual machines.

Here is how to do it on a CentOS machine -

/etc/sysconfig/network
NETWORKING=yes
HOSTNAME=bond-james-bond.vsharma.net
#VLAN=yes
GATEWAY=xxx.xxx.xxx.xxx
GATEWAYDEV=bond0

/etc/modprobe.conf
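alias bond0 bonding
# miimon=100: check link state every 100 ms
# mode=1: active-backup (one NIC carries traffic, the other stands by)
options bonding miimon=100 mode=1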

/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=static
TYPE=Ethernet
ONBOOT=yes
SLAVE=yes
MASTER=bond0

/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
BOOTPROTO=static
TYPE=Ethernet
ONBOOT=yes
SLAVE=yes
MASTER=bond0

/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
IPADDR=xyz.xyz.xyz.xyz
NETMASK=bla.bla.foo.foo

That's it! Now connect the two NICs to ports on the same VLAN on two switches, one NIC per switch, and make sure those switches have a trunk running between them.
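
Once the bond is up, a quick way to see which NIC is carrying traffic is to read the status file the bonding driver exposes under /proc/net/bonding/. Here is a minimal sketch, assuming the bond is named bond0 as in the config above. The function name active_slave is just something I made up for illustration.

```shell
#!/bin/sh
# active_slave [status-file]
# Print the NIC that is currently active in the bond.
# The status file defaults to /proc/net/bonding/bond0 (assumes the bond is bond0).
active_slave() {
    status_file="${1:-/proc/net/bonding/bond0}"
    # In active-backup mode the driver reports a line like:
    #   Currently Active Slave: eth0
    # Print only the interface name from that line.
    sed -n 's/^Currently Active Slave: *//p' "$status_file"
}
```

Call active_slave before and after pulling a cable and the name should flip from one NIC to the other. Running cat on the same file shows the full picture, including the link status of each slave.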

Here is something from the logs when you pull the active cable out -

Oct 25 04:52:00 icecream kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 25 04:52:00 icecream kernel: bonding: bond0: link status definitely up for interface eth0.
Oct 25 04:53:37 icecream kernel: bnx2: eth1 NIC Copper Link is Down
Oct 25 04:53:37 icecream kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
Oct 25 04:53:37 icecream kernel: bonding: bond0: making interface eth0 the new active one.
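
If you want to pull just the bonding messages out of a busy log, something like this should do. It is a rough sketch that assumes kernel messages land in /var/log/messages (the usual place on CentOS) and that the lines look like the ones above. bond_events is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# bond_events [bond] [logfile]
# Print the bonding driver's log lines for one bond.
# Defaults: bond0 and /var/log/messages (both are assumptions, adjust as needed).
bond_events() {
    bond="${1:-bond0}"
    log_file="${2:-/var/log/messages}"
    # Bonding driver lines look like "... kernel: bonding: bond0: <event>".
    grep "bonding: ${bond}:" "$log_file"
}
```

Running bond_events with no arguments right after a cable pull should show the "definitely down" line followed by the "new active one" line.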

Sub-second switchover (with miimon=100 the driver checks link state every 100 ms). I was continuously pinging the machine and the switchover did not drop a single ping. It might drop a few packets under heavy load, but this is not bad at all.
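
To put a number on the ping test instead of eyeballing it, you can pull the loss percentage out of ping's summary line. This is only a sketch: it assumes the Linux iputils ping, whose summary contains "N% packet loss", and ping_loss is a made-up helper name.

```shell
#!/bin/sh
# ping_loss
# Read ping output on stdin and print the packet loss percentage (digits only).
# Assumes a summary line containing "<number>% packet loss", as iputils ping prints.
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example, start ping -c 100 against the bonded machine from another host, pipe it into ping_loss, and pull the active cable while it runs. xxx.xxx.xxx.xxx style placeholders aside, the address to use is whatever you set as IPADDR in ifcfg-bond0.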

Sweet!