Notes on the CheckPoint firewall clustering solution based on a review of the documentation in August 2014.
- In March 2011, I wrote about the fundamental failure of CheckPoint clustering – Checkpoint/Nokia Firewall Clustering. Uh Oh. – EtherealMind
- a customer asked me to review their planned deployment, and I performed this paper review based on the ClusterXL R77 Versions Administration Guide, 28 July 2014.
- only customers with a support contract can access the documentation. As an independent consultant who helps customers when resellers and ‘approved’ partners have failed, I do not have access to other documentation.
- CheckPoint should note that customers do not always trust their reseller or vendor to get things right. Nor is the vendor or reseller able to solve all problems; independent consultants are very common in Europe.
- There seem to be two High Availability modes for the CheckPoint firewall, both of which are called ClusterXL.
A High Availability Security Cluster ensures Security Gateway and VPN connection redundancy by providing transparent failover to a backup Security Gateway in the event of failure.
- I would call that HA or Active/Passive.
A Load Sharing Security Cluster provides reliability and also increases performance, as all members are active.
- and active/active/active etc
ClusterXL uses unique physical IP and MAC addresses for the ClusterXL members and virtual IP addresses to represent the ClusterXL itself. Virtual IP addresses do not belong to an actual machine interface.
ClusterXL provides an infrastructure that ensures that data is not lost due to a failure, by ensuring that each ClusterXL member is aware of connections passing through the other members. Passing information about connections and other Security Gateway states between the ClusterXL members is known as State Synchronization.
- Synchronisation between devices is performed using the Cluster Control Protocol (CCP, UDP port 8116), which bypasses configured firewall rules by default. CCP is an L2-only protocol.
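The port detail above is enough to spot sync traffic in a capture. A minimal sketch of a classifier, assuming only the quoted UDP port 8116; the function name and packet fields are mine, not any CheckPoint API:

```python
# Illustrative only: classify a decoded flow as CheckPoint Cluster Control
# Protocol (CCP) sync traffic based solely on the documented port, UDP 8116.
CCP_PORT = 8116

def is_ccp_sync(protocol: str, src_port: int, dst_port: int) -> bool:
    """Return True if the flow looks like CCP state-sync traffic."""
    return protocol == "udp" and CCP_PORT in (src_port, dst_port)

print(is_ccp_sync("udp", 8116, 8116))   # True
print(is_ccp_sync("tcp", 443, 51234))   # False
```

Useful when checking whether sync traffic is leaking onto the wrong VLAN.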
- each cluster member must have a Security Gateway and an Acceleration and Clustering Blade (CPSB-ACCL or CPSB-ADNC) license
- NTP clock synchronisation is critical
The synchronization network must guarantee no more than 100ms latency and no more than 5% packet loss
The synchronization network may only include switches and hubs. No routers are allowed on the synchronization network, because routers drop Cluster Control Protocol packets.
- The synchronisation network requires an elevated security profile and an isolated network, which suggests that CCP is an insecure protocol.
- Recommends using an isolated switch – this is stupidly impractical; who wants to waste money and power on a dedicated switch in the DMZ?
- Using a crossover cable is equally stupid and impractical. What are they thinking?
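The quoted latency and loss limits are easy to sanity-check against measured values when validating a sync network. A minimal sketch; the constant and function names are mine:

```python
# Check measured sync-network quality against the documented limits:
# no more than 100 ms latency and no more than 5% packet loss.
MAX_LATENCY_MS = 100.0
MAX_LOSS_PCT = 5.0

def sync_network_ok(latency_ms: float, loss_pct: float) -> bool:
    """True if measurements satisfy both documented requirements."""
    return latency_ms <= MAX_LATENCY_MS and loss_pct <= MAX_LOSS_PCT

print(sync_network_ok(12.0, 0.1))    # True  – healthy LAN
print(sync_network_ok(150.0, 0.0))   # False – latency too high
```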
- Appears that sync is ONLY supported on the lowest VLAN ID on an interface.
In Cluster XL, the synchronization network is supported on the lowest VLAN tag of a VLAN interface. For example, if three VLANs with tags 10, 20 and 30 are configured on interface eth1, interface eth1.10 may be used for synchronization.
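The lowest-tag rule quoted above reduces to a one-liner. A sketch for picking the eligible subinterface; the helper name is mine and this is not CheckPoint CLI:

```python
# Per the documentation, sync is only supported on the lowest VLAN tag
# configured on an interface. Given the tags on an interface, return the
# only subinterface eligible to carry synchronisation traffic.
def sync_subinterface(interface: str, vlan_tags: list[int]) -> str:
    return f"{interface}.{min(vlan_tags)}"

print(sync_subinterface("eth1", [10, 20, 30]))  # eth1.10
```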
- by default, all connection state is synchronised across the cluster.
- you can decide not to synchronise. Why? Does this imply a performance issue in the devices or network? Why would I not sync all state?
- “Synchronization incurs a performance cost” – implies that CheckPoint performance remains a problem for many customers.
“A significant amount of traffic crosses the cluster through a particular service. Not synchronizing the service reduces the amount of synchronization traffic and so enhances cluster performance.”
- erm, what ?
The service usually opens short connections, whose loss may not be noticed. DNS (over UDP) and HTTP are typically responsible for most connections and frequently have short life and inherent recoverability in the application level. Services which typically open long connections, such as FTP, should always be synchronized.
- Ok, that makes sense.
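The selection logic above reduces to: exclude known short-lived, recoverable services from sync; keep everything else synchronised. A sketch using an illustrative service table of my own; the names are not CheckPoint identifiers:

```python
# Per the doc's guidance: short-lived, recoverable services (DNS over UDP,
# HTTP) are candidates for exclusion from sync; long-lived services such as
# FTP should always be synchronised. The service set below is illustrative.
SHORT_LIVED_RECOVERABLE = {"dns-udp", "http"}

def should_synchronise(service: str) -> bool:
    """Synchronise by default; exclude only known short-lived services."""
    return service not in SHORT_LIVED_RECOVERABLE

print(should_synchronise("ftp"))      # True  – long connections, keep sync
print(should_synchronise("dns-udp"))  # False – short-lived, recoverable
```

Defaulting unknown services to "synchronise" is the safe choice, since losing a long-lived connection on failover is the visible failure mode.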
- Observation: CheckPoint firewalls are punishingly expensive to buy and operate. Customers are therefore pushed to waste time and project resources making complex configuration changes to reduce device load.
Synchronized Cluster Restrictions
The following restrictions apply when you synchronize cluster members:
The use of more than one synchronization interface for redundancy is not supported. You can use Link Aggregation (“Sync Redundancy” on page 86) for synchronization interface redundancy. Synchronization interface redundancy is not supported for VRRP clusters.
- that is very restrictive. The sync interface is critical but doesn’t support redundancy by default. With redundancy configured using Link Aggregation, common features are lost. Stupid.
All cluster members must run on identically configured platforms.
All cluster members must use the same Check Point software version.
If a cluster member goes down, user-authenticated connections through that member are lost. Other cluster members cannot restore the connection. Client-authenticated or session-authenticated connections are maintained.
The reason for these restrictions is that the user authentication state is maintained by a process on the Security Gateway. It cannot be synchronized on members the same way that kernel data is synchronized. However, the states of session authentication and client authentication are saved in kernel tables, and can be synchronized.
The connection states that use system resources cannot be synchronized for the same reason that user-authenticated connections cannot be synchronized.
Accounting information is accumulated on each cluster member and sent to the Security Management Server and aggregated. In the event of a failover, accounting information not yet sent to the Security Management Server is lost. To minimize this risk, you can reduce the time interval when accounting information is sent. To do this, on the cluster object Logs and Masters > Additional Logging page, set a lower value for the Update Account Log every attribute.
Just to confuse things even further:
- Load Sharing Multicast Mode
- Load Sharing Unicast Mode
- New High Availability Mode
- High Availability Legacy Mode
- confusing names.
Load Sharing Multicast Mode
- still uses Multicast Ethernet, which causes major problems in switching hardware and effectively creates a denial of service attack on servers.
- Multicast Ethernet relies on frame replication in the switch hardware to be reliable. Many switches do not have high-performance replication capabilities, and it requires a lot of research to determine which switches have the function.
- Note that Multicast Ethernet is not the same as IP Multicast.
- Multicast Ethernet is BUM traffic and causes every server in a directly attached VLAN to receive, read and process the Ethernet frames. This causes significant processor and network load.
- Where a VLAN extends over many switches, the Ethernet multicast traffic must traverse the entire VLAN. Thus servers on a VLAN that have a default gateway on the CheckPoint are at risk from a traffic flood that looks like a DDOS attack.
- Use of Multicast Ethernet requires disabling IGMP snooping on some/all switches to prevent the multicast frames from being suppressed.
- Recommends the use of long obsolete partner products
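The multicast behaviour described above comes down to the I/G (group) bit: the least-significant bit of the first octet of a MAC address. When it is set, the frame is an Ethernet group address and is flooded like BUM traffic. A quick check, independent of any CheckPoint specifics:

```python
# Determine whether a MAC address is an Ethernet multicast (group) address.
# The I/G bit is bit 0 of the first octet; set = group/multicast address.
def is_multicast_mac(mac: str) -> bool:
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)

print(is_multicast_mac("01:00:5e:00:00:01"))  # True  – multicast
print(is_multicast_mac("00:1c:7f:aa:bb:cc"))  # False – unicast
```

Note that broadcast (ff:ff:ff:ff:ff:ff) also has the group bit set, which is why multicast-destination frames flood the VLAN in the same way.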
When clustering, the documentation makes many conditional statements about VPN connections. I would have little confidence in running “Load Sharing Multicast Mode” and VPN services on the same physical unit.
From this review of the documentation, CheckPoint doesn’t have a strong solution for clustering, and the weak documentation suggests that it’s not fit for critical use cases.
Recommended Clustering Mode is not fit for purpose
The use of Ethernet Multicast for clustering is a high-risk strategy; it operates in the same way as Microsoft NLB, which is comprehensively deprecated by any competent network engineer.
This type of so-called “load balancing” in “Load Sharing Multicast Mode” is a high-risk technology that forces switches to have IGMP disabled, to use static MAC mappings, or to rely on other more complex and hard-to-maintain/operate kludges. It should be avoided at all costs.
Note that a switch with IGMP disabled will create a Denial of Service condition by forcing all devices on the switch/VLAN to receive unknown unicast and multicast frames, which causes CPU processing. It will also cause packet floods on switch trunks. A nasty business indeed.
Poor performance and value
Because forwarding performance and price/performance are very poor, the use of CheckPoint firewalls in any network is not a good choice. The use of software “blades” to add other features was tested, but CPU and forwarding performance were dramatically impacted. The quoted cost of upgrading the hardware was so large that the customer is now considering replacing the CheckPoint with Fortinet or Palo Alto. An evaluation is under way.
Previous Article: Checkpoint/Nokia Firewall Clustering. Uh Oh. – http://etherealmind.com/checkpoint-nokia-firewall-cluster-xl/
Addendum – 20140903
Phoneboy (I don’t know his real name) has made a robust response to my criticism. He is a CheckPoint employee and erstwhile mouthpiece – http://securitytheater.phoneboy.com/a-more-balanced-view-of-check-point-clusterxl-load-sharing – which puts CheckPoint in a better light. It’s a good post and well worth reading. It doesn’t change my view of CheckPoint, since my experience in the real world doesn’t match the claims he makes.
Note that CheckPoint has not offered to help the customer or me to provide a better service / solution as at 20140903.
- This usually means removing CheckPoint firewalls from the network because without current manuals I am unable to fix CheckPoint problems. ↩