Troubleshooting End to End Connectivity In Three Steps
From LB Wiki
You’re stumped. There’s a problem with your infrastructure, and you’re not positive what it is. You checked a few things out, but the symptoms befuddle you. You’re pretty sure it’s not the load balancer, but everyone is pointing at you, and you’ve got no proof.
This situation is endemic to modern web infrastructure. There are so many different components involved that it's rare for any one person to understand even most of what's going on. A large installation will have one or more specialists for the database, the application programming, the network, security, the underlying servers themselves, and more. When something goes wrong, the cause could lie in any one of those areas.
Load balancers are in a particularly vulnerable place for two reasons:
- All client traffic goes through the load balancers
- They are part application device, part network device, and as a result they generally aren't well understood
So what is a load balancer administrator to do about this?
End To End
I’ve been in that situation so many times that I’ve developed a checklist that can be run through quickly. This checklist has a few benefits:
- It’s methodical and process-based, so it can pick up both the obvious and the oddity.
- When working in an environment where there are different groups responsible for different aspects of the infrastructure, this provides clear demarcation for them and helps with interaction.
- If the problem lies with the load balancer, this troubleshooting will point to the problem in about 90% of cases.
- If the problem lies elsewhere, this troubleshooting will provide hard evidence to back up that claim.
The heart of this checklist is the 4-step process basic to all load balancing. It describes how traffic traverses the load balancer (with the exception of DSR, Direct Server Return).
Diagram: the 4-step process
The checklist starts at the beginning, from the client's perspective, and moves through the entire connection from end to end, testing to make sure everything is hunky dory along the way:
- Make sure the load balancer sees the connection
- Determine how the load balancer handles incoming connections (Layer 4 or Layer 7)
- Check connectivity from the load balancer to the server
Going through the list, if you find a problem, resolve it before you continue. There may be other problems, but you’ll need to address the first one you encounter before moving on; otherwise there will be too many variables.
This checklist is particularly useful in situations where you don’t have access to all of the equipment on the network, such as large enterprises where separate groups are responsible for areas like firewalls, network routing, switch infrastructure, and servers.
Prepping for the Checklist
The three tools you’ll need to run through this checklist are the following:
- telnet (installed by default on most Windows systems, although with Windows Vista telnet needs to be installed manually).
- openssl.
- tcpdump (or some other network sniffer).
It’s best to run tcpdump on the load balancer itself (most load balancers include it), but if that’s not possible, set up a network tap of some sort. For this checklist, we’ll assume you’re using a load balancer with tcpdump.
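Before starting, it may help to confirm that the tools are actually present on the machine you're testing from. The following is a minimal sketch, not part of the original checklist: missing_tools is a made-up helper name, and it relies only on the standard "command -v" lookup.

```shell
#!/bin/sh
# Print the name of every tool passed as an argument that is NOT found on
# the PATH. Prints nothing when all of them are available.
# missing_tools is a hypothetical helper name, not a standard command.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running "missing_tools telnet openssl tcpdump" should list whichever of the three still needs to be installed.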
Step 1: Confirm that the connection is reaching the load balancer
Step 1 is simply to ensure that connections are going where they are supposed to go. This is the obvious thing to check when the connection times out, but it is also worth doing when you get a definite reaction (connection accepted or connection refused): it proves that the load balancer is the one sending the response, and not some other device that's inadvertently hijacking the connection.
In this test, we’re only concerned with whether the connection is reaching the load balancer, step 1 in the diagram above. To do this, run tcpdump on the load balancer with the following options:
tcpdump -i [interface] -n host [ip address of virtual] and port [port of virtual]
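On a busy virtual service the capture above scrolls by quickly. As a sketch, the filter can be narrowed to show only SYN and RST packets, which is enough to see whether connection attempts arrive and whether something refuses them. The helper below only prints the command so you can review it before running it. The name vip_capture_cmd is made up, and the interface, IP, and port in the usage line are example values; the tcp[tcpflags] expression is standard pcap filter syntax.

```shell
#!/bin/sh
# Print (do not run) a tcpdump command that watches one virtual service and
# shows only packets with the SYN or RST flag set.
# Arguments: interface, virtual IP, virtual port.
# vip_capture_cmd is a hypothetical helper name.
vip_capture_cmd() {
    iface=$1 vip=$2 port=$3
    printf "tcpdump -i %s -n 'host %s and port %s and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'\n" \
        "$iface" "$vip" "$port"
}
```

For example, "vip_capture_cmd eth0 192.168.0.200 80" prints:
tcpdump -i eth0 -n 'host 192.168.0.200 and port 80 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'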
Then telnet to the IP and port of the virtual service on the load balancer. If you’re doing SSL termination at the load balancer, use telnet anyway, as we’re just testing for a valid TCP connection. It’s best to do this from a subnet other than the virtual service's, so as to eliminate routing issues.
You can try ping, but it really doesn’t tell us anything. For one, ICMP is not the protocol we’re concerned with. Firewall rules may also block ICMP and not TCP, or block TCP and not ICMP. Either way, telnet works much better because at the TCP level it mimics a connection from a browser.
Typically, one of three things will happen:
- Connection refused
- Nothing connects, and the operation times out
- A connection is made
What we’re looking for is whether the load balancer sees the attempted incoming connection. If it doesn’t, it may be a routing issue (either Layer 3 or even Layer 2), or it may be that a firewall rule is blocking the connection. In any event, if you’re not seeing the connection, stop at this step and figure out why. If you’re dealing with different networking groups, you can bring them the output from tcpdump so they’ll have something substantive to go on.
If you do see the incoming connection, move on to the next step.
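The three telnet outcomes listed in this step can also be told apart without an interactive session, which is handy when you want to repeat the test while watching tcpdump. This is a rough sketch under stated assumptions: it needs bash (for its /dev/tcp pseudo-device) and the coreutils timeout command, and classify_probe and probe_tcp are made-up helper names.

```shell
#!/bin/sh
# Map the exit status of a connection attempt to one of the three outcomes.
#   0    -> the TCP handshake completed
#   124  -> coreutils `timeout` gave up waiting (nothing answered)
#   else -> the attempt failed quickly; usually "connection refused", but
#           other immediate errors (e.g. no route to host) land here too
classify_probe() {
    case $1 in
        0)   echo "connected" ;;
        124) echo "timed out" ;;
        *)   echo "refused" ;;
    esac
}

# Try to open a TCP connection to host $1, port $2, waiting at most
# $3 seconds (default 5), and print the outcome.
# Assumes bash and coreutils `timeout` are installed.
probe_tcp() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
    classify_probe $?
}
```

For example, "probe_tcp 192.168.0.200 80" tests the virtual service used in this article. Whatever it prints, the tcpdump running on the load balancer is still what tells you whether the attempt arrived.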
Step 2: How Is The Load Balancer Handling The Connection?
Load balancers behave differently depending on whether the virtual service is configured for Layer 4 or Layer 7. A Layer 4 virtual service will not complete a TCP connection unless there's a connection all the way through to a real server (Diagram 1).
In a Layer 7-configured virtual service, as long as you can reach the IP and port of the load balancer's virtual service, you’ll probably get an established TCP connection regardless of what's going on with the back-end servers (although some load balancers allow you to change this behavior). With Layer 7, the load balancer acts as a proxy, so two separate TCP connections are involved (Diagram 2):
- The connection from the client to the load balancer
- The connection from the load balancer to the real server
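The difference between the two modes gives a rough behavioral test: if every real server is known to be down and the virtual service still completes the handshake, the load balancer itself is answering, as a proxy would. The sketch below only encodes that rule of thumb. guess_vip_mode is a made-up helper, its answers are deliberately hedged, and some products can be configured to behave differently.

```shell
#!/bin/sh
# Guess how a virtual service is operating from one observation.
#   $1 = result of a TCP connect to the virtual service:
#        "connected", "refused", or "timed out"
#   $2 = state of the real servers at that moment: "up" or "down"
# guess_vip_mode is a hypothetical helper, not a vendor tool.
guess_vip_mode() {
    case "$2:$1" in
        up:*)
            # With a reachable server, both modes accept connections,
            # so the observation says nothing about the mode.
            echo "inconclusive" ;;
        down:connected)
            # Handshake completed with no server behind it.
            echo "probably layer 7" ;;
        down:refused|down:"timed out")
            echo "probably layer 4" ;;
        *)
            echo "unknown input" >&2
            return 2 ;;
    esac
}
```

For example, "guess_vip_mode connected down" prints "probably layer 7".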
Most load balancers don’t tell you explicitly whether you’re running in Layer 4 or Layer 7 mode. They switch between the two automatically depending on how you configure the virtual service. Only one load balancer that I know of (KEMP Technologies) tells you explicitly. With the others, you’ll need to use your powers of deduction.
Generally, if you’re using any type of cookie persistence, SSL termination, content rules, or programming language on the load balancer, you’re in Layer 7. In F5’s BIG-IP V9, when you set up a virtual server, there are a few options for the type of virtual server to set up.
"Standard" and "Performance HTTP" are Layer 7 configurations, while the others are Layer 4-limited. If you select the Layer 4 configurations, you'll notice many options disappear, since certain functionality requires Layer 7 (such as cookie persistence).
So what happens when you try to connect to a Layer 7 virtual service that has connectivity problems with real servers on the back end?
First, the connection will be accepted:
system1> telnet 192.168.0.200 80
Trying 192.168.0.200...
Connected to testvip (192.168.0.200).
Escape character is '^]'.
A valid TCP connection has been established. If you’re troubleshooting and you get this, you may assume that the device you’ve connected to is the server. But this is not the case. You never connect directly to the server when the load balancer operates in Layer 7. In this example, there are no real servers online. The BIG-IP shows all of the servers as unavailable. Yet I was still able to make a connection.
Now I’ll do a simple “GET /”. What happens with this “GET /” depends on the vendor, and even on the version. Take, for example, BIG-IP Version 4 and BIG-IP Version 9.
With version 9, I type in:
GET /
As soon as I hit <Enter>, the BIG-IP closes the connection by sending a reset packet.
Connection closed by foreign host.
Checking tcpdump, I see the reset sent by the F5:
13:37:36.679409 192.168.0.200.80 > 192.168.2.2.33962: R 1:1(0) ack 7 win 4387 (DF)
BIG-IP version 4 behaves slightly differently. The connection stays open, and you can type to your heart’s content, but nothing will come back. After about 15 seconds, the V4 BIG-IP resets the connection. This can make you think that the web server is hanging, but again, what is happening is that you are talking to the load balancer, which has no available server to hand the request to.
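When the capture is long, it can help to pick out resets that come from the virtual service itself, like the one in the version 9 capture above. The sketch below checks a single line of tcpdump text output. It assumes the address.port form that tcpdump -n prints, and it accepts both the older flag style ("R 1:1(0)") and the newer one ("Flags [R.]"). is_reset_from_vip is a made-up helper name.

```shell
#!/bin/sh
# Succeed (exit status 0) if a line of tcpdump text output shows a TCP
# reset sent FROM the virtual service.
#   $1 = virtual IP, $2 = virtual port, $3 = one line of tcpdump output
# is_reset_from_vip is a hypothetical helper name.
is_reset_from_vip() {
    vip=$1 port=$2 line=$3
    case $line in
        # Older tcpdump output: "... VIP.PORT > client.port: R 1:1(0) ..."
        *" $vip.$port > "*": R "*) return 0 ;;
        # Newer tcpdump output: "... VIP.PORT > client.port: Flags [R.], ..."
        *" $vip.$port > "*"Flags [R"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A line where the virtual service is the destination rather than the source does not match, so a client's own packets are ignored.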
Step 3: Connectivity From The Load Balancer To The Server
First off, perform some sort of test to see if the real servers are even operational. Open up a browser, plug in the IP address (and port), and see if you can bring up a site. If that doesn’t work, or if the server uses a non-HTTP protocol, use telnet to see if you can get a TCP connection. If you can’t, figure out why before going further. If the servers aren’t responding, you’re obviously not going to get far with a load balancer.
If you can get to the servers, log onto the load balancer and telnet from the load balancer to the real server on the port configured. Try connecting to at least one of the servers in a multi-server group.
> telnet 10.0.0.100 80
Again, one of three things will likely happen:
- You’ll get a valid TCP connection. If this occurs, try to make an HTTP request (assuming it's a web server). A simple “GET /” and <enter><enter> (hit enter twice) should suffice to get some sort of response. As long as you get some sort of response, that’s good.
- You’ll get a connection refused. This is usually because the server isn't listening on that particular IP/port, or a firewall is blocking you and sends a TCP RESET on the server's behalf.
- Your connection will time out. For whatever reason, packets aren’t getting to the server. This can be either a firewall issue or some Layer 2/3 routing issue. If you have access to run tcpdump or another type of network trace on the server, see if you can see the incoming connection from the load balancer.
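For the first outcome, the manual “GET /” can be scripted so the test is easy to repeat against each real server. This is a sketch under assumptions: it sends an HTTP/1.0 request line rather than the bare “GET /” typed above, so that the server answers with a status line that can be checked, and http_probe_request and got_http_response are made-up helper names.

```shell
#!/bin/sh
# Print a minimal HTTP/1.0 request for the given path (default "/").
# The second CRLF is the blank line that "hit enter twice" sends.
http_probe_request() {
    printf 'GET %s HTTP/1.0\r\n\r\n' "${1:-/}"
}

# Read a server response on stdin and succeed if its first line looks like
# an HTTP status line. Any status code counts: the point is only that
# something answered.
got_http_response() {
    IFS= read -r first_line || [ -n "$first_line" ] || return 1
    case $first_line in
        HTTP/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Assuming a netcat-style nc is installed on the load balancer, something like "http_probe_request / | nc 10.0.0.100 80 | got_http_response && echo answered" would run the check against the example server above.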
Doctor House, I Presume
After running these tests, you should have a much better picture of what’s going on with your network. If the issue was caused by the load balancer, these tests will probably have exposed the root cause. If it wasn’t the load balancer, then you’ve got evidence to prove that it’s not.



