
ADSL MTU mystery - partial (complete?) solution



I am CCing this email to the Bezeq Intl. support address. The specific 
problem was solved for the person involved, but perhaps they will 
choose to fix the overall problem.

I had a call from a good friend, who connects to the Internet through a 
Linux box doing ipchains masquerading over a dial-up (56K) line. No 
ADSL, or any other fancy stuff.

It turns out that, starting two days ago, whenever a larger-than-trivial 
mail message arrives for him, Outlook Express hangs while trying to 
download it over POP3. Yes, they have a W2K machine behind the Linux 
gateway.

I SSHed into his box and spent the next hour and a half or so trying to 
figure out what was going on. tcpdump proved just useful enough to 
capture the packets into a file, which I then scped to my machine and 
analyzed with ethereal.

It turns out that the POP3 server (mail.bezeqint.net) was sending 
packets with the "Don't Fragment" bit set. So far, no great surprise 
(PMTU discovery). It also turned out that packets amounting to 1500 
bytes were dropped. I know this because I did receive the packet 
following each dropped 1500-byte packet, and I could calculate the 
missing packet's size from the difference in sequence numbers.
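
The sequence-number arithmetic can be sketched roughly as follows. This 
is only an illustration: the function names and the sample numbers are 
made up, and it assumes plain 20-byte IPv4 and 20-byte TCP headers with 
no options. If several consecutive packets were lost, the gap is their 
combined payload, not the size of a single packet.

```python
# Sketch only: names and sample values are hypothetical, not from the capture.
SEQ_MODULUS = 2 ** 32        # TCP sequence numbers wrap around at 32 bits
ASSUMED_HEADER_BYTES = 40    # assumption: 20-byte IPv4 + 20-byte TCP header, no options


def missing_payload_bytes(prev_seq, prev_len, next_seq):
    """Bytes of TCP payload missing between two packets that did arrive.

    prev_seq / prev_len: sequence number and payload length of the last
    packet received before the gap.
    next_seq: sequence number of the first packet received after the gap.
    """
    expected_seq = (prev_seq + prev_len) % SEQ_MODULUS
    return (next_seq - expected_seq) % SEQ_MODULUS


def dropped_packet_size(prev_seq, prev_len, next_seq):
    """Estimated on-the-wire IP packet size, assuming exactly one packet was lost."""
    return missing_payload_bytes(prev_seq, prev_len, next_seq) + ASSUMED_HEADER_BYTES


if __name__ == "__main__":
    # A 100-byte segment at seq 1000, then the next one seen starts at 2560.
    print(dropped_packet_size(1000, 100, 2560))  # 1500
```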

Packets as big as 1440 bytes seemed to go through with no problem.

My conclusion was that the mail server was configured to use a local MTU 
of 1500 bytes. One of the routers along the path between the POP server 
and my friend's IP had a lower MTU (can anyone guess why?), and since 
the packets were marked "DF", it dropped them. A router (possibly the 
same one) was configured to block (probably) all ICMPs, and so the 
"Fragmentation Needed" ICMP never reached the POP server, which simply 
retransmitted the same packet again and again.

Had I had the desire, I could have found out exactly which router had 
the lower MTU, and which one blocked the ICMPs. I don't think it's my 
job, however. I have notified Bezeq Intl. of the problem by phone (more 
details later), as well as CCing them on this email, so they should be 
able to solve the problem on their own.

Two points of interest are:
1. How come no "legitimate" (i.e. Windows-only) user came across this 
problem?
2. How to solve the problem for my friend?

Question 2 seemed easier. We have all heard of this problem before, and 
know that you need to set the MTU lower. Doing this on the Linux gateway 
indeed allowed us to telnet to port 110 from the gateway and download 
the message. It did not help the Windows machine, however. Why? And why 
would lowering the MTU solve the problem at all, when the problem was 
with packets coming from the server? Setting the MTU on our machine 
shouldn't affect the size of incoming packets.

The answer lies in a TCP option called "MSS", or "Maximum Segment 
Size". It is exchanged between the hosts during connection 
establishment: each side tells the other not to send segments bigger 
than X. The value each host advertises as its MSS is derived from the 
MTU of the interface it sends on (the MTU minus the IP and TCP headers, 
normally 40 bytes)!! Lowering the MTU on the Linux machine caused it to 
advertise a lower MSS, which meant that the server's packets never 
required fragmentation, and the problem was bypassed.
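
As a rough sketch of that relationship (not taken from any particular 
TCP stack): the function names are made up, and the 40-byte figure 
assumes IPv4 and TCP headers without options. The 1400-byte client MTU 
in the example is likewise just an illustrative value.

```python
# Sketch only: illustrative names, assuming 20-byte IPv4 and 20-byte TCP headers.
IPV4_HEADER_BYTES = 20
TCP_HEADER_BYTES = 20


def advertised_mss(mtu):
    """MSS a host would typically advertise for an interface with this MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def largest_packet_from_peer(local_mtu, peer_mtu):
    """Largest IP packet the peer will send us on this connection.

    The peer is limited both by the MSS we advertised (from our MTU)
    and by what its own interface MTU allows.
    """
    segment = min(advertised_mss(local_mtu), advertised_mss(peer_mtu))
    return segment + IPV4_HEADER_BYTES + TCP_HEADER_BYTES


if __name__ == "__main__":
    print(advertised_mss(1500))                  # 1460
    # Server MTU 1500, client MTU lowered to a hypothetical 1400:
    print(largest_packet_from_peer(1400, 1500))  # 1400
```

So once the connecting host advertises the smaller MSS, the server never 
builds a 1500-byte packet in the first place, and the broken path MTU 
discovery no longer matters for that connection.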

This also explains why lowering the MTU on the Linux box's interface 
didn't help the clients behind it. Lowering the MTU on the Linux side 
did limit the size of the packets it sends out, but it did not change 
the MSS that the hosts behind it advertise at connection setup, and 
therefore did not work around the configuration bug in the ISP's 
routers.

As for question number 1: why don't Windows machines suffer from the 
same problem? The answer is that the default MTU for a PPP connection on 
Windows is 512 bytes.

I called Bezeq International at 20 past midnight. I waited approx. 10 
minutes on hold. In the end I was answered by a nice girl. To save 
myself a lot of explaining, I started by asking her whether she knew 
what "Don't Fragment", "Fragments" and "MSS" meant. She started guessing 
about the "fragmentation needed" thing, and I told her that I wanted to 
report a router misconfiguration, and that she would save me time if she 
transferred this to someone who could actually fix it.

About 5 minutes later a technician called "Assi" called me back. It took 
him a little while of totally not following me, and then he said "ok, 
we'll take care of it". Whether he actually understood me or just got 
tired of me is left as an exercise for the reader.

Summary:
A. The connectivity problems people have been experiencing are a result 
of routers dropping ICMPs, and (possibly other) routers needing smaller 
MTUs. This is a configuration problem at the ISP, and is not a result of 
misconfiguration on your end (ADSL, NAT, or otherwise). Some ISPs (and 
Bezeq Intl. does, sadly, fall under that category) won't recognize this.
B. Luckily, you can work around this problem by lowering the MTU on ALL 
MACHINES that participate in the communication. This works not because 
the MTU is too high, but because the MSS is derived from the MTU. It 
will therefore not help to lower the MTU only on the gateway, or to 
change the MTU after the connection is already established.
C. I tried to contact Mulix on IRC to find out how to lower the MTU on 
Windows 2000. While he didn't know the answer to that one, he was able 
to tell me that kernel 2.4.4 and higher is capable of rewriting the MSS 
(I think that's what it means) on packets going through the gateway. 
There is also a module for 2.2 that does the same (STFW for clamp_mss).
D. There is a (very incomplete) KB article at Microsoft's site that 
explains how to change the MTU on Windows. The very short summary is: 
change the registry at (for NT or 2000) 
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces
Find the right interface (by its IP address), and add a DWORD value 
called "MTU". Set it to the right MTU for that interface, and reboot 
the machine (this is Windows, remember?). As usual, apply at your own 
risk, I will not be liable blah blah blah.
E. ISP support can be downright MEAN when they want to. Bezeq Intl. 
asked my friend's mother for her WINDOWS REGISTRATION KEY!!!! What 
possible relevance it can have, no one knows.

I hope this helps. It certainly shed some light on the mystery for me. I 
believe the offending router drops all ICMP packets, of any type: 
traceroute out of the machine miraculously ceases to return replies two 
hops away from the machine.

The problem appeared all of a sudden, about two days ago. My first 
guess was that traffic to the POP server had been routed through 
routers residing in downtown New York, in the World Trade Center 
towers. And yet, no one at Bezeq Intl. was in a position to know 
anything about it.

            Shachar



=================================================================
To unsubscribe, send mail to linux-il-request@linux.org.il with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail linux-il-request@linux.org.il