DrayTek UK Users' Community Forum

Help, Advice and Solutions from DrayTek Users

2820 random reboots and lockups 3.3.0 & 3.3.1.2

  • techhead
  • Topic Author
  • Junior Member
04 Jun 2009 11:08 #56199 by techhead
We have four v2820 units that I installed at a customer's sites. At random intervals they simply reboot for no apparent external reason, and three of them have gone unresponsive and required a power cycle. UPSs are also in use.

The firmware versions used were the original 3.3.0 that they shipped with and now the recently released 3.3.1.2, which we moved to because of the need to pipe syslog messages down the VPN tunnels.

They are each used in conjunction with a Cisco 878 acting as an SDSL modem/router on the WAN2 port. It provides enough syslog output to tell me that the Cisco, which is on the same power source as the v2820, is not experiencing any power interruptions, and it reports when the reboots occur because its Ethernet port goes down and then up again.
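
A minimal Python sketch of how those down/up events could be pulled out of the Cisco syslog to timestamp each v2820 reboot. The line layout (a `YYYY-MM-DD HH:MM:SS` prefix added by the syslog server, followed by a Cisco-style "Interface X, changed state to up/down" message) and the interface name `FastEthernet0` are assumptions for illustration, not copied from the real logs.

```python
import re
from datetime import datetime

# Assumed line shape (illustrative only):
#   2009-06-03 14:02:11 %LINK-3-UPDOWN: Interface FastEthernet0, changed state to down
LINK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"Interface (?P<ifname>[^,\s]+), changed state to (?P<state>up|down)\s*$"
)


def find_outages(lines, ifname):
    """Return a list of (down_time, up_time, seconds) for each down/up pair on ifname."""
    outages = []
    down_at = None
    for line in lines:
        m = LINK_RE.match(line)
        if not m or m.group("ifname") != ifname:
            continue  # not a link-state line for the port facing the v2820
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        if m.group("state") == "down":
            down_at = ts
        elif down_at is not None:
            outages.append((down_at, ts, (ts - down_at).total_seconds()))
            down_at = None
    return outages
```

Feeding it the saved syslog file line by line gives one entry per reboot, which can then be compared against the UPS and Cisco power logs.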

The devices all use load sharing according to speed, which is set explicitly. We have a pair of IPsec VPN tunnels set up from each of three of the installations to a central office housing the servers etc.

Each pair of VPNs is configured with the primary L2L profile set as always on, so it dials aggressively to the head office SDSL, first via the site's WAN2 and then falling back to its WAN1. The secondary L2L profile has its timeout set to 9999, so it is almost always on but does NOT dial aggressively. It dials the head office ADSL WAN1, first via the site's WAN2 or, if WAN2 is down, via its WAN1, and only if the aggressive primary profile has failed to connect. If head office WAN2 is restored, the aggressive primary L2L profile connects and causes the backup L2L profile to drop and back off.
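
The intended failover order can be sketched as a small decision function. This is only a simplified model of the behaviour described above, assuming that "link up" means the path is usable; the function name and the tuple it returns are made up for illustration and are not DrayTek settings.

```python
def select_tunnel(site_wan1_up, site_wan2_up, ho_wan1_up, ho_wan2_up):
    """Return (profile, local_wan, head_office_wan), or None if no tunnel can form.

    Models the intended preference order only:
      - both profiles dial out via the site's WAN2 first, then its WAN1
      - the primary (always-on) profile targets head office WAN2 (SDSL)
      - the backup profile targets head office WAN1 (ADSL) and is used
        only when the primary cannot connect
    """
    if site_wan2_up:
        local = "WAN2"
    elif site_wan1_up:
        local = "WAN1"
    else:
        return None  # site has no working WAN at all

    if ho_wan2_up:
        return ("primary", local, "WAN2")
    if ho_wan1_up:
        return ("backup", local, "WAN1")
    return None  # head office unreachable on both lines
```

For example, `select_tunnel(True, True, True, False)` returns `("backup", "WAN2", "WAN1")`, and once head office WAN2 comes back the same call with `ho_wan2_up=True` selects the primary profile again.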

Load balancing policy directs ISP-specific DNS requests down the respective WANs at each site, with a default policy to send everything not otherwise specified down WAN1 and the option to use the other WAN if WAN1 goes down for any reason. The head office also has a few firewall filters in place to control outbound SMTP traffic from workstations and to control access to open ports. Not all sites have complicated firewall filtering in place, but all sites have experienced the spurious reboots, the lockups requiring an on-site power cycle to restore comms, or both.

It could be days between these events, but they can happen at any time. For example, if a clerk is in mid-print, the printer has timed out before the devices are passing data again, so time, paper and toner are wasted.

I have noticed some significant behaviour differences between 3.3.0 and 3.3.1.2 in the firewall filters and DoS protection, which have forced me to redesign the firewall rules for the new firmware. In particular, 3.3.1.2 by default applies all firewall filter rules, including DoS protection, to the VPN tunnel contents, with no option to disable that behaviour. Because the firewall is applied to the VPN tunnels, it incorrectly applies application and IM filtering through the tunnels and blocks legitimate Ethernet printing data streams that the application filter confuses with something else. I couldn't narrow down which filter was the cause, as the sites are remote and always in use during business hours, so they cannot be taken offline to diagnose faulty firmware filtering. I had to disable the application and IM filtering.
I also noticed that 3.3.1.2 does not perform DNS caching while 3.3.0 did, which forces the use of third-party DNS servers instead of the ISPs' dedicated DNS servers in a multi-ISP dual-WAN solution.

Regardless of the new undocumented features in 3.3.1.2, the units still suffer the random reboots or lockups, with no syslog indication that they were under any particular attack from either WAN or LAN.

Any advice, insights or shared experiences would be appreciated.

04 Jun 2009 12:22 #56202 by louis-m
I have had some similar issues to the above with 3.3.2_rc5.

1. No reboots as yet, but the odd lockup even though the router was showing everything as fine, i.e. LEDs, status etc.
2. CSM blocking VPNs also.

Not had a real chance to look into it yet, but keeping an eye on it.

2820 = 3.3.2_RC5
2950 = 3.2.4

15 Jun 2009 10:31 #56318 by techhead
I note there is a thread with similar problems running at:

http://www.forum.draytek.co.uk/viewtopic.php?p=56316

Had another lockup yesterday evening, with no remote access connectivity over the WAN; the device was power cycled first thing this morning when staff arrived on-site. One strange thing: although there was no connectivity using WAN1 or WAN2, the Syslog 4.4.1 server seems to show some unaccounted activity on the graph!

21 Jul 2009 16:09 #56816 by techhead
I believe I may have found the cause of the instability I have been experiencing with deployed V2820 devices.

It is a resource depletion vulnerability.
It turns out that, unknown to any of the logging facilities, a malicious user can attack any port used by the remote admin facility, even if it is supposedly locked down to a safe set of IP addresses or subnets, because of the flawed way the developers have implemented the access control filter.

The filter is placed AFTER the TCP stack, so the initial TCP connection phase is allowed to occur regardless of the source IP. Only AFTER the TCP handshake has completed is the source IP compared against the access list rules; if it matches, the active connection is then handed to the relevant management service. If it is not part of the permitted group, the connection is simply closed.

The problem here is that absolutely no syslog events are generated by any of the phases of the remote admin connection, so there is no indication whether the system is under attack or being used by an authorised individual on a permitted IP range.

It would be a simple matter to deny service and either reboot or crash these routers with a Linux-based machine if they have any remote admin activated and the ports used were discovered. One would simply have to half-open a management interface port a few hundred times and be pretty much assured that at least one internal buffer would be overflowed.

No events would be recorded in the syslog server and there would be very few clues as to what was going on.

The workaround is to NEVER use remote admin on default ports or on any ports used by any other standard service. Choose random ports above port 2000.
The fix would be to redesign ALL DrayTek firmware so that the access list filter for remote admin services is applied as soon as the first TCP_SYN packet is received. If the originating IP of the TCP_SYN packet does not match the access list, the TCP_SYN should be silently ignored and no resources allocated. Only if the originating IP matches the access list should the TCP connection handshake be completed and access be permitted.
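
To make the difference in ordering concrete, here is a small in-memory Python model of the two designs. It is only a sketch of the behaviour described in this post, not DrayTek firmware code: the `AdminListener` class, the `max_pending` table size and the return strings are all hypothetical, and the addresses are documentation ranges.

```python
import ipaddress


class AdminListener:
    """Toy model of a remote admin port with a fixed table of pending handshakes."""

    def __init__(self, allowed, check_on_syn, max_pending=256):
        self.allowed = [ipaddress.ip_network(n) for n in allowed]
        self.check_on_syn = check_on_syn  # True = proposed fix, False = behaviour described
        self.max_pending = max_pending    # hypothetical table size
        self.pending = 0                  # slots held by unfinished handshakes

    def permitted(self, src):
        addr = ipaddress.ip_address(src)
        return any(addr in net for net in self.allowed)

    def on_syn(self, src):
        """Handle an incoming SYN and report what happened to it."""
        if self.check_on_syn and not self.permitted(src):
            return "dropped"      # silently ignored, nothing allocated
        if self.pending >= self.max_pending:
            return "exhausted"    # table full, even permitted admins are locked out
        self.pending += 1
        return "allocated"        # a SYN-ACK would be sent and state is held

    def on_handshake_complete(self, src):
        """Handshake finished: release the slot, then apply the access list."""
        self.pending = max(0, self.pending - 1)
        return "admin service" if self.permitted(src) else "closed"
```

With `check_on_syn=False`, SYNs from an address outside the access list still take up slots until the table is full, and nothing about them needs to be logged. With `check_on_syn=True`, the same SYNs are dropped and `pending` stays at zero, so a permitted address can still connect.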

This does need to be addressed quite urgently; it is a significant design flaw and will be submitted to vulnerability reporting sites in due time.

Moderators: ChrisSami