  1. #1
    Newbie
    Join Date
    Jun 2018
    Posts
    14

    Untangle and problems with a high number of sessions

    Our Untangle server configuration:
    HP Proliant 360e gen8 server
    2x Intel Xeon CPU E5-2450L 1.80GHz processors (16 cores in total)
    32 GB DDR3 RAM
    HP Ethernet 1GbE 4-port 366i Adapter (External network)
    1GbE (Quad Port) RJ45 NIC - Intel I350 adapter (Internal Network).

    We use this Untangle server to connect a rack of 30 servers to the internet (1 Gb/s dedicated line from the ISP).

    Our typical usage is about 100 Mb/s and ~100,000 sessions.
    All sessions are bypassed, and we use Untangle only for routing, without any active apps.

    Recently we noticed that with a higher number of sessions (150-200k) we start to have problems, even though throughput is only slightly higher at about 150 Mb/s.
    From the servers themselves everything seems to work fine. However, we can't connect to the servers via Remote Desktop from other locations; if we ping the Untangle IP from another location, pings are dropped (1-2 out of 5); and traceroute shows timeouts between the last ISP hop and our Untangle server.

    Untangle itself shows no signs of overload (~4 GB RAM used, CPU load of 3.0-4.0, swap usage 0%).

    At first we thought this was an ISP problem, but they checked from their end and said that their line works and that the problem is with our equipment. They are offering to sell us a managed Fortigate 60F to replace our Untangle server, but I doubt it is a hardware problem, because our Untangle server is far more powerful than that Fortigate box.

    If we lower the number of sessions (to about 100k), the problem is gone.
    It is also not throughput related: we tried downloading large files at 600 Mb/s for 12+ hours, and this did not cause any problems.
    Problems appear only when we go above 150-200k sessions.

    Maybe someone can help us investigate what could cause instability with a high number of sessions?

    Thank you in advance for any suggestions.

  2. #2
    Untangler jcoffin
    Join Date
    Aug 2008
    Location
    Sunnyvale, CA
    Posts
    8,842

    Try removing reporting. While the sessions are bypassed, the session events are recorded and most likely the transaction queue is maxed out.
    Attention: Support and help on the Untangle Forums is provided by
    volunteers and community members like yourself.
    If you need Untangle support please call or email support@untangle.com

  3. #3
    Newbie
    Join Date
    Jun 2018
    Posts
    14

    Quote, originally posted by jcoffin:
    Try removing reporting. While the sessions are bypassed, the session events are recorded and most likely the transaction queue is maxed out.
    Can you clarify how that should be done?
    We don't have the Reports app installed; the only reporting is in the main dashboard window, with session count, traffic speed, etc.

  4. #4
    Untangler jcoffin
    Join Date
    Aug 2008
    Location
    Sunnyvale, CA
    Posts
    8,842

    Oh, you already have the Reports app removed, so this is not the issue.

  5. #5
    Untangle Ninja sky-knight
    Join Date
    Apr 2008
    Location
    Phoenix, AZ
    Posts
    24,659

    If CPU and RAM stats report no overload, but you're experiencing frame loss... you have three things to look at...

    And they are ALL hardware.

    1.) Drive IO limitations (but without Reports installed, this is pretty much ruled out)
    2.) Network IO limitations. With 10 Gbit interfaces the drivers can get wonky sometimes. Do you have frame errors incrementing on an interface?
    3.) PCI bus limitations. This is the hardest nut to crack, and it seems exceedingly unlikely for a Gen8 HP Proliant... but it is always a possible issue.
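
    One way to check item 2 (whether error counters are incrementing on an interface) is to compare two snapshots of the kernel's per-interface counters. The following is only a sketch, assuming a Linux box where `/proc/net/dev` has the standard 16-column layout; the function names `parse_net_dev`, `grown` and `read_snapshot` are made up for illustration and are not part of Untangle.

```python
from typing import Dict

# Column positions in /proc/net/dev after the "iface:" prefix
# (assumed standard layout: 8 receive columns, then 8 transmit columns).
COLUMNS = {
    "rx_errs": 2, "rx_drop": 3, "rx_fifo": 4, "rx_frame": 5,
    "tx_errs": 10, "tx_drop": 11, "tx_fifo": 12,
}


def parse_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Extract error/drop counters per interface from /proc/net/dev text."""
    counters: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {key: int(fields[i]) for key, i in COLUMNS.items()}
    return counters


def grown(before: Dict[str, Dict[str, int]],
          after: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Return only the counters that increased between two snapshots."""
    result: Dict[str, Dict[str, int]] = {}
    for iface, now in after.items():
        prev = before.get(iface, {})
        diff = {k: v - prev.get(k, 0) for k, v in now.items() if v > prev.get(k, 0)}
        if diff:
            result[iface] = diff
    return result


def read_snapshot(path: str = "/proc/net/dev") -> Dict[str, Dict[str, int]]:
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_net_dev(handle.read())
```

    Taking `read_snapshot()` once during a quiet period and once while the session count is above 150k, then passing both to `grown()`, would show whether any interface is accumulating errors or drops during the problem window.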

    Now, given the information you've presented, there is one thing that concerns me. 3-4 load? On a 16 core server that means 3-4 processes WAITING for the CPU on a unit that has all traffic bypassed? That's nuts... you should need wirespeed saturation of multiple 10 Gbit interfaces to see that. That is... unless you've got a terrible NIC in play.

    I think you've got a torrent hound or something similar applying a DoS, and it's not the session count that's causing you to drop frames; you're saturating a pipe. You won't see this in your bandwidth reports because shield is doing its job, and it works even with bypassed stuff.
    Last edited by sky-knight; 06-21-2020 at 10:57 AM.
    Rob Sandling, BS:SWE, MCP
    NexgenAppliances.com
    Phone: 866-794-8879 x201
    Email: support@nexgenappliances.com

  6. #6
    Newbie
    Join Date
    Jun 2018
    Posts
    14

    Thank you for the good insight.

    Actually, our servers do data scraping, and each of them runs multiple virtual machines, which in turn use multiple threads to scrape data. So even though there are only 30 servers, in terms of network load it is more like 5000 regular internet users browsing at the same time, even though throughput is not that high (100-150 Mb/s). So it is quite comparable to how torrents work (a large number of connections from a single machine).

    I thought that with all traffic bypassed there should be no firewall/filtering issues, because we are just using Untangle as a router. However, as I understand from your post, Untangle might start to block some sessions even though they should be bypassed?

    As for hardware issues, I think they can be ruled out. We have another rack of 30 servers with similar load, but it uses a different server (I think it is an HP Proliant DL160 gen8) with different network adapters, but the same number of CPU cores and amount of RAM. We tested for this problem there and it behaves the same: above 150-200k sessions problems start to appear, but if we lower sessions to about 100k there are no problems at all.

    As for CPU load, I thought it works like in cPanel, where the number represents used cores, so I thought that 3.0-4.0 is OK, as we have 16 cores. But from your explanation it seems that it indicates overload...

    One more thing I want to add: you talk about 10 Gbit interfaces, but we use 1 Gbit interfaces, and our load during these problems is lower than 200 Mb/s, if that changes anything.

    I was also thinking about a problem with the ISP's method of delivering the optical line to us. It uses GPON, and even though we get fiber to our site, we have to use their optical-to-copper Ethernet adapter (Huawei EchoLife HG8010H) to connect Untangle, because there is no commercial SFP+ module that works with GPON (as far as I know). I thought maybe it gets overloaded by our usage, but the ISP said that it doesn't do any processing and can be used with any kind of load.

    Any ideas what else we could try to avoid this problem?

    Quote, originally posted by sky-knight:
    You won't see this in your bandwidth reports because shield is doing its job, and it works even with bypassed stuff.
    Does this mean that we should try to disable Untangle Shield and see if that helps?
    Last edited by falco; 06-21-2020 at 01:35 PM.

  7. #7
    Untangle Ninja sky-knight
    Join Date
    Apr 2008
    Location
    Phoenix, AZ
    Posts
    24,659

    Given what you've stated, you're going to have to argue with shield... it will cause problems. All it cares about are sessions per IP address. I'd say check the logs on it, but without the reports module you don't have any!
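
    To make the "sessions per IP address" point concrete, here is a rough sketch of a per-client session tally. It does not reproduce Shield's actual algorithm, which is not described in this thread; the input format (pairs of source and destination address), the function names and the threshold are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Session = Tuple[str, str]  # (client/source IP, destination IP), hypothetical format


def sessions_per_client(sessions: Iterable[Session]) -> Counter:
    """Count how many sessions each client IP currently holds."""
    return Counter(src for src, _dst in sessions)


def heavy_clients(sessions: Iterable[Session], threshold: int) -> List[Tuple[str, int]]:
    """List clients whose session count exceeds an illustrative threshold,
    busiest first."""
    counts = sessions_per_client(sessions)
    return [(ip, n) for ip, n in counts.most_common() if n > threshold]
```

    With 30 scraping servers producing 150-200k sessions, the average would be several thousand sessions per source address, which is the kind of per-IP figure such a tally would surface.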

    As for LOAD, be careful... a load of 3 means 3 processes waiting for CPU, which means that on a 16 core system you have 16 running and 3 in line... if Untangle has to wait on CPU, you're going to drop sessions. So... dig into your hypervisor and give Untangle priority scheduling; that should help. Untangle cannot wait in line, and no virtual router can... they must have priority over other VMs. If they don't, the entire network feels it when the hypervisor gets busy.

  8. #8
    Untangle Ninja
    Join Date
    Jan 2011
    Posts
    1,268

    sky, that's not how CPU load works. A load of 3 means there's 3 CPUs' worth of work in the scheduler's task list. On a single-core system that'd be very bad. On a quad-core or better system, that's completely fine. On a 16-core system, what Linux calls a load of 3 a Windows system would call 18.75% (that is to say, if there were just 4 CPUs that were otherwise equal, Linux would still call the load 3, while Windows would call it 75%).
    That said, a load of 3 does seem unusual for an Untangle not otherwise doing anything. It'd be interesting to see which processes are using CPU and what states they're in; if a bunch of time is being wasted in I/O wait ("wa" in top), that certainly points to a VM environment problem.
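
    The arithmetic above (load divided by core count) can be written out as a short sketch. The helper names are made up for illustration; `os.getloadavg()` and `os.cpu_count()` are standard-library calls, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_as_percent(load: float, cores: int) -> float:
    """Express a Linux load average as a percentage of total CPU capacity."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return 100.0 * load / cores


def current_load_percent() -> float:
    """Same calculation for the machine this runs on, using the 1-minute
    load average (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_as_percent(one_minute, os.cpu_count() or 1)


if __name__ == "__main__":
    print(load_as_percent(3.0, 16))  # the 16-core case from the post
    print(load_as_percent(3.0, 4))   # the 4-core comparison
```

    Note that the Linux load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so this percentage is an approximation of CPU utilization rather than an exact equivalent.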

  9. #9
    Untangle Ninja sky-knight
    Join Date
    Apr 2008
    Location
    Phoenix, AZ
    Posts
    24,659

    *Edit* yes... you're correct.

    A load equal to the CPU count is 100% utilization. Linux does count the stuff in process.

    I still think the CPUs are over-allocated in this case. The scheduling needs to be addressed.
    Last edited by sky-knight; 06-21-2020 at 03:46 PM.

  10. #10
    Newbie
    Join Date
    Jun 2018
    Posts
    14

    Thank you guys for all your input, it is really helpful!

    From @sky-knight's comments I understand that, with our usage, the default Shield policies might start blocking our traffic.
    I will try disabling Shield and check whether it helps.

    As for CPU load, it varies quite a lot for us. Sometimes it is just 0.1-0.5, but sometimes it is 3.0-4.0, even though our servers do the same work 24/7 and network load is quite stable, unless we manually give them more work and the session count increases. Maybe there is a way to investigate which processes the CPU is working on?

    Also, Untangle is installed as the only OS on our server (not as a virtual machine), so there is nothing else that could increase its usage.
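
    On the question of where the CPU time goes: besides watching `top`, one option is to compare two snapshots of the aggregate `cpu` line in `/proc/stat` and see how the time splits between user, system, I/O wait and softirq. This is a sketch only, assuming a Linux kernel that reports at least the first eight columns; the function names are hypothetical. It gives a system-wide breakdown by state, not a per-process one.

```python
from typing import Dict

# First eight columns of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Return cumulative tick counts per CPU state from /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def state_shares(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Percentage of elapsed CPU time spent in each state between snapshots."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

    Reading `/proc/stat` once during a 0.1-0.5 period and again during a 3.0-4.0 period, and comparing the shares for each interval, would show whether the extra load appears as `iowait`, `system`, or `softirq` time.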
