Stability bug in UVM JVM (Crashes, Orphans OpenVPN Processes) in Untangle Firewall
Perhaps someone can help. I opened a case as I do have an Untangle home license. They asked me to reinstall before they would help me, which I did, and then they proceeded to say they could no longer help me because they changed their policy to disallow Untangle home licenses from access to technical support.
UVM (JVM) crashes daily, sometimes many times per day, orphaning the openvpn processes and stopping all communication through TunnelVPN service. The only remedy is to 'pkill openvpn', and then toggle (restart) the TunnelVPN service to re-launch the processes. They are normally owned by UVM JVM, but since that crashes, the openvpn processes are orphaned. UVM then attempts to start new openvpn processes, while the existing orphaned processes are running, and those crash while attempting to start, because the management socket used for UVM to communicate with openvpn and read the connection statistics and manage the instances is already in use with the orphaned processes.
I'll demonstrate below, as I supplied Untangle, that there is an application bug with their java code. Forgive me, as there is a lot of information here.
Example below of the orphaned openvpn processes. You can see PID 1 owns them. Normally it is the java application UVM. You can see the ports 2200, 2202, etc, are the management ports that are bound by openvpn and the reason why the new openvpn instances fail to start.
Code:
root 97546 1 0 Dec09 ? 00:09:15 /usr/sbin/openvpn --config /usr/share/untangle/settings/tunnel-vpn/tunnel-202/tunnel.conf --writepid /run/tunnelvpn/tunnel-202.pid --dev tun202 --cd /usr/share/untangle/settings/tunnel-vpn/tunnel-202 --log-append /var/log/uvm/tunnel.log --auth-user-pass auth.txt --script-security 2 --up /usr/share/untangle/bin/tunnel-vpn-up.sh --down /usr/share/untangle/bin/tunnel-vpn-down.sh --management 127.0.0.1 2202 --bind --local [IP ADDRESS] --port 0
root 97548 1 0 Dec09 ? 00:00:39 /usr/sbin/openvpn --config /usr/share/untangle/settings/tunnel-vpn/tunnel-200/tunnel.conf --writepid /run/tunnelvpn/tunnel-200.pid --dev tun200 --cd /usr/share/untangle/settings/tunnel-vpn/tunnel-200 --log-append /var/log/uvm/tunnel.log --auth-user-pass auth.txt --script-security 2 --up /usr/share/untangle/bin/tunnel-vpn-up.sh --down /usr/share/untangle/bin/tunnel-vpn-down.sh --management 127.0.0.1 2200 --nobind
.....
The following errors show up in tunnel.log when the UVM restarts and tries to spawn new openvpn processes, or when you try to respawn them by toggling the service.
Code:
.....
Thu Dec 10 14:42:27 2020 WARNING: Using --management on a TCP port WITHOUT passwords is STRONGLY discouraged and considered insecure
Thu Dec 10 14:42:27 2020 OpenVPN 2.4.7 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Feb 20 2019
Thu Dec 10 14:42:27 2020 library versions: OpenSSL 1.1.1d 10 Sep 2019, LZO 2.10
Thu Dec 10 14:42:27 2020 MANAGEMENT: Socket bind failed on local address [AF_INET]127.0.0.1:2205: Address already in use (errno=98)
Thu Dec 10 14:42:27 2020 Exiting due to fatal error
As you can see above, because the ports are already bound by the orphaned processes, they fail to start. This leaves the TunnelVPN service unusable. I would presume it also would affect the OpenVPN service, since it likely uses the same mechanisms and would be spawned by UVM.
In the console.log.crash file, which was generated when UVM crashes, it has the following. This seems to imply that they are using libnetcap, and I'm guessing the process is unable to keep up (buffer) and is crashing. Not sure what specifically this is tied to or how they are using this exactly, but seems to be a potential indicator.
Code:
.....
Dec 10 14:35:41 gateway uvmconsole: 12-10 14:35:41.018221| ERROR:./libnetcap/src/netcap_udp.c:187:WARNING: Mailbox Full: Dropping Packet (xxx.xxx.xx.88:16393 -> xxx.xxx.xxx.55:16393)
Dec 10 14:35:41 gateway uvmconsole: [93004.388s][info][gc] GC(12601) Pause Full (System.gc()) 91M->80M(115M) 278.974ms
Dec 10 14:35:41 gateway uvmconsole: 12-10 14:35:41.041086| ERROR:./libnetcap/src/netcap_udp.c:187:WARNING: Mailbox Full: Dropping Packet (xxx.xxx.xx.88:16393 -> xxx.xxx.xxx.55:16393)
Dec 10 14:35:42 gateway uvmconsole: [93005.852s][info][gc] GC(12602) Pause Young (Normal) (G1 Evacuation Pause) 96M->80M(115M) 4.452ms
Dec 10 14:35:45 gateway uvmconsole: [93009.312s][info][gc] GC(12603) Pause Young (Normal) (G1 Evacuation Pause) 95M->81M(115M) 4.313ms
.....
Dec 10 14:37:32 gateway uvmconsole: [93116.254s][info][gc] GC(12646) Pause Young (Normal) (G1 Evacuation Pause) 97M->85M(115M) 4.155ms
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # A fatal error has been detected by the Java Runtime Environment:
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # SIGSEGV (0xb) at pc=0x00007efe23516a00, pid=60824, tid=60869
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # JRE version: OpenJDK Runtime Environment (11.0.8+10) (build 11.0.8+10-post-Debian-1deb10u1)
Dec 10 14:37:35 gateway uvmconsole: # Java VM: OpenJDK 64-Bit Server VM (11.0.8+10-post-Debian-1deb10u1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
Dec 10 14:37:35 gateway uvmconsole: # Problematic frame:
Dec 10 14:37:35 gateway uvmconsole: # C [libpthread.so.0+0xca00] pthread_rwlock_rdlock+0x0
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # Core dump will be written. Default location: /tmp/core
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # An error report file with more information is saved as:
Dec 10 14:37:35 gateway uvmconsole: # /tmp/hs_err_pid60824.log
Dec 10 14:37:35 gateway uvmconsole: #
Dec 10 14:37:35 gateway uvmconsole: # If you would like to submit a bug report, please visit:
Dec 10 14:37:35 gateway uvmconsole: # https://bugs.debian.org/openjdk-11
Dec 10 14:37:35 gateway uvmconsole: #
From the above, you can see it created a hs_err_pid60824.log file. Inside, there is a lot of information, but I'll post the more relevant part only.
Code:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007efe23516a00, pid=60824, tid=60869
#
# JRE version: OpenJDK Runtime Environment (11.0.8+10) (build 11.0.8+10-post-Debian-1deb10u1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.8+10-post-Debian-1deb10u1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C [libpthread.so.0+0xca00] pthread_rwlock_rdlock+0x0
#
# Core dump will be written. Default location: /tmp/core
#
# If you would like to submit a bug report, please visit:
# https://bugs.debian.org/openjdk-11
#
--------------- S U M M A R Y ------------
Command Line: -Xmx4023m -Xms32m -Xss228k -XX:+PerfDisableSharedMem -XX:+UseG1GC -XX:MaxHeapFreeRatio=30 -XX:MinHeapFreeRatio=20 -XX:InitiatingHeapOccupancyPercent=30 -Xlog:gc*,gc+heap=debug:file=/var/log/uvm/gc.log:time,tags -verbose:gc
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true -Dcom.sun.jndi.ldap.connect.pool.protocol=' plain ssl ' -Dcom.sun.jndi.ldap.connect.pool.timeout=30000 -Dprefix= -Djava.library.path=/usr/lib/uvm -Dnetcap.numthreads=25 -Drepo
rts.max_queue_len=1000000 -Dnetworkaddress.cache.ttl=30 -Dsun.net.inetaddr.ttl=30 -Dnetworkaddress.cache.negative.ttl=10 -Djava.net.preferIPv4Stack=true com.untangle.uvm.Main
Host: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz, 4 cores, 7G, Debian GNU/Linux 10 (buster)
Time: Thu Dec 10 14:37:35 2020 CST elapsed time: 93118 seconds (1d 1h 51m 58s)
--------------- T H R E A D ---------------
Current thread is native thread
Stack: [0x00007efdf0110000,0x00007efdf014f000], sp=0x00007efdf0141d88, free space=199k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libpthread.so.0+0xca00] pthread_rwlock_rdlock+0x0
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000130
Register to memory mapping:
RAX=
[error occurred during error reporting (printing register info), id 0xb, SIGSEGV (0xb) at pc=0x00007efe22edac08]
Registers:
RAX=0x0000000000000100, RBX=0x00007efda0071d40, RCX=0x0000000000000000, RDX=0x0000000000000001
RSP=0x00007efdf0141d88, RBP=0x00007efda0067920, RSI=0x0000000000000000, RDI=0x0000000000000118
R8 =0x00007efda0067960, R9 =0x0000000000000000, R10=0x0000000000000000, R11=0x0000000000000246
R12=0x0000000000000001, R13=0x00007efdf0b05951, R14=0x00007efd9409d400, R15=0x0000000000020102
RIP=0x00007efe23516a00, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
TRAPNO=0x000000000000000e
Top of Stack: (sp=0x00007efdf0141d88)
0x00007efdf0141d88: 00007efdf0ae1174 00007efd940724f0
0x00007efdf0141d98: 00007efda0071d40 00007efda0067920
..........
As you can see, their JVM process com.untangle.uvm.Main is crashing due to a SIGSEGV. It references libpthread.so.
[root @ gateway] /var/log/uvm # addr2line -e /usr/lib/x86_64-linux-gnu/libpthread.so -C -f 0xca00
__GI___pthread_rwlock_rdlock
??:?
The library they are using is libpthread.so that it is crashing from, which may be due to an access violation or a stack overflow. Suspect it is a stack overflow, but it's not my code.
There are no resource issues on the system. It is idle most of the time, and has 8GiB RAM, of which about 4.7 is available at any one time.
Perhaps Untangle will take notice and want to address a pretty clear bug in the software. When their java code crashes, it does leave a crash dump. But analyzing a foreign program's crash dump isn't my expertise.
Any help would be sincerely appreciated.