[AES6.3.x] TSAPI process suddenly down with core file

Doc ID		SOLN310530
Version:		3.0
Status:		Published
Published date:		08 Feb 2019
Created Date:		02 Jun 2017

Author:

kyoshitake

Details

TSAPI client users suddenly disconnected. According to messages log, tsrv process had down unexpectedly with core file.

Version:
AES 6.3.0.0.212-0

Problem timing:
Around 16:58 May 8

A part of messages:
May 8 16:58:39 node2-AES kernel: INFO: task tsrv:19405 blocked for more than 120 seconds.
May 8 16:58:39 node2-AES kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 8 16:58:39 node2-AES kernel: tsrv          D 019E8E9D 2348 19405 19402         19425       (NOTLB)
May 8 16:58:39 node2-AES kernel:        ca604e58 00000286 a6037ba1 019e8e9d a6038edc 00000009 ec225aa0 a60388db
May 8 16:58:39 node2-AES kernel:        019e8e9d 00000d3a ec225bac c201bcac ecc08dc0 0000000f ffffffff 00000003
May 8 16:58:39 node2-AES kernel:        00000002 c2016168 00000000 00000001 00000000 00000000 ecc08f6c ecc08f68
May 8 16:58:39 node2-AES kernel: Call Trace:
May 8 16:58:39 node2-AESAES4 kernel: [<c0627717>] wait_for_completion+0x6f/0x97
May 8 16:58:39 node2-AES kernel: [<c041ae3e>] default_wake_function+0x0/0xc
May 8 16:58:39 node2-AES kernel: [<c0421fec>] exit_mm+0x69/0xb9
May 8 16:58:39 node2-AES kernel: [<c0423507>] do_exit+0x234/0x7c8
May 8 16:58:39 node2-AES kernel: [<c0423b17>] sys_exit_group+0x0/0xd
May 8 16:58:39 node2-AES kernel: [<c042bd77>] get_signal_to_deliver+0x3b2/0x3dd
May 8 16:58:39 node2-AES kernel: [<c0404a7e>] do_notify_resume+0x77/0x68e
May 8 16:58:39 node2-AES kernel: [<c041ae3e>] default_wake_function+0x0/0xc
May 8 16:58:39 node2-AES kernel: [<c0445795>] audit_filter_syscall+0x79/0x98
May 8 16:58:39 node2-AES kernel: [<c044675c>] audit_syscall_exit+0x2d4/0x2ea
May 8 16:58:39 node2-AES kernel: [<c040550d>] work_notifysig+0x13/0x1a
May 8 16:58:39 node2-AES kernel: =======================

~~~~~~
May 8 17:05:48 node2-AES -bash: HISTORY: PPID=5623 PID=5624 UID=500 uptime
May 8 17:06:55 node2-AES -bash: HISTORY: PPID=5623 PID=5624 UID=500 cat /etc/hosts
May 8 17:07:09 node2-AES -bash: HISTORY: PPID=5623 PID=5624 UID=500 ssh [email protected]
May 8 17:07:58 node2-AES -bash: HISTORY: PPID=5623 PID=5624 UID=500 exit
May 8 17:09:26 node2-AES maintenanceMng[7122]: TSAPIMonitor: avAESServiceDown: life-cycle detected service TSAPI is DEAD
May 8 17:09:42 node2-AES sudo: avaya : TTY=unknown ; PWD=/var/log/avaya/aes ; USER=root ; COMMAND=/opt/mvap/bin/tsrv
May 8 17:09:43 node2-AES maintenanceMng[7282]: TSAPIMonitor: avAESServiceUp: Service TSAPI started

Problem Clarification

I found similar issue below, and compared core dump files, but does not match with thats case.

AES 6.3.0/6.3.1: TSAPI Service Crashing Creating Core Dumps

https://support.avaya.com/ext/index?page=content&id=SOLN240265&group=UG_INTERNAL

So, this case needed escalation to AES engineer.

I collected getlogs, and /var/log/sa. <<< Files under this directory includes record of memory usage, so you need to collect it. It will be kept about 9 days.

Regarding TSAPI trace, we had not enabled it for this AES.

Cause

TSAPI process down could be caused by TSAPI memory leak.

When the issue happens, the swapped space has increased a lot. In fact, before the issue happens, swapped usage at 50% is unusual. So it looks like the AES is starved of memory.

A part of sar file under /var/log/sa :

00:00:01    kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
17:00:02        58364   2038972     97.22       776    307336   1048268   1048172     50.00    542852
17:10:01      1413660    683676     32.60      5324    251492   1609764    486676     23.21    101436

* "%memused " has been always used over 90 % since total memory size is only 2 giga-byte. We think it is not problem.
But %swpused should not getting bigger gradually such as this case.

Comment from T4 and R&D:

With the core dump alone without TSAPI trace log, the analysis is partly based on educated guess and is never 100% guarantee to pinpoint the triggering condition. Besides, this issue is ad-hoc and happened only once.

The developer did take a look at the core dump stack trace and suspect possible lack of TSAPI memory causing the core dump.

Solution

Since there is known memory leak issue in the customer system, we want the customer to

1)      Upgrade to AES 6.3.3 SP7
2)      Apply the TSAPI hot-patch (It is for SP7. Need escalation to Avaya AES engineer to obtain hotfix)
3)      Enable TSAPI trace log and monitor the situation (for recurrence of issue)

(NOTE: As of January 2019, SP11 is the latest GA patch).

Attachment Description

This file is the decorded core file of our case. Detailed of content is not match with core file of SOLN240265.

A part of core file:
******************************************************************************

warning: Can't read pathname for load map: Input/output error.
Core was generated by `/opt/mvap/bin/tsrv'.
Program terminated with signal 11, Segmentation fault.
#0 0x0038ef7b in TsapiMemoryMapEntry::setNext (this=0x0, offset=4294967295) at ../TsapiMemoryMapEntry.cpp:98
98      ../TsapiMemoryMapEntry.cpp: No such file or directory.
        in ../TsapiMemoryMapEntry.cpp
(gdb) bt
#0 0x0038ef7b in TsapiMemoryMapEntry::setNext (this=0x0, offset=4294967295) at ../TsapiMemoryMapEntry.cpp:98
#1 0x00391d4b in TsapiMemoryMapEntryList::remove (this=0x496004, entry=0x499594) at ../TsapiMemoryMapEntryList.cpp:522
#2 0x0808bbcd in TsapiMemoryMap<TsapiClosedClientConnectionMemoryMapEntry>::freeMapEntry (this=0x91c763c, entry=0x499594)
    at ../TsapiMemoryMap.cpp:369

Attachment File

coredecoded.txt
13K • < 1 minute @ 56k, < 1 minute @ broadband

Additional Relevant Phrases

tsrv TSAPI AES6.3 memory leak down core dump

Avaya -- Proprietary. Use pursuant to the terms of your signed agreement or Avaya policy