FUJ00173153 - Peak Incident Management System Log - PC0261282


FUJ00173153

Peak Incident Management System

Call Reference PC0261282 Call Logger Deleted User -- Live Supp.Test
Release Targeted At -- HNG-X 15.31 Top Ref. WIN_ITM_OS_AGENT_CFG_1520_D001
Call Type Cloned call Priority C -- Progress restricted
Contact Deleted Contact Call Status Closed -- Build Fix Available to Call Logger
Target Date No Forcast Effort (Man Days) 0
Summary ‘The Monitoring Agent for Windows OS - Primary’ pid is using 4.7gb of memory (C:\IBM\ITM\TMAITM~1\kn
All References Type Value

DevIntRel-Director Live Supp.Test

Clone Master PC0261026

Release PEAK PC0269263

MSC 04350457573

TRIOLE for Service A16497108

Release PEAK PC0261780

DevIntRel-Director Live Supp.Test

Release PEAK PC0264301

Product Baseline WIN_ITM_OS_AGENT_CFG_1520_D001

Release PEAK PC0264293

Product Baseline WIN_ITM_OS_AGENT_CFG_1520_V001
Impact
St see User Date

Gerald Barnes 04-Aug-2017 17:49:26

SMG have suspended the saving of events because of this bug. This is a security issue.

The problem has uncovered an inefficiency in the sealer. It is repeatedly checking folders to see whether
anything needs to be done in a hard loop. It is always good practice to put a sleep of some duration if there
is nothing that needs to be done so resources will be freed to do other things. This fix should make, for
example, prosecution queries quicker than before.

Progress Narrative

Date:10-Aug-2017 15:03:19 User:David Bower

CALL PC0261282 opened

Details entered are:-

Summary: ‘The Monitoring Agent for Windows OS - Primary’ pid is using 4.7gb of memory (C:\IBM\ITM\TMAITM~1\kn
Call Type:

Call Priority:C

Target Release:HNG-X Rel. Ind.

Routed to:Live Supp.Test - David Bower

Date:02-Aug-2017 13:42:50 User:_Customer Call_

CALL PC0261026 opened

Details entered are:-

Summary: ‘The Monitoring Agent for Windows OS - Primary’ pid is using 4.7gb of memory (C:\IBM\ITM\TMAITM~1\kn
Call Type:L
Call Priority:
Target Release:HNG-X Rel. Ind.
Routed to:E[...] - Unassigned

Date/Time Raised: Aug 2 2017 12:32PM
Priority: C
Contact Name:
Contact Phone:
Originator: X
Originator's
Product Serial No:
Product Site:

Reference: A16497108


Transfer Note: Please pass to Tivoli-Dev via PEAK, thanks
Below mail was received from Michael Greene

From: Greene, Michael
Sent: Wednesday, August 02, 2017 1:
To: FC.IN.POA_SMC
Subject: S Call

Hi SMC, please raise a call for the following,

Priority : P(3)

Description : Service Name - ‘KNTCMA Primary’, ‘The Monitoring Agent for Windows OS - Primary’ pid is using 4.7gb of
memory (C:\IBM\ITM\TMAITM~1\kntcma.exe)

Please pass call to ‘POA-HNG NT Support’

Thanks

Michael Greene
FUJITSU

2017-08-02 12:32:36 [ Sahanir, Rajkumar ]

INIT : Create a new request/incident/problem/change/issue

2017-08-02 12:35:13 [ Sahanir, Rajkumar ]

zneut_en_poa : Transfer Notification

2017-08-02 12:35:13 [ Sahanir, Rajkumar ]

zneun_en_poa : Open Notification

2017-08-02 12:35:43 [ Sahanir, Rajkumar ]

zneut_en_poa : Transfer Notification

2017-08-02 12:38:28 [ Greene, Michael ]

LOG : Noticed that the pid for ‘Monitoring Agent for Windows OS - Primary’ service on LPRPARC201 was using 4.7gb of memory, (pid
C:\IBM\ITM\TMAITM~1\kntcma.exe), server has 8gb and memory was over 80% utilized. Service was stopped and started and memory has
been freed up.

C:\IBM\ITM\TMAITM~1\kntcma.exe details
File Version : 6.3.0.0
Product Version : 6.3.0.0
Size : 2.28mb
Date Modified : 14/07/2017 10:55

I'll attach log files from ‘C:\IBM\ITM\TMAITM6_x64\logs’ to PEAK.

The pid is using 623mb memory on [IRRELEVANT]

Please pass to Tivoli-Dev via PEAK to investigate, thanks.

2017-08-02 12:42:12 [ Greene, Michael ]
lnout_en_poa : Transfer Notification

Date:02-Aug-2017 13:57:01 User:Joe Harrison
Product Infrastructure -- Tivoli (version unspecified) added.

Date:02-Aug-2017 13:57:48 User:Joe Harrison
The Call record has been transferred to the team: Tivoli-Dev
Progress was delivered to Consumer

Date:02-Aug-2017 1[...]1:12 User:Michael Greene
Evidence Added - [...]

Date: 02-Aug-2017 14:14:36 User:Shaun Wood
The Call record has been assigned to the Team Member: Shaun Wood
Progress was delivered to Consumer

Date:02-Aug-2017 14:51:05 User:Shaun Wood
Target Date/Time updated: new value is 31/12/9999 00:00

[Start of Response]

I have checked platforms on LST HDCR, the ARC201 has similar issues to live. This is a 4GB machine which is at 88% memory and 55%
cpu, the kntcma.exe was using 1.9gb of memory so nearly half the memory.

I have checked other Windows 2012 platforms on LST as all Windows 2012 are running ITM OS Agent 6.3.0.6 but none have memory
usage as high as the ARC201 platform.


TEM201 193,516k
C201 50,184k
SSC201 40,400k

I have checked IBM, there is a Fix Pack 7 available i.e. 6.3.0.7 but nothing documented about memory leaks. I did find APAR
IV62549 for a memory leak issue on the Windows OS Agent but this was fixed in 6.3.0.5. It may be that 6.3.0.6 re-introduced the
issue ?

I will stop/start the ITM OS Agent on [...] LST and monitor plus I will get a PMR logged with IBM if over the next few
days we see an increase in memory usage on the Live and LST ARC201 platforms.

[End of Response]
Response code to call type L as Category 40 -- Pending -- Incident Under Investigation
Response was delivered to Consumer

Date:02-Aug-2017 14:58:53 User:Shaun Wood
ITM OS Agent restarted on [...] @ 02/08/17 14:55

Checked memory usage after this which was:

40,904
40,968
40,984
40,968
40,916
40,900

This shows a low memory usage which goes up/down as we'd expected, I will check again tomorrow.

Date:02-Aug-2017 15:22:56 User:Shaun Wood
I have asked Michael to keep an eye on the live ARC201.

Date:03-Aug-2017 15:40:43 User:Shaun Wood
We've just hit another issue with [...] as ITM OS Agent is using 5.5GB, there have been a large number of security events
generated, 1.8million since 15:04, [...] log has been overwritten

I suspect the ITM OS agent is grabbing memory to read all of the events as it does provide details of OS Log Files.

According to Michael

[03/08/2017 15:33] Greene, Michael:
I can see a lot of audit type security events against the sealer.exe - An attempt was made to access an object. then The handle
to an object was closed - @ 15:04:25, thousands of them

thats whats filled the sec event log up

probably need to relax the audit settings

Date:03-Aug-2017 15:54:49 User:Shaun Wood
Looking at the Security log there are vast amounts of Audit Success Events around 15:05 of ID 4663 and 4658 for auditsvrcomp.

The Security log goes from 15:04 to 15:48, there are 1.8 million records in 44 minutes. Of the 1,833,927 of these 1,825,830 are
Audit Success.

So based on this rate of 2.4 million security events per hour this server alone will rack up 59 million security events per day.
This is being done due to new security measure for auditing. I would question what is running which is creating so many of these
events as these are Success so this looks to be normal running which I'm guessing will only increase as we move into R16 & R17
as most systems will be audited.
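
The extrapolation in this entry can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes the rounded figure of 1.8 million records over the 44-minute window quoted above, and the function name `extrapolate_events` is hypothetical.

```python
def extrapolate_events(records: int, window_minutes: float) -> tuple[float, float]:
    """Scale an observed event count to hourly and daily rates."""
    per_hour = records / window_minutes * 60
    per_day = per_hour * 24
    return per_hour, per_day


# Rounded figures quoted in the entry: 1.8 million records in 44 minutes.
per_hour, per_day = extrapolate_events(1_800_000, 44)
print(f"{per_hour / 1e6:.2f} million events per hour")
print(f"{per_day / 1e6:.0f} million events per day")

# Share of the 1,833,927 logged records that were Audit Success.
success_share = 1_825_830 / 1_833_927
print(f"{success_share:.1%} Audit Success")
```

With these inputs the hourly rate comes out just under 2.5 million and the daily figure rounds to 59 million, in line with the numbers quoted in the entry.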

Date:03-Aug-2017 15:59:08 User:Shaun Wood
I have now stopped and disabled the ITM OS Agent so that we don't hit this issue. In order to progress this issue I need Gerald
Barnes to check the platform to explain why we are getting so many security events for Audit, is this to be expected ? If so then
we may need to consider relaxing the security auditing as this will also be creating millions of events to go into audit which
I'm sure will be 100 times more or higher than the current system. I won't raise a call with IBM at this moment as I suspect
they just advise us to reduce event loads as we don't have any issues on other platforms.

I will pass this over to the audit team.

Date:03-Aug-2017 15:59:26 User:Shaun Wood
The Call record has been transferred to the team: Audit-Dev

The Call record has been assigned to the Team Member: Gerald Barnes
Progress was delivered to Consumer

Date:03-Aug-2017 1[...]8:15 User:Gerald Barnes
[Start of Response]

I have sent an email to Dave Haywood asking whether we can stop generating these success events.

I have no reason to believe it is anything other than BAU.
[End of Response]

Response code to call type L as Category 40 -- Pending -- Incident Under Investigation
Response was delivered to Consumer

Hours spent since call received: 4 hours

Date:04-Aug-2017 10:24:58 User:Dave Haywood
Before I agree to considering relaxing event logging on the ARC servers, I would like to understand why the auditsvrcomp userid
is (I presume) opening and closing so many files over such a short period of time. The evidence doesn't seem to contain details
of which files are being accessed and why. I would like to rule out a software issue that is causing a large number of events to
be logged. Please provide some analysis of which files are being opened / closed, at what rate and why.

The events in question are:
An attempt was made to access an object - Event ID 4663
The handle to an object was closed - Event ID 4658

Please supply further analysis / evidence as requested above.
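
One way to approach the analysis requested here is to tally the 4663 events by object name and by second. The sketch below is illustrative only: it assumes the Security log has already been exported into a list of records with hypothetical field names (`time`, `event_id`, `account`, `object_name`), and the sample rows are placeholders, not data from this incident.

```python
from collections import Counter

ACCESS_ATTEMPT = 4663  # "An attempt was made to access an object"


def tally_access_events(records, account):
    """Count 4663 events per object name and per second for one account."""
    by_object = Counter()
    by_second = Counter()
    for rec in records:
        if rec["event_id"] != ACCESS_ATTEMPT or rec["account"] != account:
            continue
        by_object[rec["object_name"]] += 1
        by_second[rec["time"]] += 1
    return by_object, by_second


# Placeholder rows for illustration only, not data from this incident.
sample = [
    {"time": "00:00:01", "event_id": 4663, "account": "auditsvrcomp", "object_name": r"X:\example\folder_a"},
    {"time": "00:00:01", "event_id": 4663, "account": "auditsvrcomp", "object_name": r"X:\example\folder_a"},
    {"time": "00:00:01", "event_id": 4663, "account": "auditsvrcomp", "object_name": r"X:\example\folder_b"},
    {"time": "00:00:02", "event_id": 4658, "account": "auditsvrcomp", "object_name": ""},
]

by_object, by_second = tally_access_events(sample, "auditsvrcomp")
for name, count in by_object.most_common():
    print(name, count)
print("peak events in one second:", max(by_second.values()))
```

Sorting the per-object counts shows which folders or files dominate, and the per-second counts give the rate.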

Date:04-Aug-2017 17:44:10 User:Gerald Barnes
Product HNG-X Platforms -- Audit Server (ARC) (version:2) added.

Date:04-Aug-2017 17:49:26 User:Gerald Barnes
A new Business Impact has been added:
SMG have suspended the saving of events because of this bug. This is a security issue.

The problem has uncovered an inefficiency in the sealer. It is repeatedly checking folders to see whether anything needs to be
done in a hard loop. It is always good practice to put a sleep of some duration if there is nothing that needs to be done so
resources will be freed to do other things. This fix should make, for example, prosecution queries quicker than before.

Date:04-Aug-2017 18:08:52 User:Gerald Barnes
Development Cost updated: new cost is 2 (Man Days)
[Start of Response]

DEVELOPMENT IMPACT OF FIX:

SPECIFY THE HNG-X PLATFORMS IMPACTED:

The platform has been specified and it is the audit server.

TECHNICAL SUMMARY:

In routine RGSchedule of SealContol.c it gets into a hard loop of checking -
D:\Archiveserver\CONTROL\SEALER MODULE
D:\Archiveserver\INTERFACES\IMPORT_CAT\Data
D:\Archiveserver\CONTROL\SEALER 2 MODULE
D:\Archiveserver\INTERFACES\IMPORT_CAT\Md5

waiting for something extra to do.

This is not efficient.

It will be wasting a lot of machine resources doing this.

The code does sleep for a second or so in the loop when there is absolutely nothing to do.

It has multiple threads and the problem occurs when some threads are doing things and it is trying to decide whether to start
another one or not.

So in conclusion a sealer fix is required.
This fix will greatly reduce the number of events and make processing much more efficient at the same time!
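
The inefficiency described in this technical summary, and the general shape of the remedy, can be illustrated with a generic polling loop. This is a minimal sketch in Python, not the sealer's actual C code: `poll_folders`, `has_work`, `poll_interval` and `max_polls` are hypothetical names, and the one-second default is an assumption chosen for illustration.

```python
import time


def poll_folders(folders, has_work, poll_interval=1.0, max_polls=3, sleep=time.sleep):
    """Check each folder once per cycle, then pause before the next cycle.

    Without the pause the loop re-checks the same folders as fast as the CPU
    allows, and each check can generate audited file-access events.
    """
    checks = 0
    for _ in range(max_polls):
        for folder in folders:
            has_work(folder)  # hypothetical "is there anything to do here?" test
            checks += 1
        sleep(poll_interval)  # yield the CPU instead of spinning
    return checks


# Two folders polled for three cycles gives six checks in total.
total = poll_folders(["control", "data"], has_work=lambda f: False, poll_interval=0.01)
print(total)  # 6
```

The number of folder checks, and therefore of audit events, is bounded by the polling interval rather than by CPU speed.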
LIST OF KNOWN DIMENSIONS DESIGN PARTS AFFECTED BY THE CHANGE:

AUDIT_SERVER_APP_V2

DEPENDENCIES:

There are no dependencies.

DEPLOYMENT DETAIL:

Replacement files to be supplied during the evening backup.

DEV EFFORT IN MANDAYS:

2 man days. I have another fix to work on which may need to be done first for 16.21. We may decide to schedule this first in
which case I can start immediately.

IMPACT ON USER:

It will speed things up for SecOps though I am not sure by how much.

IMPACT ON OPERATIONS:
They will be able to harvest events again.

HAVE RELEVANT KELS BEEN CREATED OR UPDATED?

No KEL is needed from the audit team.

IMPACT ON TEST:

They need to check that gathering, ARQs and the evening robocopy works as before without filling up the event log.
RISKS (of releasing and of not releasing proposed fix):

Releasing

I cannot see any disadvantage.

Not releasing

We will continue to get flooded with these audit success events.

We will continue to needlessly keep checking the same folders hundreds of times a second when it would be sufficient to do it
once a second.

LIST OF LIKELY DELIVERABLES:
sealer.exe definitely

We may decide to make the sleep configurable so as to fine tune the fix later.
In this case additionally -

archive.exe
configDLL.dll
Deleter.exe
Gatherer.exe
Messages.dll
Retriever.exe
sealer.exe
Bootle\ConfigurationFile.txt
Wigan\ConfigurationFile.txt
Wigan\ConfigurationFile DR.txt

[End of Response]
Response code to call type L as Category 55 -- Pending -- Live Fix Impact Supplied
Response was delivered to Consumer

Hours spent since call received: 7 hours

Date:04-Aug-2017 18:10:17 User:Gerald Barnes
The call Target Release has been moved to Proposed For -- HNG-X 15.21

Date:04-Aug-2017 18:10:46 User:Gerald Barnes
Action placed on Team:BIF

Date:07-Aug-2017 10:37:33 User:Jubita Gurung
The call Target Release has been moved to Targeted At -- HNG-X 15.20

Date:07-Aug-2017 10:58:57 User:Jubita Gurung
BIF approved and targeted at 15.20

Date:07-Aug-2017 10:59:01 User:Jubita Gurung
Action has been removed from the call

Date:07-Aug-2017 11:28:12 User:Shaun Wood
After discussing this issue with John Bradley I have raised PMR Ref 16388,019,866 with IBM as we don't think their agents should
be utilising so much memory and need to know if there is a way of disabling checking event logs as we have Netcool monitoring the
Windows event logs.

Date:07-Aug-2017 11:54:30 User:Gerald Barnes
Reference Added: Jira POA-2216

Date:08-Aug-2017 [...]0:01 User:Dimensions Automated User
Reference Added: Product Baseline AUDIT_SERVER_APP_V2_1520_V019
Reference Added: Product Baseline AUDIT_SERVER_APP_V2_1520_V019-V009

Date:08-Aug-2017 19:18:30 User:Gerald Barnes
[Start of Response]
Partially fixed by version 15.20.0.5 of sealer.exe.

If the sealer is not busy then you will get 259,200 of these success events per day.
This would increase to a maximum of 10 times this number if the sealer was very busy all the time which would never be the case.
So even in the very worst case there will be far less than 59 million security events a day.
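
The figures in this response can be cross-checked with simple arithmetic. In the sketch below, the observation that 259,200 events per day equals an average of 3 events per second is an inference from the numbers, not something stated in the call; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

idle_per_day = 259_200                  # quoted for a sealer that is not busy
worst_case_per_day = idle_per_day * 10  # quoted upper bound for a very busy sealer
pre_fix_per_day = 59_000_000            # earlier extrapolation in this call

# Average events per second implied by the idle figure.
print(idle_per_day / SECONDS_PER_DAY)   # 3.0

# Worst case after the fix, compared with the pre-fix extrapolation.
print(worst_case_per_day)               # 2592000
print(round(pre_fix_per_day / worst_case_per_day, 1))
```

Even the stated worst case is more than twenty times lower than the earlier 59 million per day estimate.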

[End of Response]

Response code to call type L as Category 46 -- Pending -- Product Error Fixed

Response was delivered to Consumer
Hours spent since call received: 15 hours

Date:08-Aug-2017 19:18:35 User:Gerald Barnes
Defect cause updated to 14: Development - Code

Date:08-Aug-2017 19:18:48 User:Gerald Barnes
The Call record has been transferred to the team: Dev-Int-Rel
Progress was delivered to Consumer

Date:09-Aug-2017 08:30:01 User:Dimensions Automated User
Reference Added: Product Baseline AUDIT_SERVER_APP_V2_1520_D019-D009

Date:09-Aug-2017 12:03:20 User:PIT Automated User
[Start of Response]
Peak 0261026 handled by integration auto handler

The following baselines attached to this peak have the targeting flags set:
AUDIT_SERVER_APP_V2_1520_D019-D009 FOR (LIVE:YES TEST:YES RDT:YES) Integrator: Geoff Inglis

These baselines have completed integration testing, moving to holding stack awaiting peak ejection.
[End of Response]

Response code to call type L as Category 47 (Fix Processed by PIT)

The incident has been transferred to the Team: Int-Rel

Progress was delivered to Consumer

Date:09-Aug-2017 12:05:53 User:PIT Automated User
[Start of Response]
## AUTOMATED UPDATE - INTEGRATION PEAK BOT ##

Fix processed by integration, routing to dev-int-rel director...

PLEASE NOTE: If this fix has failed, to send this peak back to integration it MUST have the response code Fix Failed or Response
Rejected on it, otherwise the peak will bounce.

[End of Response]

Response code to call type L as Category 49 (Fix Available for IndependentTest)

The incident has been transferred to the Team: Live Supp-Test

Progress was delivered to Consumer

Date:09-Aug-2017 15:34:57 User:Victoria Griffin
Reference Added: Release PEAK PC0261232

Date:10-Aug-2017 14:07:37 User:Shaun Wood
I need to get this call cloned so that I can test / change the ITM OS Agent as per advice from IBM.

Date:10-Aug-2017 15:03:19 User:David Bower
Call cloned from original call:PC0261026 by User:David Bower

Date:10-Aug-2017 15:04:30 User:David Bower
The Call record has been assigned to the Team Member: David Bower

Date:10-Aug-2017 15:05:05 User:David Bower
The Call record has been transferred to the team: Tivoli-Dev

The Call record has been assigned to the Team Member: Shaun Wood

Date:10-Aug-2017 16:55:16 User:Shaun Wood
This call will be used to progress the changes provided by IBM, I will raise an Emergency MSC to make the changes on [...]
tomorrow to test as this platform still has the issue and so will prove if the IBM changes are successful.

Date:11-Aug-2017 [...]
Reference Added: [...]

Date:11-Aug-2017 [...]9:54 User:Shaun Wood
Target Date/Time updated: new value is 31/12/9999 00:00
[Start of Response]

MSC raised to update KNTENV file on [...] to address memory issues. Once this has been implemented we will then need to
monitor for a few days to confirm this has addressed the issue. A formal fix will then be delivered.

[End of Response]

Response code to call type C as Category 41 -- Pending -- Product Error Diagnosed

Date:11-Aug-2017 15:22:21 User:Shaun Wood
MSC has been implemented, the ITM OS Agent has started and is running fine. I have monitored memory for 5 mins, this has stayed
fairly static around 41,000k. I will inform NT and then check the server again next week.

Date:14-Aug-2017 10:22:16 User:Shaun Wood

I have just checked the ITM OS Agent on [...], the memory usage is at 42,656k which looks fine as there have been millions
of events so the agent is no longer consuming memory like this did prior to the changes. I will continue to monitor for the rest
of this week, if all still looks fine I will get a formal release sorted.

Date:14-Aug-2017 1[...]29:20 User:Shaun Wood
[Start of Response]

I will action QFP as I'm not sure what target release I should use for this Peak, Gerald has delivered his fix at R15.20 which I
guess now needs to go through LST as a hot fix, this ITM OS Agent change also needs to do the same so R15.20 also ? The
WIN_ITM_OSAGENT_V001 was delivered at R15.20 so I'd just need a V002-V001 incremental.

[End of Response]

Response code to call type C as Category 40 -- Pending -- Incident Under Investigation

Date:14-Aug-2017 10:29:38 User:Shaun Wood
Action placed on Team:QFP Forum

Date:17-Aug-2017 17:44:30 User:Shaun Wood
[Start of Response]

I have just checked the ITM OS Agent on [IRRELEVANT], the memory usage is at 44,156k which confirms that we no longer have an issue
so I now need to deliver a formal fix. QFP will need to sanction this and target, I'll propose R15.20 as Gerald has delivered his
fix at this release.

[End of Response]

Response code to call type C as Category 41 -- Pending -- Product Error Diagnosed

Date:17-Aug-2017 17:44:45 User:Shaun Wood
The call Target Release has been moved to Proposed For -- HNG-X 15.20

Date:18-Aug-2017 09:03:52 User:Nick Lawman
The call Target Release has been moved to Targeted At -- HNG-X 15.20

Date:21-Aug-2017 12:37:08 User:Shaun Wood
Action has been removed from the call

Date:23-Aug-2017 13:50:02 User:Dimensions Automated User
Reference Added: Product Baseline WIN_ITM_OS_AGENT_CFG_1520_V001

Date:23-Aug-2017 13:[...]2:13 User:Shaun Wood
[Start of Response]
New ITM OS Agent config product released to amend agent values to address memory issues. This now needs to be installed onto all
Windows 2012 Servers as a top-up to address this issue which has been tested on the live platform.

[End of Response]

Response code to call type C as Category 48 -- Pending -- Fix Released to PIT

Date:23-Aug-2017 1[...]2:19 User:Shaun Wood
The Call record has been transferred to the team: Dev-Int-Rel

Date:23-Aug-2017 14:25:01 User:Dimensions Automated User
Reference Added: Product Baseline WIN_ITM_OS_AGENT_CFG_1520_D001

Date:24-Aug-2017 14:41:59 User:Sarah Payne
The call Target Release has been moved to Targeted At -- HNG-X 15.31

Date:24-Aug-2017 14:42:30 User:Sarah Payne
Peak re-targeted to R15.31 as LST have signed off R15.21.

Date:24-Aug-2017 [...]11 User:Karen Cooper
Reference Added: [...] PC[...]

Date:30-Aug-2017 11:48:46 User:Vijesh Pandya
The Call record has been transferred to the team: Live Supp.Test

Date:04-Sep-2017 14:16:05 User:Mark Ascott
The Call record has been assigned to the Team Member: David Bower

Date:26-Oct-2017 15:53:03 User:David Bower
[Start of Response]

Baseline [...] all LST win 2012 servers and no issues encountered. This is a top up for changes that were tested by Shaun
Woods on [IRRELEVANT]. This has passed LST testing.

[End of Response]

Response code to call type C as Category 61 -- Final -- Build Fix Available to Call Logger

Routing to Call Logger following Final Progress update.

Date:26-Oct-2017 15:5[...]1 User:David Bower
CALL PC0261282 closed: Category 61 Type C

Date:14-Nov-2017 15:36:48 User:Victoria Griffin
a Spe

Reference Added: Release BE02 6425:

Date:14-Nov-2017 16:58:50 User:Victoria Griffin
Reference Added:

Date:17-Apr-2018 16:02:05 User:Jubita Gurung
Reference Added: Release PEAK PC0269263

Root Cause Development - Code

Logger Deleted User -- Live Supp.Test

Subject Product General/Other/Misc -- Unknown (version unspecified)
Assignee Deleted User -- Live Supp.Test

Last Progress 17-Apr-2018 16:02 -- Jubita Gurung