FUJ00075544
Peak Incident Management System
Call Reference PC0058994 Call Logger Deleted User -- UK Bridge Team
Release Targeted At -- Horizon Future Unspecified Top Ref PC0057957
Call Type Cloned call Priority C -- Progress restricted
Contact EDSC Call Status Closed -- Administrative Response
Target Date 07/12/2000 Effort (Man Days) 0
Summary Copy PC0057957 FAD260801 - Timeout occurred waitin
Progress Narrative
Date:16-Nov-2000 14:04:00 User:_Customer Call_
CALL PC0057957 opened
CALL PC0057957:Priority B:CallType L - Target 21/11/00 14:04:26
16/11/00 13:24 @00.01 16-11-00 A critical event was registered on
l#26080100101 Stating: An unexpected error occurred while attempting to
modify an entry in the run map. Timeout occurred waiting for lock.
(0xC1090003). KEL Reference: JBallantyne5245K.htm
16/11/00 13:25 SMCtemp3
information: @00.01 16-11-00 A critical event was registered on
l#26080100101 Stating: An unexpected error occurred while attempting
to modify an entry in the run map. Timeout occurred waiting
for lock. (0xC1090003). KEL Reference:
JBallantyne5245K.htm. An event log will be downloaded for onward transfer to
SSC.
16/11/00 14:03 SMCtemp3
information: The event log has been downloaded. The file ID is 67964,
this shows all events on the counter. Please peruse this and
investigate the events.
F} Call details
Diagnostician nam
Customer opened date 16/11/2000 13:24:42
Date:16-Nov-2000 14:13:00 User:Barbara Longley
The call summary has been changed from:
@00.01 16-11-00 A critical event was registered o
The call summary is now:
FAD260801 - Timeout occurred waiting for lock.
Target Release updated to CSR-CL4R
CALL PC0057957:Priority B:CallType N - Target 21/11/00 14:04:26
Product Infrastructure KMS added
Date:23-Nov-2000 10:15:00 User:Patrick Carroll
F} Response :
PRESCAN,
[END OF REFERENCE 23160785]
Responded to call type N as Category 40 -Incident Under Investigation
The response was delivered to: PowerHelp
The Call record has been assigned to the Team Member: John Ballantyne
Defect cause updated to 99:General - Unknown
Hours spent since call received: 0 hours
Date:23-Nov-2000 11:10:00 User:John Ballantyne
F} Response :
This event was reported in PC0056922. This call has been closed, but the
comments from Mark Jarosz were that if calls of this nature were > 1 per
month then further investigation should be carried out. In this case I
presume that archiving was processing and there was still an outstanding lock
on the run table. I presume that the reload of Riposte at cleardesk will
release the locks. Investigating frequency of event in the estate.
[END OF REFERENCE 23163800]
Responded to call type N as Category 40 -Incident Under Investigation
The response was delivered to: PowerHelp
Date:23-Nov-2000 11:45:00 User:John Ballantyne
New evidence added - Text message store Audit/Event logs
F} Response :
This event has some 129 counters reporting this, and also MBOCOR02 and
MBOCOR03 have reported this event, although it may be expected on the Corr
servers. I think this needs investigating. Please state what evidence is
required. Will attach Eventlog/message store & audit logs for this outlet.
[END OF REFERENCE 23165836]
Responded to call type N as Category 40 -Incident Under Investigation
The response was delivered to: PowerHelp
The Call record has been transferred to the Team: QFP
Defect cause updated to 41:General - in Procedure
Hours spent since call received: 0 hours
Date:23-Nov-2000 13:17:00 User:Lionel Higman
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours
Date:24-Nov-2000 ??:?9:00 User:Tara Mills
F} Response :
The Call record has been assigned to the Team Member: Gareth Jenkins
[END OF REFERENCE 23212636]
Responded to call type N as Category 40 -Incident Under Investigation
The response was delivered to: PowerHelp
Date:24-Nov-2000 11:49:00 User:Gareth Jenkins
The Call record has been transferred to the Team: Escher-Dev
Hours spent since call received: .5 hours
The Call record has been assigned to the Team Member: Mark Jarosz
Hours spent since call received: 0 hours
Date:30-Nov-2000 16:12:00 User:Lionel Higman
CALL PC0058994:Priority B:CallType C - Target 05/12/00 16:12:39
Call PC0058994 cloned from original call PC0057957
Date:30-Nov-2000 16:13:00 User:Lionel Higman
QFP agreed change.
Target Release updated to DTL - unknown
CALL PC0058994:Priority C:CallType C - Target 07/12/00 16:12:39
The call references have been updated. They are now:-
Copy From : PC0057957
Other : Escher-Dev
Date:30-Nov-2000 16:14:00 User:Lionel Higman
The Call record has been transferred to the Team: Futures
Hours spent since call received: 0 hours
Date:05-Dec-2000 17:25:00 User:Mark Jarosz
I have discussed this error event at length with Escher and the current view
is that
(a) The timeout being reported is benign in the sense that it should not
cause any corruption of the message store.
(b) Had the operation that was impacted by this timeout been an internal
message server one, for example an index maintenance thread, then there
should have been further error events logged. Therefore we suspect that an
error is being returned by the Riposte API which is not being trapped by an
application, in this case presumably an LFS Agent, since 10 seconds prior to the
event it created some messages.
Therefore in order to progress finding the cause of this error event I would
recommend that:
(1) The relevant LFS agent code is checked to ensure that all API failures
are both reported via the event log and handled correctly. If this is not the
case then the relevant changes should be made.
(2) In order to reproduce the problem I need a detailed description of the
actions taken by the LFS agent. Note as per point (1) this is based on the
assumption that the LFS Agent provoked the problem in Riposte.
Please email this to me and I will progress.
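Recommendation (1) can be illustrated with a minimal sketch. This is not the Riposte API or the LFS agent code: the (status, result) return convention and the names checked_call, ApiFailure and SUCCESS are assumptions made up for illustration only. The sketch shows a wrapper that never lets a failure status pass silently: it writes the failure to a log and raises, so the caller has to handle it.

```python
import logging

# Hypothetical logger standing in for the counter event log.
logger = logging.getLogger("agent_sketch")

SUCCESS = 0  # assumed "no error" status for this illustration


class ApiFailure(Exception):
    """Raised when a wrapped call returns a non-success status."""

    def __init__(self, operation, status):
        super().__init__(f"{operation} failed with status 0x{status:08X}")
        self.operation = operation
        self.status = status


def checked_call(operation, call, *args):
    """Run call(*args), which is assumed to return (status, result).

    A non-success status is logged and then raised, so it cannot be
    ignored by the calling code.
    """
    status, result = call(*args)
    if status != SUCCESS:
        logger.error("%s returned status 0x%08X", operation, status)
        raise ApiFailure(operation, status)
    return result
```

A caller would wrap each API call, for example checked_call("write message", some_write_function, payload), and decide in one place how a raised ApiFailure is handled.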
Date:12-Dec-2000 12:58:00 User:Lionel Higman
The Call record has been transferred to the Team: QFP
Hours spent since call received: 0 hours
Date:12-Dec-2000 12:59:00 User:Lionel Higman
The Call record has been assigned to the Team Member: Rex Dixon
Hours spent since call received: 0 hours
Date:12-Dec-2000 16:50:00 User:Rex Dixon
The Call record has been transferred to the Team: TSC-Dev
Hours spent since call received: 0 hours
Date:13-Dec-2000 16:40:00 User:Les Andrew
The problem is with the LFSEndOfDay agent on the counter. This agent is not
one provided by the Agent team. Having looked at the Riposte messages written
around the time of the Event log timeout messages (Thu Nov 16 00:01:56) they
are all written by the LFSEndOfDay agent and there are 2 transactions
(<TranStartNum:25835> started <Time:00:01:46> and <TranStartNum:25838>
started <Time:00:02:38>) straddling the timeout messages. This timeout
message is reported when one program has a transaction outstanding for a long
time whilst another program is trying to write to the same node. Transactions
should be kept as short as possible. It appears that the LFSEndOfDay agent
has a transaction open for a long time, which is causing the KMRX and
c_HV_POACK agents to have the timeout error. See 58994 extracts.txt.
Les Andrew
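The general mechanism described in this entry (one writer holding a lock for a long time while another gives up waiting) can be sketched as follows. This only illustrates the hypothesis as stated, not Riposte's actual locking: a Python threading.Lock stands in for the lock on the node, and the names hold_lock, try_write and simulate are hypothetical.

```python
import threading
import time


def hold_lock(lock, hold_s, started):
    """Simulate a writer that keeps a 'transaction' open for hold_s seconds."""
    with lock:
        started.set()       # signal that the lock is now held
        time.sleep(hold_s)  # the long-running work inside the transaction


def try_write(lock, timeout_s):
    """Try to take the lock within timeout_s seconds.

    Returns True if the write could proceed, False on timeout.
    """
    acquired = lock.acquire(timeout=timeout_s)
    if acquired:
        lock.release()
    return acquired


def simulate(hold_s, timeout_s):
    """Return whether a second writer gets the lock before its timeout."""
    lock = threading.Lock()
    started = threading.Event()
    holder = threading.Thread(target=hold_lock, args=(lock, hold_s, started))
    holder.start()
    started.wait()  # make sure the first writer really holds the lock
    result = try_write(lock, timeout_s)
    holder.join()
    return result
```

With these assumptions, simulate(0.5, 0.05) models a long transaction and a waiter that times out, while simulate(0.05, 1.0) models a short transaction that lets the waiter through.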
Date:13-Dec-2000 16:41:00 User:Les Andrew
New evidence added - extracts of messages and events
The Call record has been transferred to the Team: LFS-Ctr-Dev
Hours spent since call received: 2 hours
Date:14-Dec-2000 10:28:00 User:Deleted User (David McDonnell feb01)
F} Response :
LFSEndOfDay writes only messages and never opens/uses transactions. When a
message is too large to be written it is written as a BLOB. It seems that
upon the request to write a BLOB Riposte opens a transaction in order to
bundle all the fragments into a single commit. It is here that the locking
and timeout issue is occurring. I am not aware of any processing that we
should be doing in order to expedite this process as the control is down
with Riposte at this point.
In this example the BLOB is not very large at all, only 3 or 4 fragments.
I have discussed this with Les Andrew and this needs investigating by Escher.
Passing to Gareth in QFP for routing.
[END OF REFERENCE 23761724]
Responded to call type C as Category 40 -Incident Under Investigation
The response was delivered on the system
The Call record has been transferred to the Team: QFP
Defect cause updated to 99:General - Unknown
Hours spent since call received: 4 hours
Date:14-Dec-2000 11:37:00 User:Del (12/01 Vin Patel)
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours
Date:15-Dec-2000 08:06:00 User:Gareth Jenkins
The Call record has been transferred to the Team: Escher-Dev
Hours spent since call received: 0 hours
Date:15-Dec-2000 08:07:00 User:Gareth Jenkins
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours
Date:18-Dec-2000 11:53:00 User:Gareth Jenkins
I think that Les might have confused things with his comments about Riposte
transactions. Looking at the message store there are 2 separate Riposte
transactions and this is nothing to do with transaction locks.
Can the LFS team please answer Mark's original question, namely, does their
code check all Riposte API failures and report errors to the event log if
unexpected failures are received? If not, it should be changed to do so.
Also, is it possible to tell (for example by looking at the message store) if
LFS has behaved as expected (i.e. has it missed out some processing as a result
of this error)?
Please give me a ring if you need to discuss this further.
Gareth
The Call record has been transferred to the Team: LFS-Ctr-Dev
Hours spent since call received: .1 hours
Date:18-Dec-2000 15:13:00 User:Deleted User (David McDonnell feb01)
F} Response :
The event log shows that this error was around the time that LFSEndOfDay was
reading/writing blobs where the LFS-Riposteblobapi code does check all return
codes.
I can confirm that the LFS processing has completed successfully with all the
evidence in the message store showing a complete and successful run with
nothing missing and no errors.
The timeout is not occurring in and is not reported by LFS and it is not LFS
that is failing.
Please can the TDA look into the application that has reported the
timeout and find out what it is waiting for.
Please can the TDA also answer the question as to whether Riposte is using
a transaction to commit a set of attachments?
Also, given Gareth's concerns regarding error handling I think the Program now
has a serious issue in other desktop applications as they do not check
[END OF REFERENCE 23813655]
Responded to call type C as Category 40 -Incident Under Investigation
The response was delivered on the system
The Call record has been transferred to the Team: QFP
Hours spent since call received: 2 hours
Date:18-Dec-2000 16:11:00 User:Del (05/01 John McLean)
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours
Date:19-Dec-2000 1?:?6:00 User:Deleted User (David McDonnell feb01)
F} Response :
Vin & I have discussed this with Mark.
It can be seen from the evidence file that there are at least 4 applications
reporting an unexpected error - Timeout on lock.
This shows that it is not LFS that is experiencing the problem and that the
applications that are, are reporting it accordingly.
Mark is going to look into this further along these lines.
[END OF REFERENCE 23827821]
Responded to call type C as Category 40 -Incident Under Investigation
The response was delivered on the system
Date:19-Dec-2000 12:14:00 User:Gareth Jenkins
The Call record has been transferred to the Team: Escher-Dev
Hours spent since call received: 1 hours
Date:19-Dec-2000 12:15:00 User:Gareth Jenkins
The Call record has been assigned to the Team Member: Mark Jarosz
Hours spent since call received: 0 hours
Date:20-Feb-2001 16:31:00 User:Lionel Higman
Target Release updated to CI4S10
Date:13-Jun-2001 08:57:00 User:Lionel Higman
Bringing Target Release into line with that set for this call in the Release
Management Database.
Target Release updated to CI4S10R
Date:14-Jun-2001 08:08:00 User:Lionel Higman
The call references have been updated. They are now:-
Copy From : PC0057957
Date:07-Aug-2001 13:27:00 User:Lionel Higman
Target Release respecified during Escher-Dev PinICL Review.
Target Release updated to Future Unspecified
Date:10-Jan-2002 09:58:00 User:Mark Jarosz
F} Response :
Responded to call type C as Category 94 -Advice and guidance given
Hours spent since call received: 6 hours
Defect cause updated to 42:Gen - Outside Pathway Control
The response was delivered on the system
Date:10-Jan-2002 16:57:00 User:Lionel Higman
Please see (hidden) update from Mark Jarosz. If you are happy to close return
to me and I will do so.
The Call record has been transferred to the Team: EDSC
Hours spent since call received: 0 hours
Date:10-Jan-2002 17:06:00 User:Barbara Longley
The Call record has been assigned to the Team Member: John Ballantyne
Hours spent since call received: 0 hours
Date:11-Jan-2002 0?:?7:00 User:John Ballantyne
F} Response :
I concur with Mark's comments; there have been no further occurrences of this
error in the live estate. Returning for closure.
[END OF REFERENCE 28580224]
Responded to call type C as Category 40 -Incident Under Investigation
The response was delivered on the system
The Call record has been transferred to the Team: QFP
Hours spent since call received: 0 hours
Date:11-Jan-2002 09:31:00 User:Tariq Arain
The Call record has been assigned to the Team Member: Lionel Higman
Hours spent since call received: 0 hours
Date:11-Jan-2002 12:15:00 User:Lionel Higman
CALL PC0058994 closed: Category 68, Type C
Hours spent since call received: 0 hours
Root Cause Gen - Outside Program Control
Logger Deleted User -- UK Bridge Team
Subject Product Infrastructure -- KMS (version unspecified)
Assignee Deleted User -- UK Bridge Team
Last Progress 11-Jan-2002 12:15 -- Lionel Higman