FUJ00075159 - Peak Incident Management System (PC0053384) - Time Service errors at counter

FUJ00075159

Peak Incident Management System

Call Reference: PC0053384              Call Logger: Deleted User -- Analysts
Release: Targeted At -- CSR-CI4R       Top Ref: FSTK_2_0 WP10691

Call Type: Product Incidents/Defects   Priority: C -- Progress restricted
Contact: Deleted Contact               Call Status: Closed -- No fault in product
Target Date: 08/09/2000                Effort (Man Days): 0

Summary: Time Service errors at counter

All References
Type             Value
Fast track fix   FSTK_2_0 WP10691
Other            Futures?
Other            IC
Work Package     PWY_WP_10691

Progress Narrative

Date:01-Sep-2000 15:07:00 User:John Pope
CALL PC0053384 opened
References entered are:
Product Infrastructure Unknown Infra'sture added

Target Release entered: Unknown

Time Service errors at counter

Two and three nights ago 1000+ transactions failed to go to TIP for reasons
associated with apparent negative transaction times on OBCS
transactions/events at counters. It was found that the time service to the
correspondence servers was not working, and that there was a difference in
time of a couple of minutes. This caused the counters to be re-set to and
fro according to which server the gateway connected to on successive OBCS
Foreign transactions.

Last night 5,000 + transactions were dropped despite the fact that time
service was apparently working. Jim Stinchcombe confirmed to me in the last
hour that the servers were apparently synched to within a fraction of a
second. The "negative" times last night were smaller, and suggest that the
counter clocks were jumping by something of the order of 20 seconds despite
the correspondence servers being in line. This should not be happening and
needs investigation, as there is a Requirement to maintain accurate counter
times. This is not high priority, because changes being made to the
harvester tonight will stop rejects even if the time service defect remains.
Secondly, it seems to me that the correspondence servers should not have
drifted apart by 2 minutes + just because time service was off for a couple
of days. Does this imply that the hardware clock is defective on one of the
servers?
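
The "negative transaction time" symptom described above can be illustrated with a minimal sketch. The record layout (FAD code, start stamp, end stamp) and the sample values are assumptions made up for illustration; they are not taken from the harvester or from the attached evidence.

```python
from datetime import datetime, timedelta

def negative_durations(transactions):
    """Return (fad, duration) pairs where the end stamp precedes the start stamp."""
    flagged = []
    for fad, start, end in transactions:
        duration = end - start
        if duration < timedelta(0):
            flagged.append((fad, duration))
    return flagged

# Made-up sample rows: (FAD code, start time, end time).
sample = [
    ("000001", datetime(2000, 8, 31, 10, 0, 0), datetime(2000, 8, 31, 10, 0, 4)),
    # The counter clock was set back between start and end, so end < start.
    ("000002", datetime(2000, 8, 31, 10, 5, 30), datetime(2000, 8, 31, 10, 5, 12)),
]

flagged = negative_durations(sample)
for fad, duration in flagged:
    print(fad, duration.total_seconds())
```

A check of this kind only shows that a clock moved backwards between the two stamps; it does not say which clock was wrong or why.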

CALL PC0053384:Priority C:CallType P - Target 08/09/00 16:07:51

The Call record has been transferred to the Team: QFP

Defect cause updated to 99:General - Unknown

Hours spent since call received: .5 hours

Date:01-Sep-2000 15:33:00 User:Lionel Higman

FAO James Stinchcombe

The Call record has been transferred to the Team: TDA
Hours spent since call received: 0 hours

Date:01-Sep-2000 15:40:00 User:Roger Donato
There were about 5000 instances on 31/08/2000 which calculated a negative
value for foreign enquiry time. I've attached one example for FAD 105002
showing start time taking place after end time. There are many others!!!

Date:01-Sep-2000 15:42:00 User:Roger Donato
New evidence added - Example transaction for FAD 105002

Date:04-Sep-2000 09:36:00 User:Deleted user (Carolyn Payne Jun01)
The Call record has been assigned to the Team Member: James Stinchcombe
Hours spent since call received: .1 hours

Date:04-Sep-2000 11:30:00 User:James Stinchcombe

I've talked this through with Mark Jarosz. We believe that the problem is
that Riposte doesn't take account of the ISDN delay caused by the dialling.
This would typically be around 2 seconds but could be much longer in
the case of first choice dial problems. This will cause the clocks to move by
this much, probably every day at most counters.
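
The effect described in this entry can be sketched with a deliberately simplified model. It assumes the counter adopts the server's send-time stamp on arrival without allowing for transit delay; the function name and the delay figures are illustrative assumptions, not Riposte's actual protocol or measured values.

```python
def clock_error_after_sync_ms(transit_delay_ms):
    """Clock error right after a naive sync that ignores transit delay.

    The server stamps the message when it is sent. By the time the message
    arrives, true time has advanced by transit_delay_ms, so a counter that
    adopts the stamp unchanged ends up that many milliseconds slow.
    """
    return -transit_delay_ms

# Illustrative figures: a sync message that triggers an ISDN dial waits for
# call set-up (about 2 s here); a later message on the open line does not.
error_with_dial = clock_error_after_sync_ms(2000)
error_on_open_line = clock_error_after_sync_ms(50)

# Size of the jump the counter clock makes between the two syncs.
jump_ms = error_on_open_line - error_with_dial
print(error_with_dial, error_on_open_line, jump_ms)
```

Under this model a longer dial delay gives a proportionally larger jump, which is consistent with the entry's remark that first choice dial problems could make the effect much larger than 2 seconds.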


John, could you please find what we are contractually obliged to do in this
area. We can then try and sort out the problem. What we need to know is the
level of accuracy required, and what we have committed to. Many thanks, James
The Call record has been transferred to the Team: Requirements

Hours spent since call received: 0.5 hours

Date:04-Sep-2000 1
F} Response :

We have not committed ourselves to any specific time, but have made a number
of comments which give a flavour. We talk about 0.5 seconds in the context
of the accuracy at the NP servers. We talk about having robust operational
systems to maintain accuracy. We explicitly comment that our methodology will
be good at synchronising times on the transactions themselves.

I think a couple of seconds would be acceptable, but that 20 seconds is not.
It seems to me that Riposte is logically at fault if it uses the time stamp on
the first synchronisation message in a conversation (i.e. the message that
initiated the ISDN call) for synch purposes, and we should ask Esher to fix
this asap, whatever else we do choose to do.

[END OF REFERENCE 21509129]

Responded to call type P as Category 40 -Incident Under Investigation

The response was delivered on the system

The Call record has been transferred to the Team: TDA
Hours spent since call received: .5 hours

3:00 User:John Pope

Date:05-Sep-2000 08:02:00 User:Gareth Jenkins
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours

Date:27-Sep-2000 07:03:00 User:Gareth Jenkins
Passing on to Glenn to raise a CP as discussed 26/9/00.

The Riposte configuration parameter to change at the counter to only synch if
greater than 5 secs out is TimeSynchDriftLimit. This is defined in
milliseconds. The Riposte default is 60000 (ie one minute). The rollout
script currently sets this to 1000 (ie 1 second). As discussed it should be
changed to 5000 (5 seconds).
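
As a rough illustration of what a drift limit of this kind does, the sketch below resynchronises only when the local clock differs from the reference by more than the limit. The three limit values come from this entry; the function name and the strict "greater than" comparison are assumptions, not Riposte's actual implementation.

```python
RIPOSTE_DEFAULT_MS = 60000   # Riposte default quoted in this entry (one minute)
ROLLOUT_SCRIPT_MS = 1000     # value the rollout script currently sets
PROPOSED_MS = 5000           # proposed new value (5 seconds)

def should_resync(local_ms, reference_ms, drift_limit_ms):
    """Return True when the clocks differ by more than the drift limit."""
    return abs(reference_ms - local_ms) > drift_limit_ms

# A counter that is 2 seconds away from the reference time:
local, reference = 10_000, 12_000
decisions = {
    limit: should_resync(local, reference, limit)
    for limit in (ROLLOUT_SCRIPT_MS, PROPOSED_MS, RIPOSTE_DEFAULT_MS)
}
print(decisions)
```

With a 2 second offset the counter is reset under the 1000 ms setting but left alone under 5000 ms, while a 20 second offset would still trigger a reset under either.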

The Call record has been assigned to the Team Member: Glenn Stephens

Hours spent since call received: .5 hours

Date:07-Nov-2000 10:22:00 User:del (05/01 John McLean)
Target Release updated to M1Clone

Date:01-Dec-2000 17:04:00 User:Lionel Higman
Updates agreed at tdaqfp (JD/JMcL/LMH)

Target Release updated to DPL - unknown

The call references have been updated. They are now:-
Other : Futures?

Date:06-Dec-2000 09:12:00 User:Gareth Jenkins
Having discussed this with Glenn, it would appear we need a change to
AutoConfig and to the Standard Riposte Build for counters at M1. There are
actually 2 PinICLs on this problem (58817 and 53384).

58817 will be used to handle the Autoconfig change

53384 (this PinICL) will handle the Riposte Build change

Gareth

The change for the Riposte Config parameters is to change the value of
TimeSynchDriftLimit from the current value of 1000 to a new value of 5000.
\'D/SPE/010 version 0.5 contains this change.

Passing PinICL to PIT to be implemented as part of the standard M1 build /
migration.

Gareth

Date:06-Dec-2000 09:14:00 User:Gareth Jenkins
The Call record has been transferred to the Team: PIT
Hours spent since call received: 0 hours

Date:06-Dec-2000 10:15:00 User:Del (01/03 Ajay Nehra)
The Call record has been assigned to the Team Member: Ajay Nehra
Hours spent since call received: 0 hours

Date:07-Dec-2000 13:11:00 User:Del (01/03 Ajay Nehra)
The call references have been updated. They are now:-
Other : Futures?

Work Package : PWY_WP_10691

F} Response :

Fix issued in PWY_WP_10691 (CI4R_WP10691)

[END OF REFERENCE 23617414]


Responded to call type P as Category 46 -Product Error Fixed
The response was delivered on the system
The Call record has been transferred to the Team: Dev-Int-Rel
Hours spent since call received: 1 hours

Date:07-Dec-2000 15:55:00 User:Miho Fujii
The call references have been updated. They are now:-

Other : Futures?

Work Package : PWY_WP_10691

Fast track fix : FSTK_2_0 WP10691

F} Response :

Fast track available, please test.

[END OF REFERENCE 23626649]

Responded to call type P as Category 60 -S/W Fix Released to Call Logger
Hours spent since call received: 0 hours

The response was delivered on the system

Date:08-Dec-2000 09:15:00 User:del (01/01 Denise Jackson)
M1 Clone confirmed by QFP

Target Release updated to M1Clone

The call references have been updated. They are now:-
Other : Futures?

Work Package : PWY_WP_10691

Fast track fix : FSTK_2_0 WP10691

Other : C

Date: 20-Dec-2000 1.
F} Response :

The original complaint was that there was evidence to show that time synch at
the counters was out by up to 20 seconds. The "solution" suggested is to
degrade accuracy from 1 second to 5 seconds !?

Is it suggested that the original evidence (provided by Roger Donato) was
wrong?

[END OF REFERENCE 23865613]

Responded to call type P as Category 52 -Response Rejected

The response was delivered on the system

6:00 User:John Pope

The Call record has been transferred to the Team: TDA
Hours spent since call received: 1 hours

Date:20-Dec-2000 20:47:00 User:Allan Hodgkinson
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours

Date:22-Dec-2000 13:14:00 User:Gareth Jenkins
F} Response :

The original problem had the 2 Data Centres out of synch by a significant
amount (I believe it was 2 mins rather than 20 secs). This was due to time
service software not running correctly. That was an operational problem. I
believe that this PinICL has successfully sorted out the consequences of
that. NB there will still be "oscillations" of the clock if the Data Centre
time is not maintained correctly. If you wish to pursue that, then please
raise a separate PinICL. I believe that this PinICL can be closed.

Gareth

[END OF REFERENCE 23917567]

Responded to call type P as Category 94 -Advice and guidance given

Hours spent since call received: 0 hours

The response was delivered on the system

Date: 07-Jan-2001 12:43:00 User:Lionel Higman
Target Release updated to DIL - unknown

Date:08-Jan-2001 14:02:00 User:John Pope
F} Response :

Gareth.

The original problem was indeed 2 minutes, and was due to Time Service not
running. However, even after this was rectified there was still clear
evidence from observing OBCS transaction start and finish times in the
last week of August that the clocks were still jumping by up to 20 seconds,
which is why I raised this PinICL. It has not been explained (at least not
in the responses to this PinICL) what caused the 20 second jumps, and why the
action taken is an appropriate response.

[END OF REFERENCE 24031122]

Responded to call type P as Category 40 -Incident Under Investigation

The response was delivered on the system

The Call record has been transferred to the Team: TDA

Hours spent since call received: .2 hours


Date:15-Jan-2001 15:12:00 User:Allan Hodgkinson
The Call record has been assigned to the Team Member: Gareth Jenkins
Hours spent since call received: 0 hours

Date:30-Jan-2001 11:36:00 User:Lionel Higman
Target Release updated to CSR-CI4R

Date:16-Feb-2001 14:10:00 User:Gareth Jenkins
F} Response :

I have heard allegations of the 20 second jumps but I've not seen any
evidence of it nor details of when it happened. I don't believe it is worth
spending any further time investigating alleged problems that occurred
during a period of instability after CI4 migration.

I am aware of the following related issues to do with Time Management that
have been addressed :-

1) TPS Harvester treated negative OBCS response times as exceptions (This
is the issue which originally caused this PinICL to be raised). This was
fixed in early September 2000.

2) OBCS measured response time from clock readings. At M1 (or possibly
earlier), it has been changed to use "ticks" which avoids time drift issues
(provided it handles the 49.7 day "tick wrap" issue)

3) Clock drift at the counter has been configured (from M1) to be 5 seconds
rather than 1 second to avoid hysteresis

4) Configuration / build problems with Time Services at the Data Centre
have been addressed. This was addressed in February 2001.

The only known outstanding issue is the fact that Riposte uses the packet
send time in its time synch protocol and does not allow for ISDN call set up
time. This is item 4,36 on the Riposte Enhancement Register. With the other
measures taken this is unlikely to give negative transaction times.

I suggest that this PinICL is closed and the situation reviewed again once M1
is fully operational. If problems are still found, then further PinICLs
should be raised to investigate the specific problems identified with
appropriate evidence.

Gareth

[END OF REFERENCE 24843510]
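
The 49.7 day "tick wrap" mentioned in item 2 of the response above matches the point at which a 32-bit millisecond counter overflows. The sketch below shows the arithmetic and one common way of measuring elapsed time across the wrap. That the OBCS "ticks" are a 32-bit millisecond counter is an assumption inferred from the 49.7 day figure; the function name is hypothetical.

```python
TICK_MODULUS = 2 ** 32            # a 32-bit counter wraps back to 0 here
MS_PER_DAY = 24 * 60 * 60 * 1000

# Days until a millisecond counter of this width wraps.
WRAP_DAYS = TICK_MODULUS / MS_PER_DAY

def elapsed_ticks(start_tick, end_tick):
    """Elapsed milliseconds between two readings, tolerating a single wrap."""
    return (end_tick - start_tick) % TICK_MODULUS

# A measurement that starts 500 ms before the wrap and ends 1500 ms after it.
start = TICK_MODULUS - 500
end = 1500
naive = end - start               # large negative number
safe = elapsed_ticks(start, end)  # true elapsed time in milliseconds
print(round(WRAP_DAYS, 1), naive, safe)
```

The modular subtraction is only valid while the interval being measured is shorter than one full wrap period, which is comfortably true for a transaction response time.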

Responded to call type P as Category 94 -Advice and guidance given
Hours spent since call received: .1 hours

The response was delivered on the system

Date:15-May-2001 1.
F} Response :
Examination of a sample of OBCS foreigns (courtesy of SRC) shows no evidence
of strange clock re-setting.

[END OF REFERENCE 26076962]

Responded to call type P as Category 62 -No fault in product

Hours spent since call received: 1 hours

The response was delivered on the system

2:00 User:John Pope

Date:15-May-2001 13:03:00 User:John Pope
CALL PC0053384 closed: Category 62, Type P

Hours spent since call received: 1 hours

Defect cause updated to 39:General - User Knowledge

Root Cause: General - User Knowledge

Logger: Deleted User -- Analysts

Subject: Product Infrastructure -- Unknown Infra'sture (version unspecified)
Assignee: Deleted User -- Analysts

Last Progress: 15-May-2001 13:03 -- John Pope