FUJ00171953 - Peak Incident Management System Log [PC0070702] - [Time drift during Slave counter recovery] User Madhu Karia

Evidence on official site

FUJ00171953
FUJ00171953

Peak Incident Management System

Call Reference PC0070702 Call Logger POA Deleted User -- Deleted Team

Release Targeted At -- Horizon Future Unspecified — Top Ref 900
Call Type System Testing Incidents/Defects Priority D -- Non-urgent
Contact Deleted Contact Call Status Closed -- Administrative Response
Target Date —_—‘:10/10/2002 Effort (Man Days) 0
Summary Time drift during Slave counter recovery
All References Type Value
Supplier reference 900
Test reference TCMSG10
Progress Narrative

bate:15-Oct-2001 08:22:00 User:Madhu Karia
ALL PC0070702 opened
lkeferences entered ar
jf Test reference : TCMSG10

Product Infrastructure RIPOSTE messaging sw added

llarget Release entered: Unknown

time drift during Slave counter recovery

n test rig INF2, both the Gateway and Slave at FAD 505038 were built to S10
level with riposte ver 6.0.3.22. A large messagestore (from Trafalgar Square)
jas replicated from the Gateway to the Slave. It was observed that during
index inserts following message inserts (as indicated on the performance
jonitor) the clock on the Slave started drifting behind the Gateway and
catching up at the end of replication. Inspection of the NT Eventlog showed
licts of time synchronisation messages.

lkvidence has been forwarded to Mark Jarosz.

Please route this call to Mark Jarosz.

ALL PC0070702:Priority D:Calltype S - Target 29/10/01 09:22:02

[the Call record has been assigned to the Team Member: Chris Rayner

Defect cause updated to 42:Gen - Outside Pathway Control

fiours spent since call received: 3 hours

Jbate:15-Oct-2001 09:15:00 User:Chris Rayner

If} Response :

please route this call to Mark Jarosz (TDA) for investigation/resolution,
Mark is already aware of the problem and evidence.

[END OF REFERENCE 27867466]

lkesponded to call type $ as Category 30 ~TL confirmed

[the response was delivered on the system

the Call record has been transferred to the Team: QFP

fours spent since call received: 0 hours

JDate:15-Oct-2001 10:07:00 User: Tariq Arain
fhe Call record has been transferred to the Team: Escher-Dev
Hours spent since call received: 0 hours

[Dat ¢:26-Oct-2001 06:53:00 Uscr:Gareth Jenkins

Mark has talked to Madhu about this.

Basically, Mark has asked him to come up with a recipe for reproducing this
problem based on a synthetic message store that Escher can generate
Jchemselves. Once he has done this then we can forward direct to Escher.
Iherefore we need to route this PinICL back to Madhu.

Gareth (from MJ email)

[the Call record has been transferred to the Team: IP System Test

Hours spent since call received: 0 hours

[Date:20-Nov-2001 12:40:00 User:Tarig Arain
ffarget Release updated to BI_2

JDatc:21-Nov-2001 14:50:00 Uscr:Madhu Karia
the Call record has been assigned to the Team Member: Madhu Karia
Hours spent since call received: 0.1 hours

lDaté:04-Dec-2001 08:48:00 User:Madhu Karia

install riposte on two systems on the same LAN giving them the same group id
lout different node id, say 123456,1 and 123456,2.

Start riposte on both systems and apply the appropriate license and build

FUJ00171953
FUJ00171953

pbjects-
Use riposteputmessage to create a large number of messages, say 200000, on
leach system.

create about 30 indexes, including Date, Time, CRC, on each system.

Use ripostenode to configure the two systems to be neighbors and let the
essagestores synchronise.

configure 123456,2 to synchronise time with 123456,1 with a drift value of 5
seconds
stop riposte on 123456,2 and delete the messagestore.

start riposte on 123456,2 and let it recover from 123456,1.

Display clocks on both systems.

tun perfmon on 123456,2 and display riposte counters MessageInserts/sec,
indexInserts/sec.

fter about 20 minutes, when MessageInserts have been completed and
Iindextnserts is taking place, the clock on 123456,2 starts to lag the one on
123456,1. The lag continues to increase until all the indexing activity is
complete. Then the two clocks synchronise. NT event log shows a large number
lof time synch messages (ID 198, 199)

[fhe Call record has been transferred to the Team: Escher-Dev

Hours spent since call received: 3 hours

Dat e:04-Dec-2001 10:37:00 User:Del (04/03 Brian Orzel)
Email to Gareth

lDate:04-Dec-2001 12:11:00 User:Madhu Karia
lapclogies for being a bit too hasty with the 'send' key.

lattached herewith is a zip file containing the full event log from node 2 and
la snapshot of the network trace while node 2 was falling behind node 1. At
the time the cpu usage on node 2 was between 20 and 30 percent. The time
drift observed at one stage was nearly an hour and still drifting behind.

lew evidence added - Event log and network trace

Dat e:04—Dec-2001 12:36:00 Uscr:Madhu Karia
lew evidence added - Text of event log

lDate:05-Dec-2001 11:36:00 User:Madhu Karia
arian Orzell suggested we repeat the above test on riposte 6.0.3. The ‘time
lirift' problem was first observed on riposte 6.0.3.22 using the large
jfrafalgar Square messagestore. Then the problem was reproduced on riposte
6.2.5 using the test messagestore created by riposteputmessage (so Escher can
reproduce the problem). I have, today, used the test messagestore procedure
jon riposte 6.0.3.22 using the same two counters and following the same
rocedure as before.

servations:

at 22:00 started node2 in recovery.

clocks displayed on both nodel and node2

lPerfmon showing MessageInserts and IndexInserts going on.

Approx 22:45 MessageInserts end, Indexinserts carries on and the clock on
Inode2 starts drifting straight away.

letmon trace shows node2 responding with <Offline:1>

ter aboot an Nour) ver eee ee

Inode1 clock= 23:43:59 - node? clock= 23:02:09

jnode2: ripostenode status for nodel shows Marker Delta: 0 messages
ipostestatus 2 shows Status: Recovering ..License Object not

validated

Inodel: ripostenode status for node2 shows Offline Disconnected and Marker
jbelta: 0 messages

Hope this gives some clues in locating the problem.

Jbate:13-Dec-2001 11:00:00 User:Del (04/03 Brian Orzel)

It am sending this to Escher.

the Call record has been assigned to the Team Member: At-Escher
ours spent since call received: 0 hours

jbate:15-an-2002 0:
Feedback from Esche.
900 PC0070702 ~—- Reproduced scheduled for WR 2.1.2

the Call record has been assigned to the Team Member: Escher recreated
liours spent since call received: 0 hours

9:00 User:Del (04/03 Brian Orzel)

JDate:17-Jan-2002 08:02:00 User:Del (04/03 Brian Orzel)

\The following item needs additional investigation to determine how
destabilizing the fix might be:

Pc0070702 900 D Recreated Time drift during Slave counter recovery

jbate:22-Jan-2002 10:15:00 Uscr:Del (04/03 Brian Orzel)
lPc0070702 900 D rec.Discus? Time drift during Slave counter recovery
km = An improvement is being planned in this area. The pure fix remains

FUJ00171953
FUJ00171953

jinder investigation:

bate:05-Feb-2002 13:06:00 User:Del (04/03 Brian Orzel)

ldispite the above, I can find no mention of this bug in the release note wor
R211. Retest?

fhe Call record has been transferred to the Team: IP System Test

Hours spent since call received: 0 hours

Ibate:05-Feb-2002 13:08:00 User:De1 (04/03 Brian Orzel)
It is also still marked as open on the Escher database. I will ask them for
lan update.

Jbate:06-Feb-2002 0:
Please return this c

3:00 User:Madhu Karia
11 for testing when a fix / improvement is confirmed by
lescher. If an improvement is claimed, please get a definition of the
improvement so we know what we are looking for in the absence of a fix.

[the Call record has been transferred to the Team: Escher-Dev

fours spent since call received: 0.5 hours

jDate:07-Feb-2002 09:45:00 User:Gareth Jenkins
du,

le're getting confusing stories from Escher. Please can you repeat the test

lon 6.2.10 to see if anything has changed. We are not expecting any further
ersions of the message server for B12, so we need to know exactly what is

the case on this one.

hanks

careth

the Call record has been transferred to the Team: IP System Test

fours spent since call received: 0 hours

JDate:08-Feb-2002 16:42:00 Uscr:Chris Rayner
the Call record has been assigned to the Team Member: Madhu Karia
fours spent since call received: 0 hours

Joate:14-Feb-2002 08:44:00 User:Madhu Karia
ith the Counters upgraded to Riposte 6.2.10 used the Trafalgar Square
nessagestore and a smaller ‘synthetic! one to test the recovery. No time
ldxift was evident on the clocks displayed on the Counters or the event logs.
lo, whatever it is that Escher did seems to have fixed the problem. There are
ls lot of events, EID 201: "Connected to neighbor (nnnn,m) while in recovery
mode. <Marker:........> "in the event log which occurred previously but were
not mentioned as I thought they were part of the same problem. Seems to me to
Ye a minor problem. Therefore, closing the call.

cALL PC0070702 closed: Category 60, Type $

fours spent since call received: 4.0 hours

Jat c:19-Jun-2002 08:46:00 User:Del (04/03 Brian Orzel)
Fix claimed in todays WR2.1.2 release

lbate: 26-Sep-2002 09:07:00 User:Madhu Karia
CALI. PC0070702 Reopened: Riposte 6.2.15 failed the test.
CALL PC0070702:Priority D:CallType $ - Target 10/10/02 1

Date: 26-Sep-2002 09:10:00 User:Madhu Karia
IRiposte 6.2.15 was used for testing Gateway to Slave replication. Time drift
as evident as before.

fhe Call record has been transferred to the Team: Escher-Dev

fours spent since call received: 2 hours

lbate:30-Sep-2002 13:07:00 User:Madhu Karia
summary of the time synch and time drift problems
in Riposte 6.0.3 and 6.2.5, the time drift during replication problem
described in this call was noticed.
A fix for this problem was provided in Riposte 6.2.10. However, time synch
between time server and client was affected (see PinICL 74043). In Riposte
6.2.12, the time synchronisation appeared to revert to the 6.2.5 behaviour.
but, because this version of riposte was not taken on the counters, the time
Jarift problem was not tested.
IRiposte 6.2.15 is due to go on all platforms and was therefore tested on
counters and correspondence servers. The time synch problem (PinICL 74043)
las been fixed but the time drift during replication problem has reappeared.

FUJ00171953
FUJ00171953

[Date:02-Oct-2002 10:07:00 Uscr:Del (04/03 Brian Orzel)
the Call record has been assigned to the Team Membe
tours spent since call r

At-Escher
ived: 0 hours

Dat e:02-Oct-2002 10:35:00 User:Del (04/03 Brian Orzel)
the call references have been updated. They are now:
fest reference : TCMSG10

If Supplier reference : 900

JDate:10-Oct-2002 10:41:00 User:Lionel Higman
ffarget Release updated to BI

Date:10-Dec-2002 11:10:00 User:Lionel Higman
lretarget at request of Gareth Jenkins.
lfarget Release updated to Future Unspecified

bate:09-Jan-2004 14:11:36 User:_Customer Call_
Please route this call to Mark Jarosz (TDA) for investigation/resolution, Mark is already aware of the problem and evidence.

JDate:10-Mar-2004 08:06:32 User:Tain Janssens
jpdated to Escher recreated after meeting with Alan Smalley 09-Mar-2004
[the Call record has been assigned to the Team Member: Escher recreated

joate:21-un-2004 11:51:21 User:Lionel Higman
the call TargetRelease has been changed from:-
Future Unspecified

the call TargetRelease is now:-

Future Unspecified

bate:13-Sep-2004 14:05:29 User:Iain Janssens
{Start of Response]

[his call is being returned for closure as a result of the Escher $80 review
(held 10-Sep-2004). This has been agreed between Development and ITU (Simon
aughan)

(END OF REFERENCE 38535697]

esponse code to call type S as Category 68

JDate:44-Sep-2004 07:29:08 User:Madhu Karia
[Start of Response]

closing as per instructions above - no fix to the problem.
[END OF REFERENCE 38538960]

Response code to call type S$ as Category 68

outing to Call Logger following Final Progress update.
fours spent since call received: 0 hours

IDate:14-Sep-2004 07:29:19 User:Madhu Karia

ALL PC0070702 closed: Category 62 Type S

Root Cause Gen - Outside Program Control

Logger POA Deleted User -- Deleted Team

Subject Product Infrastructure -- RIPOSTE messaging sw (version unspecified)
Assignee Deleted User -- Deleted Team

Last Progress 14-Sep-2004 07:29 -- Madhu Karia