POL00028464
ICL Pathway Acceptance Incident 298 — Resolution Plan
Ref: CR/ACD/298
Version: 0.5
Date: 10/9/99
Document Title: Acceptance Incident 298 — Resolution Plan
Document Type: Acceptance Resolution Plan
Abstract: This document contains ICL Pathway’s updated resolution
plan for Acceptance Incident 298.
Status: Draft
Distribution: Expert:
Peter Copping
ICL Pathway:
Terry Austin
David Hollingsworth
Library
POCL:
John Meagher
Min Burdett
Jeff Austin
Author: J CC Dicks & D.C.Hollingsworth
Comments to: Pathway list
Comments by:
© 1999 ICL Pathway Ltd COMMERCIAL IN CONFIDENCE Page 1 of 24
0 Document control
0.1 Document history
Version  Date     Reason
0.1      20/8/99  Initial draft for comments
0.2      24/8/99  Version for the Expert and workshop 26/8
0.3      2/9/99   Redrafted as a resolution plan
0.4      9/9/99   Material added on longer term incidence rates and defect
                  prevention for future releases; distributed as a draft at
                  Acceptance Workshop 9/9/99
0.5      10/9/99  Statistics updated to CAP 24; amendments to show
                  statistics by counter volumes as a result of Acceptance
                  Workshop 9/9/99
0.2 Approval authorities
Name         Position                        Signature  Date
J H Bennett  Managing Director
JCC Dicks    Customer Requirements Director
TP Austin    Development Director
0.3 Associated documents
Reference Vers Title Source
0.4 Abbreviations
0.5 Table of contents
1 PURPOSE
2 SUMMARY
3 CRITERIA
4 POCL POSITION
5 PATHWAY POSITION
5.1 PATHWAY WORK PROGRAMME
5.1.1 Short-Medium Term Activities
5.1.2 Medium-Long Term Activities
5.2 STATISTICS FOR THE PERIOD SINCE 29 JULY
5.2.1 High level analysis
5.2.2 System Load Events & “Unauthorised” Reboots
5.2.3 System Incident Metrics
5.3 DETAILED INCIDENT ANALYSIS, CATEGORISATION & RESOLUTION
5.3.1 Button No Entry Signs
5.3.2 Suspense Account Print
5.3.3 Virtual Memory Problems
5.3.4 Printer Hanging
5.3.5 Freezing during/after log-on
5.3.6 F1 Twice during log-on
5.3.7 System Busy Message
5.3.8 Query Logged-on Users Message
5.3.9 Miscellaneous Freezing / Usage
5.3.10 Counter Printer problems
5.3.11 APS Problems
5.3.12 OBCS Problems
5.3.13 Counter Printer Busy Problems
5.4 RESOLUTION OF INCIDENT METRICS
5.4.1. Contractual Requirements
5.4.2. Comparison against Industry Norms
5.4.3. Acceptance Position
5.4.4. Resolution Proposal
5.5 IMPROVED DEFECT REMOVAL FOR FUTURE RELEASES
5.5.1 PINICL Analysis
5.5.2. Implications for CSR+
1 Purpose
This paper seeks agreed ways forward to resolve the system instability issues.
2 Summary
Pathway presents for review the relevant statistics for the period since 29 July,
with particular reference to System Load Events; the progress to date at a
detailed level; and the approach to future measurement, which it is proposed
will involve POCL.
3 Criteria
The Criterion cited is 536/1.
“peripheral and input devices supplied as part of the elements of the Service
Infrastructure on which OPS is provided shall be reliable, robust and easy to
use”.
4 POCL position
Based upon the minutes of the Acceptance Board Meeting of 18 August 1999,
POCL contended that:
“the proposed rectification plan does not provide an understanding of how the
problems will be resolved by the proposed fixes. It is also unclear when fixes
will be implemented”.
“POCL would need to see the outturn of [the fixes] as this was the only way to
confirm the impact of the changes”.
“evidence from ringarounds suggested the problem could be 50% higher than
reported at the help desk and that there was no clear evidence from Pathway to
confirm or deny this”.
At the Acceptance Workshop on 6 September POCL introduced a proposed
metric of 1 system “lock-up” or “crash” (requiring reboot) per counter PC per
annum. This is based upon the achievement of a 95% reduction in stability
incidents reported against week 19 and is said to be broadly in line with system
stability statistics from ECCO and ALPS.
5 Pathway position
5.1 Pathway work programme
5.1.1 Short-Medium Term Activities
The ICL Pathway programme of work to stabilise the current level of system
comprises root cause analysis and resolution of system incidents:
• detailed examination of Horizon System Help Desk call records
• direct telephone contact with post offices to more fully understand the
detailed nature of the problem as seen by the users
• reconstruction and analysis of problems within Pathway test systems
• testing and automated distribution of fixes as described in the Acceptance
Incident Analysis of 17 August
The details of this work programme are provided in Section 5.3, which gives an
analysis of the various system stability faults by category, along with details of
fixes applied and associated incident levels pre- and post-fix.
5.1.2 Medium-Long Term Activities
In parallel with this short term activity, a thorough review of the detected faults
is underway to ascertain their nature and to identify what changes may be
appropriate to the ongoing Pathway development and testing approach. Section
5.5 of this document provides details of the analysis already undertaken in this
respect, the initial conclusions and suggestions for improved defect removal for
future releases.
5.2 Statistics for the period since 29 July
5.2.1 High level analysis
The principal measure of system instability has been the calls made to the
Horizon System Help Desk by outlet staff reporting a problem with the
functioning of the system at the outlet.
For a proportion of such calls the incident is resolved by a system unit reboot (a
Help Desk “authorised reboot”). In other cases the Help Desk staff may
recommend an avoidance action that provides a simple workaround to the
problem without rebooting the system unit. In certain cases the Help Desk may
also receive a call from an outlet advising that outlet staff have locally initiated a
reboot; such calls are recorded by the Help Desk and normally provide some
additional information relating to the circumstances of the incident.
5.2.2 System Load Events & “Unauthorised” Reboots
POCL expressed concern over the potential occurrence at outlets of locally
initiated system unit reboots that had not been reported to the Help Desk. ICL
Pathway subsequently mounted an exercise to obtain this information by
extracting and analysing the Windows NT System Event Logs at each outlet.
This provides precise statistics for all System Load Events (SLEs) whatever their
cause. By correlating these load events with reboot instructions issued at the
Help Desk it has been possible to produce metrics for both authorised (via
HSH) reboots and unauthorised (via local office action) reboots. This analysis is
continuing on a day by day basis.
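The correlation of load events with Help Desk reboot instructions can be illustrated with a minimal sketch. The record layout, the 30-minute matching window and the sample data are assumptions made for illustration only; the document does not state how the matching was actually performed.

```python
from datetime import datetime, timedelta

# Assumed matching window (hypothetical): a load event counts as "authorised"
# if the Help Desk instructed a reboot of the same counter shortly before it.
MATCH_WINDOW = timedelta(minutes=30)


def classify_load_events(load_events, reboot_instructions, window=MATCH_WINDOW):
    """Split System Load Events into authorised and unauthorised reboots.

    load_events: list of (counter_id, datetime) taken from the event logs.
    reboot_instructions: list of (counter_id, datetime) from Help Desk records.
    Each instruction is matched to at most one load event.
    """
    unused = sorted(reboot_instructions, key=lambda r: r[1])
    authorised, unauthorised = [], []
    for counter_id, booted_at in sorted(load_events, key=lambda e: e[1]):
        match = next(
            (r for r in unused
             if r[0] == counter_id
             and timedelta(0) <= booted_at - r[1] <= window),
            None,
        )
        if match is not None:
            unused.remove(match)
            authorised.append((counter_id, booted_at))
        else:
            unauthorised.append((counter_id, booted_at))
    return authorised, unauthorised


# Invented sample data, not taken from the live estate.
events = [
    ("counter-01", datetime(1999, 9, 6, 9, 12)),
    ("counter-01", datetime(1999, 9, 6, 14, 40)),
    ("counter-02", datetime(1999, 9, 6, 8, 55)),
]
instructions = [("counter-01", datetime(1999, 9, 6, 9, 5))]

authorised, unauthorised = classify_load_events(events, instructions)
print(len(authorised), "authorised,", len(unauthorised), "unauthorised")
```

Any load event left unmatched is counted as unauthorised whatever its cause, which is consistent with the three possible causes listed next.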
Such unauthorised reboots may occur for a variety of reasons, including:
1. in response to a perceived systems malfunction of some kind, where the
clerk does not contact the Help Desk and initiates such action of his own
volition
2. in response to an environmental incident such as a power cut or through
disconnection of the power supply
3. through failure to leave the machines switched on during periods of
unattended operation (e.g. overnight or weekends) with corresponding
reboots when operation restarts, e.g. on a Monday morning
Since the circumstances relating to such incidents are unknown, the incidents
cannot be directly attributed as systems stability incidents and must be excluded
from the detailed analysis in the following section. Both POCL and ICL
Pathway are working to reduce the incidence of such reboots to the core
unavoidable events (category 2) through improved user education and
discipline.
5.2.3 System Incident Metrics
The high level analysis of system instability incidents thus includes three
categories:
• Authorised reboots (correlated with Help Desk instructions)
• Unauthorised reboots
• Total Help Desk system incidents (including authorised reboots and other
calls closed via avoidance actions)
Summary totals for the Cash Account Periods 19-24 are shown in the following
chart.
[Chart: Reconciled Totals. Series: HSH System Incident Calls, HSH Authorised Reboots, Unauthorised Reboots.]
Note that CAP23 included a Bank Holiday and a planned (authorised) reboot of
all counters, by request to outlets.
A more detailed scheme of incident analysis was instigated by Pathway from
CAP23, to facilitate focused incident analysis and resolution. This places
emphasis on that class of incidents which requires a system reboot. From week
24 an individual reconciliation of incident totals between Pathway and POCL
has been carried out, with inclusion of a category for “disputed” items which
involve an HSH call but not a reboot. For week 23 a retrospective adjustment
has been added to the weekly total to support comparison between the two
weeks. However, direct comparison with earlier weeks is not valid since the
totals were not reconciled in this way.
The following chart shows the same data, with the planned reboot data
removed (31/8/99) and adjusted for the volume of counters installed. This
shows the incidence of the same measurements expressed as a rate of occurrence
per counter per week.
[Chart: Reconciled Totals, rate of occurrence per counter per week, CAP19 to CAP26. Series: HSH System Incident Calls, HSH Authorised Reboots, Unauthorised Reboots.]
From the above analysis it can be seen that there is a reducing trend, particularly
towards the end of the current period. The chart following shows the incidence
of HSH calls per counter per week relating to systems incidents. The level of
HSH (authorised) reboots is now at the level of approximately 0.5 per month
per counter, below the first Pathway target (1.0) and the proposed threshold for
classification as medium severity.
[Chart: HSH calls per counter per week relating to system incidents, CAP19 to CAP26. Series: HSH Authorised Reboots, HSH Avoidance Advice.]
The achieved rate of reduction against the initial incident set is actually more
significant since weeks 23 and 24 included a significant number of calls resulting
from a one-off OBCS problem (now fixed) and from the introduction of the
“System Busy” indicator. When these items are separately accounted for, the
position is as shown below.
[Chart: Incidents with one-off items separately accounted. Series: Original Incident Set, System Busy Incidents, OBCS Incidents.]
5.3 Detailed Incident Analysis, Categorisation & Resolution
To facilitate analysis and resolution, system incidents have been filtered into
individual categories, each typically associated with one particular problem area
of system operation. To provide confidence in the improving stability of the
system, incidents are recorded as daily totals within each category, to allow
correlation against the dates at which particular fixes were issued to resolve
specific problems. This analysis includes all system stability incidents whether
resolved by a system reboot or by procedural workaround.
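The correlation of daily totals against fix dates can be sketched as a simple before/after comparison within one category. The function name, the daily counts and the fix date used here are illustrative assumptions, not figures from the incident records.

```python
from datetime import date
from statistics import mean


def mean_daily_incidents_around_fix(daily_totals, fix_date):
    """Return (mean before fix, mean from fix date onwards) for one category.

    daily_totals: dict mapping a date to the incident count recorded that day.
    """
    before = [n for d, n in daily_totals.items() if d < fix_date]
    after = [n for d, n in daily_totals.items() if d >= fix_date]
    if not before or not after:
        raise ValueError("need at least one day on each side of the fix date")
    return mean(before), mean(after)


# Invented daily totals for a single hypothetical category.
daily_totals = {
    date(1999, 8, 14): 6,
    date(1999, 8, 15): 4,
    date(1999, 8, 16): 5,
    date(1999, 8, 17): 2,
    date(1999, 8, 18): 1,
    date(1999, 8, 19): 0,
}

pre, post = mean_daily_incidents_around_fix(daily_totals, date(1999, 8, 17))
print("mean per day before fix:", pre, "after fix:", post)
```

A comparison of this kind only suggests that a fix was effective; day-of-week effects such as cash account days would need to be allowed for before drawing conclusions.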
As detailed investigation of incidents proceeds, certain faults may be grouped
together into a new category. Initially 12 categories were identified. At
CAP24 a number of system busy incidents (category 7) have been recategorised
as the detail of the fault has been understood. Certain incidents
previously recorded under “system busy” have been identified as hangs
during/after log-on (category 5), and a new category has been created for a
specific problem associated with the counter printer during busy conditions
(category 13).
From version 0.5 of this document, the incident count has been based against
the number of counters installed and quoted as average incidents per counter
per annum.
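The normalisation described above amounts to dividing an incident count by the number of counters installed and scaling to a year. The following is a minimal sketch; the 52-week year and the sample figures are assumptions for illustration and are not taken from the statistics in this document.

```python
WEEKS_PER_YEAR = 52  # assumed scaling factor


def incidents_per_counter_per_week(incident_count, counters_installed):
    """Weekly incident rate normalised by the installed counter population."""
    if counters_installed <= 0:
        raise ValueError("counters_installed must be positive")
    return incident_count / counters_installed


def annualise(rate_per_week):
    """Convert a per-counter weekly rate into a per-counter annual rate."""
    return rate_per_week * WEEKS_PER_YEAR


# Invented example: 30 incidents in one week across 600 installed counters.
weekly = incidents_per_counter_per_week(incident_count=30, counters_installed=600)
print(round(weekly, 3), "per counter per week")
print(round(annualise(weekly), 1), "per counter per annum")
```

Normalising in this way keeps the figures comparable from week to week while the number of installed counters is still growing.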
5.3.1 Button No Entry Signs
From time to time under normal system operation Horizon buttons are
“locked” to prevent user entry to the particular function at that point in the
menu navigation. Such locked buttons are represented to the user by a “no
entry” sign across the button. Examples of legitimate usage of locked functions
include:
• prevention of more than one user selecting cash account functions or
producing certain types of daily printed report
• prevention of logout or entry to training mode when a suspended session
exists
At LT2 substantial changes were made to button locking particularly to prevent
access to conflicting functions during cash account and printing functions. The
logic associated with button locking is complex and typically requires
combinatorial analysis of multiple conditions.
Fixes were issued to correct the majority of incidents recorded within this
category, by correcting the complex logic associated with button locking. A
minor residual usage problem has been identified, which results in button
locking if the printer goes offline immediately following a SU balance report.
This problem has a simple workaround and does not require a reboot.
The history of button locking incidents is shown below.
[Chart: Button No Entry Signs]
Note that reported incidents tend to be higher on cash account days because of
a higher incidence of legitimate button locking associated with cash account and
office printing functions. A number of disputed items (incidents which do not
require reboot) are excluded from weeks 23 and 24. With these included the average
incident rate is running at approximately 1 to 1.5 per counter p.a.
5.3.2 Suspense Account Print
The suspense account was taking an excessive time to print under certain
circumstances, giving the appearance of a system hang. A fix to improve the
performance was issued in two parts. The history of such incidents is provided
below.
[Chart: Suspense A/C Print Hang, CAP19 to CAP26. Fix marker: 5296.]
5.3.3 Virtual Memory Problems
Two problems have been observed which result in progressive memory leakage.
(In these circumstances application routines are obtaining virtual memory from
Windows but not freeing it correctly after use, leading to eventual virtual
memory exhaustion.) The reported symptoms include very slow system
operation, virtual memory messages being displayed and, occasionally, a
Windows shutdown and reboot. The principal problem was memory leakage
associated with the Print Monitor routine, which resulted in a substantial loss of
virtual memory during print operations. This was fixed in WP 5408. A further
residual, but relatively minor, problem associated with the cash account reprint
function has been diagnosed. A fix (lower priority) will be issued for this in the
future.
[Chart: Virtual Memory Incidents, CAP19 to CAP26, with forecasts. Fix: WP 5408 - 17/8.]
5.3.4 Printer Hanging
Several problems were detected which result in back office printer hang-ups
under various specific circumstances. A fix for one class of problem, associated
with memory leakage, has already been distributed as part of WP 5408. This has
reduced the average incidence of such hang-ups. A second problem associated
with printing the 2 final copies of the cash account is under detailed diagnosis,
using results obtained from a diagnostic fix distributed to the live estate. The
fix to the cash account print routine was issued on 7 September. There
were no occurrences on the following Wednesday or Thursday (CAP 24).
[Chart: Cash A/C Print Hang, CAP19 to CAP26]
The residual count shown under CAP 24 relates to incidents from Thursday 2nd
September.
5.3.5 Freezing during/after log-on
A number of incidents were observed in which the system froze after user log-on
to Riposte. On detailed investigation these were all connected with the Riposte
(35 day) message archiving procedure. After log-on various Riposte checks are
called to trace message sequences for integrity and (potential) recovery
requirements. It was found that certain of these routines were attempting to
check message sequences which lay beyond the message archiving window,
resulting in system lock-up when the messages could not be accessed. Three
fixes were issued covering APS recovery routines and Stock Unit checking.
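The failure mode described above, in which a check reaches for messages outside the archiving window, can be illustrated with a generic guard. This is not the fix that was issued and does not use any Riposte interface; the function name, the record layout and the sample data are hypothetical, and only the 35 day window is taken from the text.

```python
from datetime import date, timedelta

ARCHIVE_WINDOW = timedelta(days=35)  # archiving period quoted in the text


def messages_within_window(messages, today, window=ARCHIVE_WINDOW):
    """Return only the messages that are still inside the archiving window.

    messages: list of (sequence_number, written_on) tuples (hypothetical layout).
    Older messages are skipped rather than accessed, so a check built on this
    list cannot wait on a message that has already been archived.
    """
    cutoff = today - window
    return [m for m in messages if m[1] >= cutoff]


# Invented sample data.
messages = [
    (1, date(1999, 7, 30)),  # older than 35 days on 10/9/99: skipped
    (2, date(1999, 8, 6)),   # exactly on the cutoff: still checked
    (3, date(1999, 9, 1)),
]
checkable = messages_within_window(messages, today=date(1999, 9, 10))
print([seq for seq, _ in checkable])
```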
[Chart: Freezing during/after Log-on, CAP19 to CAP26]
An occasional occurrence of freezing during log-in (prior to entering Riposte)
has also been detected and this residual error is under investigation. Some
instances of System Busy incidents have been discovered to relate to freezing
after log-in, which accounts for the significantly higher incident rate in CAP 24.
Note that the “Double F1” problem (immediately following section) is also
related.
5.3.6 F1 Twice during log-on
This was a specific condition associated with incorrect handling of double
keystroke “F1” during log-on (to navigate directly to “Serve Customer”) which
could result in a system hang. A fix was issued for this (WP5406), which left a
residual problem with certain OBCS book operations. A re-implemented fix was
issued to cure this - see Section 5.3.12. A second fix to eliminate a small
residual occurrence of the “F1” condition is under test at the time of writing and
will be issued to the live estate during week commencing 13th September.
[Chart: Freeze following F1 twice during Log-on, CAP19 to CAP26]
5.3.7 System Busy Message
This was introduced following discussion (via CR & Pathway CP2134) to
provide visible indication to the user when the system is busy, particularly
during longer, complex operations such as processing the cash account. This
was distributed in WP 5407. The introduction of this message has itself resulted
in a number of Help Desk calls, which have also been tracked and analysed. An
improved version of the busy monitor routine was distributed (week
commencing 6th September); this monitors only resource usage associated with
the Riposte desktop and invoked applications. (The original utility monitored
the total processor usage and could display the hourglass when background
routines such as NT or Tivoli functions were consuming resource.)
[Chart: System Busy Message (exc. printer/log-on)]
A minor problem has been detected with the operation of the Busy Monitor, in
that after a few seconds it can partially obliterate a system message displayed on
the screen if there is a printer problem when printing a Giro transaction. (This
can occur when EPOSS is cycling awaiting the user response before continuing.)
The touch panel is not disabled under these circumstances and the Help Desk
will advise users to complete the response to the printer prompt, thereby
allowing normal operation to continue without reboot. A “fix” to provide
reworking of the Giro printer dialogue will be issued in due course. From CAP
24 specific problems associated with printer busy and log-on freezes have been
separated into their own categories.
Note that the clerk may legitimately return from a screen to a previous screen,
having set off a print or transaction log query, and then undertake a second or
third intensive transaction. A number of occurrences of the system busy
condition are believed to result from such clerk-initiated sequences. A block on
the “previous” button is being investigated to preclude such behaviour.
5.3.8 Query Logged-on Users Message
This was a specific problem that occurred during various operations when a user
incorrectly received a message querying details of logged-on users. This was
fixed in WP 5406 which has eliminated the problem.
[Chart: Query Logged-on Users Message, CAP19 to CAP26. Fix: WP 5406 - 24/8.]
5.3.9 Miscellaneous Freezing / Usage
There have been a few occurrences of miscellaneous screen freezing during
usage, mostly within Stock Unit declaration and balancing operations. A few
reported occurrences were associated with virtual memory problems and are
resolved with the fix identified in section 5.3.3. Several occurrences resulted
from attempts to access message sequences beyond the 35 day archiving period
and other occurrences are associated with multiple button pressing.
[Chart: Miscellaneous freezing not in other categories, CAP19 to CAP26]
Diagnosis continues on these and appropriate fixes will be issued in due course.
5.3.10 Counter Printer problems
Two specific problems have been identified with counter printer operations.
One was associated with the failure to print a second APS receipt, resulting in a
subsequent system hang; this was fixed as part of WP 5406.
[Chart: Counter Printer Problems exc. System Busy, CAP19 to CAP26]
A second problem, associated with incorrect handling of printer failure
conditions within the Giro transaction printing routine, has been identified and
work is progressing on detailed diagnosis and resolution.
5.3.11 APS Problems
A number of APS application problems associated with receipt issue were
identified (including the second receipt problem identified above).
In certain circumstances a failure in the APS receipting routines could leave
buttons locked and a transaction on the stack. This was also fixed as part of WP
5406. A further fix was issued as part of the system freezing work (WP 5208)
to specifically identify to the user the presence of APS recovery operations since
this could give the appearance of a system freeze.
[Chart: APS issues, CAP19 to CAP26]
As can be seen, the overwhelming majority of APS related problems have now
been eliminated.
5.3.12 OBCS Problems
The “Double F1” fix (see section 5.3.6) introduced a further problem, in which
screens jumped during OBCS transactions rather than following normal screen
navigation. This showed up on Help Desk call
analysis as a significant problem following the “Double F1” fix. The majority of
the problems were addressed by WP5490; a fix relating to one further
circumstance was included in WP 5405.
[Chart: OBCS problems, CAP19 to CAP26. Fix: “Double F1” fix WP 5490 - 28/8.]
There have been no further recurrences of the problem.
5.3.13 Counter Printer Busy Problems
One particular class of problem shown up by the “system busy” indicator
relates to a continuing counter printer busy condition returned to the
application. These incidents have now been classified as a particular incident
type in their own right (CAP 24).
[Chart: Counter Printer Busy, CAP19 to CAP26, with forecasts]
A fix for the Riposte Peripheral Server is currently under test and is expected to
be issued to the live estate during CAP 26.
5.4 Resolution of Incident Metrics
Pathway notes the POCL proposed metric of 1 system “lock-up” or “crash”
(requiring reboot) per counter PC per annum.
The Pathway position is that this is an unrealistic and unwarranted requirement
to be placed on the Pathway Solution.
5.4.1. Contractual Requirements
There is no contracted Service Level which Pathway is required to meet relating
to lost time associated with OPS system stability incidents. (Lost time at the
counter may contribute to an increase in the volume of fall-back transactions
which may fall within the service reporting requirements of individual services —
EPOSS, APS and OBCS.)
5.4.2. Comparison against Industry Norms
The POCL proposed level is unrealistically high when compared against normal
operational usage of complex distributed systems based upon Windows NT.
Typical industry norms of 1 event per month are reported.
5.4.3. Acceptance Position
AI 298 was raised against Requirement 536, on the basis of Live Trial usage
experience.
The planned acceptance testing associated with this Requirement was fully
completed with no outstanding issues. This comprised a combination of detailed
technical test and a review of the technical specifications of the relevant
equipment.
ICL Pathway has accepted that there have been some incidents at outlets which
have affected certain aspects of system operation. As detailed within Sections
5.2 and 5.3 there has already been a significant reduction in such incidents from
the earlier levels in June and July when this AI was raised. Pathway set an
internal target of one (authorised) reboot per month per counter and proposes
that achievement of this level reduces the incident to a medium severity. The
levels of lost time associated with the current incident rate fall well within this
yardstick.
5.4.4. Resolution Proposal
POCL has indicated a desire to associate this incident with a further metric
which would represent an “acceptable” ongoing level of operation with respect
to the occurrence of system incidents. A timescale for the achievement of such a
level of operation has also been requested.
ICL Pathway will use all reasonable endeavours to reduce the incidence of
interruptions to normal counter operations resulting from the use of the OPS
platform and the Riposte desktop functions. Pathway has set a medium term (6
months) internal target of 1 such incident per counter per 4 months. This
represents a fourfold improvement beyond the initial target.
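The rates quoted in this document use different time bases (per annum, per month, per 4 months). A minimal sketch placing them on a common per counter per annum basis follows; it assumes a 12-month year and uses the approximate current authorised reboot rate of 0.5 per month per counter quoted in Section 5.2.3.

```python
MONTHS_PER_YEAR = 12

# Rates per counter per annum, derived from the figures quoted in the text.
rates_per_annum = {
    "POCL proposed metric (1 per annum)": 1.0,
    "Pathway initial target (1 per month)": 1 * MONTHS_PER_YEAR,
    "Pathway medium term target (1 per 4 months)": MONTHS_PER_YEAR / 4,
    "Current authorised reboots (approx. 0.5 per month)": 0.5 * MONTHS_PER_YEAR,
}

for label, rate in rates_per_annum.items():
    print(f"{label}: {rate:g} per counter per annum")

# Ratio between the initial and medium term targets (the "fourfold" figure).
improvement = (
    rates_per_annum["Pathway initial target (1 per month)"]
    / rates_per_annum["Pathway medium term target (1 per 4 months)"]
)
print("initial target / medium term target =", improvement)
```

On this basis the medium term target corresponds to 3 incidents per counter per annum, compared with the POCL proposed metric of 1.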
It is important to recognise that ICL Pathway is strongly motivated to reduce
such incidents as they directly affect its own costs through staffing levels
required at the Help Desk. The Pathway Help Desk model and projected
staffing levels are consistent with this approach. For ICL Pathway this equates
to a requirement to deal with up to 700 such calls per week as the outlet
population increases over the next six months (and the incident rate falls).
Clearly Pathway will be strongly motivated to seek any further possible
reductions in incidents to reduce the corresponding call rate applied to a full
estate.
For POCL the achievement of this target would result in a predicted loss of
service of the order of 6.25 minutes per counter per month. For a typical outlet
operational period of 42 hours per week this equates to a loss of service of <
0.06% per counter. In reality lost customer service time is likely to be
significantly less than this since the above calculation:
(i) makes no allowance for the possibility of directing customers to other
counters during an incident
(ii) makes no allowance for that proportion of incidents which occur during
back office processing and have no direct impact on customer service.
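The loss of service figure can be checked with a short worked calculation. The sketch below assumes a month of 52/12 weeks; the lost time per incident is not stated in the text and is inferred here from the quoted 6.25 minutes per counter per month and the target of 1 incident per counter per 4 months.

```python
TARGET_INCIDENTS_PER_COUNTER_PER_MONTH = 1 / 4   # 1 incident per 4 months
LOST_MINUTES_PER_COUNTER_PER_MONTH = 6.25        # figure quoted in the text
OPERATING_HOURS_PER_WEEK = 42                    # typical outlet, per the text
WEEKS_PER_MONTH = 52 / 12                        # assumption

# Lost time per incident implied by the two quoted figures (an inference).
implied_minutes_per_incident = (
    LOST_MINUTES_PER_COUNTER_PER_MONTH / TARGET_INCIDENTS_PER_COUNTER_PER_MONTH
)

operating_minutes_per_month = OPERATING_HOURS_PER_WEEK * 60 * WEEKS_PER_MONTH
loss_fraction = LOST_MINUTES_PER_COUNTER_PER_MONTH / operating_minutes_per_month

print("implied minutes lost per incident:", implied_minutes_per_incident)
print(f"loss of service per counter: {loss_fraction:.4%}")
```

Under these assumptions the result is a little over 0.057%, consistent with the stated bound of less than 0.06% per counter. With a 4-week month the figure would be slightly above 0.06%, so the longer month appears to be the basis used.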
It should be noted that the Pathway incident target does not represent a
contractual obligation (since there is none within the scope of the contract).
However, it is in the mutual interests of both parties to reach this level of
achievement (and indeed to exceed it if possible).
The incident analysis which has been jointly undertaken to date and the
improved level of understanding of system usage within the live outlets both
suggest that the target will be met within the projected 6 months. The most
recent rate of authorised reboot incidents is approaching half the initial target
level, leaving a further required halving to reach the final target. Pathway has
undertaken analysis of several outstanding incidents and diagnosed the detail of
the problem. Software fixes will be progressively released following regression
testing which will see a further reduction on the current incident rates towards
the target. Hence the progression towards the target is already substantially
underpinned by known, diagnosed problems which are awaiting fix issue.
5.5 Improved Defect Removal for Future Releases
The level of testing conducted on the Pathway solution has by any standard
been exceptionally high (over 100 dedicated testers, a staggering array of test
environments, at a cost of 10s of £Millions). The large, complex and distributed
nature of the system demands a sophisticated multi-layered approach to testing
and integration. The strategy was developed and agreed in conjunction with the
sponsor organisations at the outset, and was independently assessed during the
Treasury review as being ‘leading edge’. It has been maintained in the light of
experience of Release 1, and is currently again under review in respect of
Release 2 (CSR). Of particular importance here is the experience of the Live
Trial period, and the lessons that may be learned to further improve the Defect
Removal rate for future releases, and so reduce the number of incidents
encountered in the Live Estate.
5.5.1 PinICL Analysis
A review is underway of all the PinICL fixes applied across the whole of the
Counter systems for the Live Trial Period. This period was split into 3, known
as LT1, LT2, and CSR. Initial findings, measuring up to 31/08/99, indicate that
a total of 133 PinICLs were involved. Of these, 2 were data related (including 1
on POCL Reference Data), 1 was build related, and 2 were purely
administrational to introduce the decommissioning of BPS, leaving 128 software
faults to be considered in all. (It may be of interest to note here that about 30 of
these were for BPS, although this does not have a material bearing on the
analysis.)
Of these 128 faults, just 50 were actually raised from activity in the Live Trial.
The other 78 were all in fact raised during the course of testing. (Most of these
were found long before the Live Trial in Pathway’s System Test and Integration
Test stages or in the MOT/E2E test stages immediately before the Live Trial.
These were the subjects of agreed deferral via the KPR process, to allow for
their controlled introduction during the course of the Live Trial, to avoid
destabilisation. A small number were raised after the KPR, as a result of
Pathway’s ongoing regression testing.)
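The figures quoted above can be reconciled with a short calculation. The following is a minimal sketch using only the counts stated in the text; the variable names are illustrative and are not taken from any Pathway tooling.

```python
# Figures as quoted in the text (Live Trial PinICLs measured up to 31/08/99).
TOTAL_PINICLS = 133

# PinICLs set aside because they were not software faults.
non_software = {
    "data related": 2,
    "build related": 1,
    "administrational (BPS decommissioning)": 2,
}

software_faults = TOTAL_PINICLS - sum(non_software.values())

# Where the software faults were first raised.
raised_in = {"Testing": 78, "Live Trial": 50}

# The two origins should account for every software fault.
assert sum(raised_in.values()) == software_faults

for origin, count in raised_in.items():
    share = 100 * count / software_faults
    print(f"{origin}: {count} of {software_faults} ({share:.1f}%)")
```

On these figures, roughly three in five of the software faults fixed during the Live Trial had first been raised in testing.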
The records for these PinICLs have been analysed to determine the nature of the
defects concerned, and the defects categorised accordingly, to help assess how
the Development Lifecycle, and in particular the testing and integration
approach, may best be revised to detect such defects earlier and so better
protect the Live service. A large number of low level classifications were
used, which can be summarised into the following high level categories:
1. Usability/Robustness:
MMI, Menus, Button locking, No-Entry signs, Double keystrokes, Cosmetics,
Enforcement of correct practice, Operational usability, Correct error
handling, etc.
2. Stability/Performance:
Screen freezes, Printer hangs, Memory leaks, Blue screens, NT messages,
Archiving anomalies, Function performance.
3. Application Logic:
Plain software bugs.
Initial findings indicate that the 128 fixes applied during the Live Trial (78
faults found in Testing and 50 faults found in Live) can be categorised as
follows:
Category                 Testing Faults   Live Faults
Usability/Robustness           38             38
Stability/Performance          14              5
Application Logic              26
(To set these figures in context, overall testing has trapped several thousand
defects, commensurate with the great size and complexity of this system.)
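The category figures can be cross-checked against the stated totals of 78 Testing faults and 50 Live faults. The sketch below is illustrative only. The Live Faults figure for Application Logic does not appear in the table as reproduced here, so the value computed for it is an inference from the stated total, not a quoted figure.

```python
# Category counts as they appear in the table.
testing_faults = {
    "Usability/Robustness": 38,
    "Stability/Performance": 14,
    "Application Logic": 26,
}
live_faults = {
    "Usability/Robustness": 38,
    "Stability/Performance": 5,
    # Application Logic: figure not present in the table as reproduced.
}

# Totals as stated in the text.
TESTING_TOTAL = 78
LIVE_TOTAL = 50

# The Testing column agrees with the stated total.
assert sum(testing_faults.values()) == TESTING_TOTAL

# ASSUMPTION: if the Live total of 50 holds, the missing entry is the remainder.
implied_live_application_logic = LIVE_TOTAL - sum(live_faults.values())

# Share of Live faults falling in the Usability/Robustness category.
usability_share_of_live = live_faults["Usability/Robustness"] / LIVE_TOTAL

print(f"Implied Live Application Logic faults: {implied_live_application_logic}")
print(f"Usability/Robustness share of Live faults: {usability_share_of_live:.0%}")
```

On this reading, about three quarters of the faults raised in Live fall in the Usability/Robustness category.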
The following conclusions can be drawn:
• The overall approach has been extremely successful in reducing the
exposure of the Live Estate to a very small residue of defects remaining in
the system (which the industry recognises can never be entirely eliminated,
although there is always room to improve).
• The incidence of defects discovered is demonstrably reducing over time,
indicating a steady improvement in overall system stability.
• There is clear evidence that the majority of defects in the
Usability/Robustness category have been trapped during testing, despite
this being a notoriously difficult and expensive problem domain to address
exhaustively through testing.
• Nonetheless, the majority of defects escaping capture during test are in the
category Usability/Robustness, suggesting that there really is no substitute
for genuine Live exposure to flush out these types of defect (as per
generally accepted industry wisdom). It also suggests that this is the main
area to target for future improvement, offering more scope. Further to
this, the report from the EPOSS Defensive Test exercise was encouraging.
It indicates that such short focussed test activities, concentrating on
particular aspects of system usage, can have considerable success in
removing defects both of the Usability/Robustness and
Stability/Performance categories.
• Testing has eradicated all but a very few remaining Stability/Performance
defects, albeit that these can impart a disproportionate effect on the Live
Estate, further suggesting the importance of a Live Trial or equivalent
period, where the impact on the business can be limited and controlled.
The fact that a significant number of such defects were still being
discovered in these late testing stages indicates that there is potential for
improvement here also. It suggests that a more detailed analysis of the
precise circumstances of these defects should be conducted to determine
any common factors and to assess whether any benefit is to be had from
specific testing actions earlier in the lifecycle.
• Testing has eradicated all but a very few remaining Application Logic
defects. There is little scope for improvement in this area, other than the
perpetual goal of earlier discovery.
A further observation arising from the analysis would be that many of the
PinICLs arising in the Live Trial system had in fact been the subject of earlier
PinICLs raised during the course of Testing. This is a common phenomenon.
Typically it comes about because, for certain classes of defect (particularly
those related to timing, or to multiple streams of activity in combination), the
symptoms revealing the defect cannot easily be reproduced until the underlying
defect is properly understood; yet because the symptoms cannot be reproduced,
the underlying defect cannot be properly diagnosed. The faults are then often
put down to some flaw in the test environment, or to the wrong code versions
being used, and the PinICL is closed ‘unable to reproduce’. There is no easy
remedy.
5.5.2 Implications for CSR+
A full review of the testing conducted for Release 2 (CSR) has already been
carried out, and a proposal paper has been drafted: “Revisions to the Testing &
Integration Approach for Pathway Release CSR+”. Based on the findings above,
it is now suggested that the following additional proposals be considered for
inclusion in that paper:
a) Analyse the precise circumstances of the defects in the
Stability/Performance category. Identify any common factors.
b) Analyse the precise circumstances of the defects in the Usability/Robustness
category. Identify any common factors.
c) From (a) and (b) above, establish any potential test points for existing
testing stages, and consider extending their respective objectives/review-
checklists accordingly. (Include Unit Test, System Test, and Conformance
Test.)
d) Consider extending Code Review checklists to cover the specifics from (a)
& (b) above.
e) Adopt the principles of the EPOSS Defensive Test exercise for wider
application, and in particular to mount earlier exercises specifically
targeting those attributes identified in (a) and (b) above.
f) Work with POCL in determining appropriate and agreeable alternative(s)
to the Live Trial for future releases, to allow each new product to be
exposed to substantial Live use, but with limited business impact, for an
appropriate period of time prior to general (national) release.
It should be noted that CSR+ has already benefited from the revisions included
in the Testing Strategy and will benefit, in due course, from the additions listed
above. ICL Pathway believes that introducing changes to the Design and
Development stages (other than ensuring that good practice is maintained)
would result in only a marginal reduction of the defects in question. The
majority of the CSR+ functionality has now entered Link Test or System Test,
so it would be sensible at this stage to focus on these and later stages of the
lifecycle.