Westergaard 2000

Y2K FAILURE CURVE VINDICATED?
By Ian Hugo
August 25, 1999

At the beginning of this year, I attempted a reasoned approach to predicting what might happen later in the year as a result of the Year 2000 problem. These thoughts and prognostications, a kind of Old Hugo's Almanac for the Millennium, are published on the Taskforce 2000 website at www.taskforce2000.co.uk under Articles (Predicting Year 2000 Disruption); you can probably find the paper on various other Y2K sites too. Since then (the thinking was actually done towards the end of last year), I've had time to revisit my reasoning, and the passing of time has also allowed some evidence to be gathered. So, as all would-be soothsayers should, I'll try to reassess how my reasoning and predictions are bearing up.

MAJOR ELEMENTS OF THE REASONING

This is essentially a short recapitulation of the main points in the reasoning I produced. It will serve to refresh the key points for readers who read the original paper and to bring new readers up to date.

The first and most obvious key point is that whatever level of disruption we see will not be concentrated at the change of century (01.01.2000). What we shall see is a failure curve, of currently unknown dimensions but possibly predictable shape, operating over some period before and after the century change.

The second key point was the distinction between failures of any kind and disruptive failures. Very many failures are possible, but many can also be recovered from quite quickly. The insight here was that the cure for the so-called Millennium Bug could be quite as dangerous as the disease: ripping interconnected IT systems apart and putting them back together is an inherently risky exercise. So the "implementation" stage of Y2K projects, far from being a formal sign-off, could be the most fraught part of the whole program.

Thirdly, I tried to separate potential incidents that are inherently unpredictable, such as an overlooked chip embedded in an important control system, from what might reasonably be predicted. The second and third points together led me to suggest that implementation in general, and installation of new replacement systems in particular, would be the most likely cause of predictable disruption. Failure in such cases, based on general experience to date, won't be a matter of hours or days but of weeks or months.

Finally, as regards disruptive possibilities, I made the point that a single failure at any given point in time would be unlikely to cause major disruption. Companies can and do occasionally experience major "hits," such as loss of a data center, but can recover from them within days (to all external appearances) if sufficiently prepared. Short-term disruption could occur from a single failure, but longer-term (and possibly terminal) disruption would only occur if multiple impacts were experienced within an overlapping time frame. The latter case I termed "congestion."

That was my analysis of the (predictable) disruptive failure potential; the questions of where and when remained to be addressed. I proposed that the most likely victims would be large organizations that were late in starting Y2K programs, if these could be known. The reasoning behind this assertion was simply that most large programs comprise multiple large projects which, given sufficient time, can be scheduled so that their completion dates are staggered within the ultimate deadline.
This allows contingency time between the scheduled completion of one project and that of the next. Large and late programs have to be telescoped into the time available, with all completion dates falling within a short period; they thus have much greater potential for overlapping failures.

As to when the beginning of the failure curve might become discernible, I suggested the middle of this year. That prediction was based on the assumption that large and late programs would be in the implementation phase from June of this year.

REASONING REVISITED

Since I started this line of reasoning, slightly less than a year ago, I have had some opportunity to reflect upon it. Nothing I have thought of since has caused me to change the line of reasoning itself. However, I have had cause to reconsider my thoughts on timing. Just about every organization I communicate with, even those with very well managed programs, is late in delivery. To that extent, I believe I may have misjudged the time at which late implementations would be attempted and at which congestion would occur. I predicted the build-up of congestion as beginning from June and now think that may be too early, by about a quarter. Congestion can occur earlier (in fact it has done so; see below), but congestion for purely Y2K reasons, if it occurs, I now believe will occur later. Paradoxically, since I was originally arguing against a focus on 01.01.2000, if my revised timing is correct the effects will be seen most visibly around the change of century, even though the causes precede that date. The result of my current revisions is that I think the failure curve I originally drew should stand as regards its shape but be moved forward in time by about three months.

THE EVIDENCE

At this point I need to acknowledge a debt of gratitude to UK Government departments and the media. The former (and other infrastructure bodies) seem kindly determined to prove my predictions correct, and the latter have been assiduous in reporting the resultant failures. The evidence to date comes from reported incidents at the John Radcliffe Hospital, the Maritime and Coastguard Agency, London Electricity, National Air Traffic Services (NATS), the Department of Social Security (DSS), the Inland Revenue and the Passport Agency. There are more, but we'll leave those aside for the moment. All the examples below are from the UK; there is similar evidence available on the Internet from the USA, although not (yet) from other countries, which is something I'll also comment on later.

Briefly, the John Radcliffe Hospital in Oxford failed in a PABX replacement (undertaken for Y2K reasons) and the failure resulted in the loss of telephone communications for nine hours, as reported in the Daily Express. In fact, the failure appears to have resulted in the loss of full facilities for over 24 hours and caused a neighboring hospital to be put on alert. The Maritime and Coastguard Agency had problems with the replacement of its Adas (data acquisition) system, which resulted in some minor disruption over a couple of weeks. London Electricity attempted to replace some thousands of key-controlled meters because they would not have been able to record price changes beyond the end of this year. The new keys didn't work (they cut off supply), resulting initially in some 2,000 users being disconnected. The last I heard was that the replacement program had been temporarily aborted whilst thumbs were stuck in mouths (or in the air).
NATS proposed to resort to manual operation for a two-hour period in order to get some compliant replacement equipment installed. This produced some consternation amongst Members of Parliament because (a) the replacement was scheduled for a peak traffic period and (b) they had been led to believe that NATS systems were already compliant. The replacement reportedly failed, leading no doubt to more thumb-sucking.

I have highlighted the Inland Revenue Y2K program as high risk in my last three assessments of UK central Government readiness (the latest is viewable at www.taskforce2000.co.uk/articles) because of replacement programs scheduled for late this year. The first of these (Infrastructure 2000), previously due for completion in September and now for November, is already hitting problems. The Bradford Midland Tax Office has apologized to various companies for threatening to send in bailiffs to collect amounts supposedly due but which had already been paid. The problem was that failures attributed to the Infrastructure 2000 project (to replace 50,000+ desktops, 30,000 of which were classified as critical) prevented staff from accessing current information.

The DSS has failed in the implementation of a replacement National Insurance Contributions system which, according to a report to the House of Commons Public Accounts Committee, currently has some 1,500 faults in it (low by Microsoft standards?). The failure has resulted in the miscalculation of benefits for some 350,000 people, of which 70,000 cases remained to be cleared at the beginning of August. Finally, the Passport Agency is dealing with a reported backlog of some hundreds of thousands of passport applications, in part because of a failed implementation of a new passport issuing system (PASS) to replace a previous and non-compliant one (PIMIS).

RECONCILIATION OF PREDICTIONS AND EVIDENCE

All these cases, all reported in the mass media, tend to support my earlier conclusion that replacement of systems was likely to be the most significant cause of disruption in the Year 2000 context. In effect, that is simply recognition of the fact that, in this context, the remedy is about as dangerous as the disease. I think there are a few important points to highlight.

The first is that nothing blew up and nobody got killed. Moreover, in all of the above cases other than that of the Passport Agency (and arguably the Inland Revenue), there has been and is unlikely to be any long-term mess. These are cases of local and containable, albeit inconvenient and embarrassing, administrative "hiccups." That much was in my original prognostications.

The more interesting cases are the Passport Agency and possibly the Inland Revenue. The Passport Agency is a long-term mess. It results from the failure to adequately implement a new passport issuing system, overlapping with a second impact: new Government legislation on passports that produced a large and sudden increase in demand. It is the coincidence in time of the two impacts that has produced the longer-term disruption, as predicted in my paper. At the moment, the Inland Revenue case is producing only minor and locally containable disruption. However, three further replacement projects are scheduled for completion in October and November and, should failure in any of these overlap with continuing disruption from the Infrastructure 2000 project, we could well see longer-term disruption here also.

UNREPORTED CASES

Cases of Y2K failures resulting in disruption that get reported in the media must be the tip of the iceberg.
Common sense dictates that this must be so. Even with my limited knowledge of individual organizations, I know of two further cases, which I cannot name, in which internal disruption is occurring and has necessitated a resort to manual operation of processes. In both cases the situation is unrecoverable before the financial year-end, and whether the results become public will depend very much on the attitude of the organizations' auditors. I cannot believe these are isolated cases. Whether these and doubtless other cases result in major disruption probably depends less on auditors' attitudes than on whether the organizations concerned experience another significant "hit" in an overlapping time frame. But a good question to ask now is: how many organizations do you know, or suspect, are already running some previously automated processes manually? They must be at serious risk, either of getting into an unrecoverable situation or of experiencing a second or third and potentially terminal failure.

The other question I would like to pose relates to what is happening in other countries. It seems certain that the UK recognized and started work on the Y2K problem earlier than most other countries. The UK should therefore be more advanced, and I would expect other countries to be experiencing more of the kinds of problems described above than the UK. But there are no reports to confirm this. This leads to three possible conclusions. Firstly, that countries starting late have caught up and instituted better-quality programs. Secondly, that they haven't, but their failures are not being reported in the national media. Thirdly, that they have yet to progress to the stage the UK has reached and are thus not yet experiencing the failures reported in the UK.

There is a fourth, and rather more alarming, possibility: that other countries are relying heavily on a "fix on fail" strategy. I won't go fully into the folly of such a strategy here. Suffice it to say that it relies on the assumption that you will know when a system fails, which is quite unlikely unless you create "traps" to detect failure (a minimal sketch of such a trap appears at the end of this article); and data corruption, if it goes undetected for any appreciable length of time, may well be unrecoverable.

FUTURE PROGNOSTICATIONS

I believe that we will see increasing numbers of the kinds of failure described above as this year progresses. Whether they result in anything more than a few days' local brouhaha and a couple of column inches in the Press will depend, as I've said before, on whether second (third or fourth) impacts occur, from any source, in the same time frame. If my expectation of increasing numbers of failures proves true, then the probability of multiple "hits" occurring within a single organization must grow correspondingly. That is what I have termed "death by attrition." Also, the probability of individual hits weakening individual organizations in a single chain of dependencies must similarly increase: death by a thousand cuts.

Thus far, there is no evidence to support "end of the world" scenarios, and we must all hope that no such scenario will be realized. However, it would be foolish not to recognize that the early seeds of potential widespread disruption are already sprouting.
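A footnote on the "traps" mentioned under UNREPORTED CASES above. The article does not say what such a trap would look like, so what follows is only a minimal, hypothetical sketch: a periodic scan of newly written records that flags dates betraying a century error before the corruption spreads too far to recover. The record layout, field names and plausibility window are illustrative assumptions, not details of any system discussed above.

    from datetime import date

    # Hypothetical "trap" of the kind a fix-on-fail strategy would need:
    # scan recently written records and flag any whose dates betray a
    # century error (e.g. year 00 stored as 1900, or naive "19" + year
    # string arithmetic producing 19100). Layout and thresholds are assumptions.

    EARLIEST_PLAUSIBLE = date(1995, 1, 1)    # assumed floor for newly created records
    LATEST_PLAUSIBLE = date(2001, 12, 31)    # assumed ceiling for forward-dated records

    def trap_suspect_dates(records):
        """Return (record, reason) pairs for records whose date looks century-damaged."""
        suspects = []
        for rec in records:
            try:
                d = date(rec["year"], rec["month"], rec["day"])
            except ValueError:
                # Catches impossible values such as year 19100 or month 0.
                suspects.append((rec, "unconstructible date"))
                continue
            if d < EARLIEST_PLAUSIBLE:
                suspects.append((rec, "implausibly old: %s" % d))
            elif d > LATEST_PLAUSIBLE:
                suspects.append((rec, "implausibly far ahead: %s" % d))
        return suspects

    if __name__ == "__main__":
        batch = [
            {"year": 1999, "month": 8, "day": 25},   # normal record
            {"year": 1900, "month": 1, "day": 3},    # "00" written out as 1900
            {"year": 19100, "month": 1, "day": 3},   # classic string-concatenation bug
        ]
        for rec, reason in trap_suspect_dates(batch):
            print(reason, rec)

Even then, the author's caveat stands: a trap of this kind only tells you that bad data has started to appear; it says nothing about how much was written, or passed to other systems, before the scan first caught it.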