Monday, 1 February 2010

41 - Multiples of (B-1) where B is the number base being used always ultimately sum to (B-1) when summed in the number base B

After I finally got to the end of my seemingly interminable series of posts on my PhD, I thought I'd follow it up straight away with something, in the same sort of way that one might eat a chocolate straight after swallowing back down some vomit. However, I didn't actually post this at the time, but thankfully it's about maths. Well, numbers, really, and how they work, and so it's fairly timeless.

I like numbers. This is apparent to anyone who knows me well. One of my favourite numbers is the number 9. 9 is a good number, and it has lots of properties which make it lovely. For instance, any multiple of nine has digits which sum to a multiple of nine, and any number whose digits sum to a multiple of nine is itself a multiple of nine. Take the number 27602742108312: its digits add up to 45, whose digits add up to 9, and therefore 27602742108312 is a multiple of 9.
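
A quick way to check the sums above is a few lines of Python. This is only a sketch; the function name digit_sum is my own choice for illustration.

```python
def digit_sum(n):
    """Add up the decimal digits of a non-negative whole number."""
    return sum(int(d) for d in str(n))

n = 27602742108312
first = digit_sum(n)        # sum of the digits of the big number
second = digit_sum(first)   # sum of the digits of that sum
print(first, second, n % 9 == 0)
```

The last value printed uses the remainder operator to confirm directly that the number divides by 9.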

Some years ago, this property intrigued me. I was on a train and fairly bored, so I had a think about it. Why is this true?

If one writes a list of the multiples of 9, one gets the following:

9, 18, 27, 36, 45, 54, 63, 72, 81, 90, ...
Remember your primary school maths lessons? Remember how, when you were being taught about long addition, they spoke about "units, tens, hundreds, thousands"? Well, this is being used here. Every time you add nine to the previous number, the units digit, if it is greater than 0, is reduced by 1, while 1 is added to the tens. Thus, the changes balance out, and the digit sum remains at 9.

But it's not quite that simple. The reason why we count in units, tens, hundreds, and so on is because we use the decimal number system. That is, we count in base 10. It's possible to count in other bases - if you've ever used a tally counting system, that's using base 1. If you understand binary, that's base 2. Hexadecimal is base 16, and so on. Whatever base you're in (let's call it B), the digits in the numbers you use are arranged in a very specific format:

... B⁵ B⁴ B³ B² B¹ B⁰

So, in base 10, we count in:

... 10⁵ 10⁴ 10³ 10² 10¹ 10⁰

Which works out as:

... 100000 10000 1000 100 10 1

So, if you want to represent the number "35673" in decimal, you're saying you have three ten thousands, five thousands, six hundreds, seven tens and three units. If you are counting in binary, these numbers are:

... 32 16 8 4 2 1

So, if you have the binary number "101101", you're saying you have one thirty-two, no sixteens, one eight, one four, no twos and one one, which is the same as 45 in decimal.

If you are counting in ternary (base 3), your columns are:

...243 81 27 9 3 1

And thus the number "201212" represents two times 243, no eighty-ones, one twenty-seven, two nines, one three and two ones. 486+0+27+18+3+2=536 in decimal.
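
The column arithmetic above can be sketched in Python. The helper name from_base is made up for this example, and it only handles bases up to 10, where every digit is a single numeral.

```python
def from_base(digits, base):
    """Convert a string of digits written in `base` (2 to 10) to a whole number.

    Working left to right, each step shifts the running total up one column
    (multiply by the base) and then adds the next digit.
    """
    total = 0
    for d in digits:
        value = int(d)
        if value >= base:
            raise ValueError(f"digit {d} is not valid in base {base}")
        total = total * base + value
    return total

print(from_base("35673", 10))   # the decimal example
print(from_base("101101", 2))   # the binary example
print(from_base("201212", 3))   # the ternary example
```

Python's built-in int("201212", 3) does the same job; the loop is written out only to show the columns at work.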

Whenever we count on our hands, we use each finger to represent a one - we are counting in base 1. However, if we count in base two on our hands, we can get a much larger range of numbers:

Hold your hands in front of you, palms facing you. Put down all your fingers into fists. Imagine that this represents zero. Whenever you raise a finger, that puts a '1' into the number that that finger represents, while a finger being down represents a '0'. (Note that this assumes complete independence of finger movement, which isn't quite true for the ring and little fingers, but it is good enough for a demonstration). Raise your right-hand thumb. This number is therefore 0000000001 in binary, or 1 in decimal. Put your thumb down, and raise your right index finger. This is 0000000010, or 2 in decimal. Raise your thumb again. This is 0000000011, or 3 in decimal. Put both these down and raise your middle finger. Apologise to whoever is now looking at you in a very offended way, and tell them that this is 0000000100, or 4 in decimal. If you keep counting in this way, by the time you get to raise the thumb on your left hand, you've counted all the way up to 512. Raise all your fingers, and this represents 1023. On two hands which you previously thought could only count up to ten.
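
If you'd rather not wave your fingers about in public, here is the same counting scheme as a small Python sketch. The finger labels and the function name hand_value are invented for the example; the ordering runs from the right-hand thumb (the units) across both hands to the left-hand thumb.

```python
# Fingers in order of increasing value, with palms facing you:
# right thumb is the units column, left thumb is the highest column.
FINGERS = [
    "R thumb", "R index", "R middle", "R ring", "R little",
    "L little", "L ring", "L middle", "L index", "L thumb",
]

def hand_value(raised):
    """Return the number shown by the given collection of raised fingers."""
    return sum(2 ** i for i, name in enumerate(FINGERS) if name in raised)

print(hand_value({"R thumb"}))
print(hand_value({"R thumb", "R index"}))
print(hand_value({"L thumb"}))
print(hand_value(set(FINGERS)))
```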

But all this is a minor distraction. Hopefully, you're now familiar with number bases. Let's say that we have a number base A, where A merely represents any number. If A is 2, we are counting in binary. If A is 10, we are in decimal, and so on. If we take the number A-1, any multiple of A-1 will have digits which ultimately sum to A-1 when added up in base A. To show this, in senary (number base 6), we count as follows, with senary on the left hand and decimal on the right:

0 - 0
1 - 1
2 - 2
3 - 3
4 - 4
5 - 5
10 - 6
11 - 7
12 - 8
13 - 9
14 - 10
15 - 11
20 - 12
21 - 13
22 - 14
23 - 15
24 - 16
25 - 17
30 - 18
31 - 19
32 - 20
33 - 21
34 - 22
35 - 23
40 - 24
41 - 25
42 - 26

We are counting in base 6, and so I am saying that any multiple of five (6-1, for those not keeping up) will have digits which sum to a multiple of 5. I have highlighted these in bold above. The multiples of 5 in senary are 5, 14, 23, 32, 41, 50, 55, 104, 113, 122, 131, 140, 145, and so on. The digits of each of these numbers add up to a multiple of five. However, look at that last number: 145. Its digits add up to 10 in decimal, which is clearly 5 x 2, but not 5 itself. This is because we need to add the digits in senary, not decimal: 1 + 4 + 5 equals 14 in senary, and these digits add up to 5. This property is true for all bases.

Again, in hexadecimal, the multiples of 15 (represented as F) are: F, 1E, 2D, 3C, 4B, ..., F0, FF, 10E, and so on. For the number FF, the digits in decimal add to 30, which is, again, a multiple of F. In hexadecimal, F + F = 1E, and 1 + E add to F.
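
The senary and hexadecimal examples can be checked by brute force. The sketch below is not a proof, just a test over the first thousand multiples in a few bases; the function names are my own.

```python
def to_digits(n, base):
    """Return the digits of n written in `base`, most significant first."""
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1] or [0]

def ultimate_digit_sum(n, base):
    """Keep summing the base-`base` digits until a single digit is left."""
    while n >= base:
        n = sum(to_digits(n, base))
    return n

print(to_digits(65, 6))             # 65 in decimal, written in senary
print(ultimate_digit_sum(65, 6))    # 1 + 4 + 5, then summed again in senary
print(ultimate_digit_sum(255, 16))  # FF in hexadecimal

for base in (6, 10, 16):
    holds = all(
        ultimate_digit_sum(k * (base - 1), base) == base - 1
        for k in range(1, 1001)
    )
    print(base, holds)
```

Summing the digits as ordinary integers and then rewriting the total in the same base is equivalent to doing the addition in that base, which is why the loop can get away with plain arithmetic.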

I'm now going to generalise this a bit. It gets a bit technical, so bear with me.

Whenever we count in base A, we set up our columns so that, as above, we have

... A⁵ A⁴ A³ A² A 1

We set all the values to zero, and begin incrementing the units column by one. When we get to the number "A", we set the units column to zero and the "A" column to 1. Thus, any number which is smaller than A lies only within the units column. Likewise, any number which is smaller than A² lies purely within the "A" and units columns, and so on. Similarly, if a number is greater than (A-1), the number must lie in more than just the units column. This is a very important property. Using this property, a table can be constructed showing all of the digits of any multiple of (A-1) in any base A, and their sum:

N | N x (A-1) | A³ | A² | A | 1 | Digit Sum



From this table we can see that, at least up to a certain point, all of the multiples of (A-1) have digits which sum to a multiple of (A-1). I believe that it extrapolates to all multiples of (A-1) in any base A. Now, recall that I, at the time, was considering all of this on a train. Rather like Fermat, I came up with a terrifically brilliant explanation for why this was so. But I didn't write it down and now I can't remember. This latter point I put down to old age. Bear in mind also, that I've forgotten much of the mathematics I used in my degree, which is something I intend to remedy at a later date.

Still, the concept is a very interesting one, and it's something that I've not come across elsewhere. If one of my (mumble) readers wants to point me in the direction of an interesting explanation for this from someone else, please do. I've yet to find one.

Wednesday, 25 November 2009

39 - Petri Nets

So here it is, Merry Christmas. I wish I could confirm that everybody is having fun, but seeing as this has been a series of blogs about various aspects of a very dull topic, I’ll be amazed if anybody is still reading it. Never mind, though, because this is the final post of this series! Woo!

Thus far, I have explained the importance of measuring various reliability characteristics for a given item. I have briefly explained how this is done, both qualitatively and quantitatively. I have given an explanation of how my work stems from this – with relevance to working out the probability of successfully completing a mission or series of missions. This final section explains how, in my PhD, all the various concepts that I need to model are modelled.

As you can imagine, it is very difficult to model factors such as bringing online a redundant system to cover the failure of a main counterpart, or the method of prediction of future component failure. In order to be able to do this, then, a tool needs to be employed which at least has the capability of modelling these. This is not really true for fault tree methods or Markov methods, for instance, due to the limitations of the models produced. Luckily, I was introduced to Petri nets.

A Petri net uses several components to represent things:

  • Places – these are shown graphically as circles, and are used to store values, known as tokens.
  • Tokens – these are the values that are stored within places. They can move around through the switching of transitions. A whole number of tokens is stored in each place, and there is no upper limit on this number.
  • Transitions – these allow tokens to be transferred, created or destroyed. They can have a time delay attached or not. They operate through a strict logical set of states.
  • Arcs – these connect places to transitions, and vice versa. Places only connect to transitions, and vice versa, and there can be any number of arcs between a given place and transition. If there is more than one, however, these are grouped together into one, and a weighting or multiplicity is attached to the arc, indicating its size.
Now, this may seem perfectly simple. It may seem really quite complex, but thankfully, I am here to help you see how these simple components can end up producing some very interesting things. The diagram below shows a simple Petri net shown before and after a time t. If it helps, think of t as being 10 seconds, so at the point 10 seconds, the diagram changes from the first net to the second.

The diagram, which we can call Jeff, shows a number of places (the circles), with arcs (the arrows) leading to or from a transition (the rectangle). Notice that in Jeff, some of the arcs have a weighting greater than one, shown by a small slash with a number next to it. Each of the places has tokens (the small dots) in them.

The diagram demonstrates the mechanism by which the dynamic capability of Petri nets is achieved: transition switching. A transition will usually have places which input to it (in Jeff, there are three of these), and those which take outputs from it. The switching process works as follows:
  1. Enabling: A transition is enabled when each input place holds at least as many tokens as the weighting of its arc into the transition. For instance, in Jeff, the left-hand net shows the three input places as having two, one and five tokens respectively. The arcs from these places to the transition have weightings of two, one and four respectively. Thus, every arc has enough tokens to travel down it, and so the transition is enabled.
  2. Time-delay: Once the transition is enabled, a time-delay may exist which must expire. This delay can be a set value, of, say, 10 seconds, or an hour. Alternatively, it can be randomly sampled from a given distribution of times.
  3. Switching: Once the delay, if it exists, has expired, the switching takes place. This removes from each input place the number of tokens given by the weighting of its arc, and deposits in each output place the number of tokens given by the weighting of its arc.
And that’s it. Beyond that, there are some more complications, such as inhibitor arcs, but that’s pretty much it.

So I have explained the mechanism by which Petri nets work, but have not really mentioned what they’re actually for. This may be a problem as my wife is always complaining that I’m useless at explaining things to the layman. But I’ll give it a go.

Consider Jeff again. It could really represent anything you want, but imagine instead that it’s a model of how to make a particular type of biscuit. You need 2 oz of flour, 1 egg and 4 oz of sugar. But you put in 2 ounces of flour, one egg, and five ounces of sugar (it’s not a healthy biscuit, and it may not work in reality) – represented in the input places. You wait a while to cook it in the oven, then afterwards you are left with four biscuits, and an ounce of sugar, because you used too much of that, and the biscuit complained at you.
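
For the programmers, the biscuit version of Jeff can be sketched in Python. The place names, the function names and the dictionary layout are all my own choices for illustration, the time delay is left out entirely, and this is not the code from my actual program.

```python
def is_enabled(marking, input_arcs):
    """A transition is enabled if every input place holds at least
    as many tokens as the weighting of its arc."""
    return all(marking[place] >= weight for place, weight in input_arcs.items())

def switch(marking, input_arcs, output_arcs):
    """Return the new marking after the transition switches (no time delay modelled)."""
    if not is_enabled(marking, input_arcs):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, weight in input_arcs.items():
        new_marking[place] -= weight          # tokens removed from input places
    for place, weight in output_arcs.items():
        new_marking[place] = new_marking.get(place, 0) + weight  # tokens deposited
    return new_marking

# Tokens in each place before baking.
before = {"flour_oz": 2, "eggs": 1, "sugar_oz": 5, "biscuits": 0}
recipe = {"flour_oz": 2, "eggs": 1, "sugar_oz": 4}   # input arc weightings
result = {"biscuits": 4}                             # output arc weighting

after = switch(before, recipe, result)
print(after)
print(is_enabled(after, recipe))   # can we bake a second batch?
```

Returning a fresh dictionary rather than changing the old one keeps the "before" and "after" nets side by side, much like the two halves of the diagram.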

As another, more relevant, example, consider the case of a component. A component works at the start of the model. It then fails at some time. After another length of time it is repaired. After some more time, it fails again, and so on. What we have here is two different states – working and failed, and two different ways of switching between them. If we have two places, one for each state, we can use a single token to represent which state the component is currently in. If we have two transitions, these can allow the switching between the states, effectively modelling the processes of “failure” and “repair”. This PN will look something like that in the diagram below, which I’ve called Albert.

In Albert, the top transition is enabled by the single token. It will wait for a certain length of time (the time it takes the component to fail) before switching, representing the component as “failed”. Once this is true, the bottom transition is enabled, which again waits for a length of time (the repair time) before switching the component back to “working” again.
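
A rough sketch of Albert in Python follows. It assumes, purely for illustration, that both the time to failure and the repair time are drawn from exponential distributions, and the mean times used are made-up numbers rather than data from any real component.

```python
import random

def simulate_component(mean_time_to_fail, mean_time_to_repair, horizon, rng):
    """Follow the single token round the working/failed loop until `horizon`.

    Returns the fraction of the total time spent in the "working" place.
    """
    clock = 0.0
    working = True          # the token starts in the "working" place
    time_working = 0.0
    while clock < horizon:
        mean = mean_time_to_fail if working else mean_time_to_repair
        delay = rng.expovariate(1.0 / mean)    # sampled transition delay
        delay = min(delay, horizon - clock)    # do not run past the end
        if working:
            time_working += delay
        clock += delay
        working = not working                  # the transition switches
    return time_working / horizon

rng = random.Random(1)   # fixed seed so the run is repeatable
fraction = simulate_component(1000.0, 50.0, 1_000_000.0, rng)
print(round(fraction, 3))
```

With these hypothetical means the long-run fraction should settle close to 1000 / (1000 + 50), which is a handy sanity check on the loop.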

This simple example is at the small end of a whole world of possibilities of modelling: using this system we can easily model the process of a mission from phase to phase. We can create Petri net representations of fault trees. We can make these fault trees cause phase failure, and thus mission and MFOP failure. We can activate or deactivate certain components, allowing for redundant systems to be modelled. And so on. The possibilities are endless. No, really, the possibilities really are endless: PNs as they are usually used in everyday life (haha), with extras such as inhibitor arcs, are Turing complete.

Using PNs, I have managed to create a modelling method for everything considered in my PhD: prognostic systems, sensors, a fleet of aircraft performing MFOPs which contain multiple phased missions, mission abandonment, phase insertion, redundant systems, and so on. These are all packaged together in a rather nifty computer program, which takes inputs on things such as mission data, component failure rates, phase failure logic, enabler data and so on, and creates all these lovely PNs.

What my program then does is to generate random times for component failure, and see what happens when these failures occur – do they cause phase failure? Does it put the aircraft out of action, or just abandon the mission?

If you build up enough simulations on this sort of thing, you get a very good idea of how well the overall platforms perform, and thus where the major problems are.
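
The idea of building up lots of simulations can be shown in miniature. The sketch below is nothing like my real program: the three components, their failure rates, the mission length and the failure logic are all invented for the example, and failure times are assumed to be exponentially distributed.

```python
import random

# Hypothetical failure rates per flying hour; none of these come from real data.
FAILURE_RATES = {"engine": 0.01, "radio_main": 0.05, "radio_backup": 0.05}
MISSION_HOURS = 10.0

def mission_succeeds(failure_times):
    """Assumed failure logic: the mission is lost if the engine fails,
    or if both the main radio and its redundant backup fail."""
    failed = {name for name, t in failure_times.items() if t < MISSION_HOURS}
    engine_lost = "engine" in failed
    radios_lost = {"radio_main", "radio_backup"} <= failed
    return not (engine_lost or radios_lost)

def estimate_success(trials, rng):
    """Run many simulated missions and return the fraction that succeed."""
    successes = 0
    for _ in range(trials):
        # Generate a random failure time for every component.
        times = {name: rng.expovariate(rate) for name, rate in FAILURE_RATES.items()}
        if mission_succeeds(times):
            successes += 1
    return successes / trials

estimate = estimate_success(20000, random.Random(42))
print(round(estimate, 3))
```

The more trials you run, the less the estimate wobbles, which is the whole point of doing thousands of them.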

And that’s my PhD. It’s taken over five years, it’s made me cry and want to flagellate myself, and it’s nearly over. And when it is, you and I can laugh and drink and eat and forget all about it, pretend that it never happened (other than you having to call me “Doctor”), and get on with our lives in happy ignorance of reality.

Tuesday, 3 November 2009

38 - Maintenance-Free Operating Periods

There's just two more of these posts, then you can consider yourselves educated and can talk to me about my doctorate without me having to start the conversation with the words "Right, well, you know military aircraft, yeah? They have to fly lots of missions, yeah?..."

So. You know military aircraft? They have to fly lots of missions. Missions missions missions. All day long. A mission here, a mission there, a mission everywhere. But, as we also know, things can go wrong in missions. Evil Muslim Terrorists can fire Russian Rockets from their Russian Rocket Launchers and destroy planes. Idiot Americans can accidentally Bomb Aylesbury. Wings can fall off. Luxury cars can fall out of the back of the aircraft, landing bonnet-first in a swamp where an Indian man holding a goat on a piece of string stands looking puzzled.

So, some time ago, around 1995-6, the Ministry of Defence (MoD) posited the creation of a new way of measuring the effectiveness of military aircraft, with respect to reliability. This was called a "Maintenance-Free Operating Period", or MFOP for short. The idea was that it's much more useful to the RAF to be able to send out an aircraft to complete lots of missions back-to-back, without the need for any emergency maintenance, and with a high degree of confidence that this will actually work. Once this period (called the MFOP) is finished, the platform undergoes lots of maintenance all at the same time, with parts swapped in and out, inspections made, damage repaired, and so on. This second period is called a Maintenance Recovery Period, or MRP. After that, the plane goes off again to destroy whatever Innocent Civilians have taken the Government's fancy this week.

Before 2004, when I started the PhD, the little research that existed had investigated this concept, and decided that several potential "improvements" to a platform could be made in order to reach the desired MFOPs and confidence levels. These are:
  1. Improving the inherent reliability characteristics of the components in the platform - understand each of their typical failure distributions, parameters, causes of failure and how these can be minimised, and so on.
  2. Put in place systems or components which usually are switched off. These can be used as a back-up to take over from important systems which may fail. (This is known as redundancy).
  3. For electronic components, make use of a relatively new concept called reconfigurability - the ability of avionics to sense a failure of one of their modules, and adapt their configuration to take account of this, and continue operations as normal.
  4. Design platforms and plan missions and repairs such that finding where failures have occurred in systems (diagnostics) is easy, quick and cheap.
  5. Use systems which can predict the future failure of components and the effect these are likely to have on upcoming missions (prognostics).
All fairly boring stuff, I'm afraid. There do exist, as with Phased Missions, one or two very simple mathematical models but these fail to cut too deep into the issues at heart. So my PhD has to, in addition to considering phased missions modelling, model the performance of MFOPs. In a fleet of aircraft.

There's not really a great deal to say about this subject, short of the fact that it's unlikely that I'll be publishing a thesis set to light the reliability world ablaze with amazing discoveries. It's an idea, but one which probably, ultimately, will not work, because people like things the way they are.

The final post will be about Petri nets. That one will have lots of pretty pictures.

Monday, 2 November 2009

37 - Phased Missions

It should, hopefully, be clear to you now that my work involves estimating the probability of systems failing. So far, this has not been too difficult: break things down, put numbers in, get things and numbers out.

Things can rapidly get more complicated, however. Sometimes, systems go through different periods, where certain sub-systems are activated or deactivated at certain times. An example of this would be an aeroplane – the wheels will be up (stowed away) or down (in use) depending on whether the plane is in flight or not. A failure of the landing gear during flight wouldn’t be an issue at that time – it’s only when the plane is coming in to land that panic would set in, and, no doubt, some big black dude attempts to get the muthaf***in' snakes off this muthaf***in' plane. So to speak.

Imagine, then, that the plane is performing a mission. This particular aircraft is of the military variety, and it’s flying off to bomb some innocent Iraqi civilians. The different stages of the Innocent Iraqi Civilian Bombing Mission could be:

1. Taxi to runway
2. Take-off
3. Ascent
4. Transit Flight to Innocent Iraqi Town
5. Descent to Bombing Height
6. Bombing of Innocent Iraqi Civilians
7. Ascent to Transit Height
8. Transit Flight Back to Base
9. Descent
10. Landing
11. Taxi to Hangar
12. Dressing-gown, Whisky, Cigar, Long-Haired Cat, Estimated Death Count, Tirade About Dirty Arabs, Job Well Done.

In each of the stages above, the aeroplane will have different systems in use. These different stages are known as "phases". Because of the differing systems, the ways in which the aeroplane's failure can be expressed will change from phase to phase. Also, the stresses on the various sub-systems will change, possibly affecting component failure rates. As such, to get an accurate picture of the probability of the aeroplane getting through the Innocent Iraqi Civilian Bombing Mission without being shot down by Evil Muslim Terrorists With Russian Rocket Launchers, one must consider each of these stages, or phases, separately.

And so off we trot, putting together fault trees for each phase of the system failure event "Plane In Innocent Iraqi Civilian Bombing Mission Shot Down By Evil Muslim Terrorists With Russian Rocket Launchers", (or PIIICBMSDBEMTWRRL for short). How, then, do we come up with a figure for the success of the overall mission?

Well, one factor is that if any one phase fails, the entire mission fails. But, just to complicate matters, one has to consider whether or not the plane's failure is one where Evil Muslim Terrorists Shot Down The Aircraft And Then Stole All Our Technological Secrets And Killed The Crew, or whether We Forgot Which Country We Were In And Accidentally Destroyed Aylesbury. One is a catastrophic failure, where the platform is lost, the other is a mission failure, where the objectives have not been completed but further missions are possible. The two levels of failure are quite distinct, and may require completely different phase fault trees for each one.
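
To get a feel for the "any one phase fails, the entire mission fails" rule, here is a deliberately crude Python sketch. It treats every phase as independent of the others, which real phased-mission methods do not do (the same components carry over from phase to phase), and the phase names and success probabilities are invented.

```python
import math

# Hypothetical probability of getting through each phase without failure.
phase_success = {
    "taxi to runway": 0.9999,
    "take-off": 0.999,
    "transit out": 0.998,
    "bombing": 0.99,
    "transit back": 0.998,
    "landing": 0.999,
}

# The mission succeeds only if every phase succeeds, so under the
# independence assumption the probabilities simply multiply together.
mission_success = math.prod(phase_success.values())
mission_failure = 1.0 - mission_success

print(round(mission_success, 4), round(mission_failure, 4))
```

Even with every phase looking very reliable on its own, the chance of getting through the whole lot is lower than that of the weakest phase.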

Other interesting factors include inserting phases into the middle of missions, such as when a mid-air refuelling is needed. Hilariously, this very situation occurred with the Nimrod aircraft some time ago. And a catastrophic failure occurred, everyone died, and much hand-wringing began. Or we may need to alter our strategy midflight, because We Accidentally Bombed Aylesbury and so May As Well Bomb Milton Keynes While We're At It. Or the Accidental Bombing of Aylesbury means we have to Abandon Mission and get back to base before anyone realises what's happened. Or the weather got in the way of our Innocent Iraqi Civilian Bombing campaign, and so We Had To Bomb The Afghanis Instead, something which only happens every Thursday.

Considering all these factors can often be very tricky. My PhD is partly to do with evaluating the situations where inserted phases have occurred, or missions have been abandoned, or a probabilistic event (like the weather) forces a change of tack a certain proportion of the time. For a very simple phased mission, mathematical methods already exist to solve them, and have done since the mid-seventies. One of the standard methods assumes a non-repairable system (which can be a very unworldly assumption indeed) and converts all those pretty phase fault trees into a giant behemoth of a mission fault tree. This can then be solved in the usual way. However, this method is a bit big and takes a long time (which engineers hate), so some other methods have been devised to sort through the nitty-gritty and try to get accurate answers more quickly. I won't mention them here, as I'm only giving a brief overview of the problem, but further reading can always take place by reading the relevant papers. Leave a comment if you care.

The penultimate blog in the series is coming next. This is about the other half of what I have to investigate - Maintenance-Free Operating Periods. Yummy.

Monday, 26 January 2009

33 - My PhD - Part 2: Failure Logic

Having explained all about what reliability is, the problem is now how to apply it. We need to know how well systems such as computers, aeroplanes, ships, and so on, will perform in terms of reliability, but often it is far too expensive to find this out by experimentation. We cannot take a large representative sample and test them until they break, as the cost of doing so is prohibitive.

So what can we do? Well, instead of trying to directly find the reliability of the overall system, we consider that the failure of a system is actually due to the failure of the things which make up the system. We then use logic to break the system's failure down into the failures of the constituent components.

For instance, consider a computer. Someone who has no experience with computers turns it on, and your operating system (like Windows) summarily fails to appear on the screen. This, if you like, is the "top level" problem - the major issue which affects the user. If we had a think about how this might have happened, someone with slightly more computery knowledge might suggest the following things:
  1. The monitor isn't turned on
  2. The monitor isn't plugged in
  3. The computer isn't plugged in
  4. The monitor has failed
  5. The computer has failed
  6. The connection between the monitor and the computer has failed
  7. The electrical power supply to the computer's plug has failed
  8. Try turning it off and on again
Apart from the last suggestion, these are various things which would result in a computer not showing your OS. I think it's exhaustive, but maybe there are one or two things that I've missed. Any one of these things (each of which is known as an event) could cause the top level problem, as could any combination of them. This means that there is an "OR" relationship between the top level event and these events. The top event will happen if event 1 or event 2 or event 3, and so on, happens.

We can then apply the same principle to each of these events in turn. Before I move onto this, though, I just want to point out the other "relationship" between a set of events, similar to the "OR" one I mentioned above. Consider the top event "Something sets on fire". For this to happen, some things need to happen together:
  1. There needs to be something to burn - natural gas, wood, plastic, anything that can be a fuel for the fire.
  2. There needs to be enough oxygen for the fuel to burn, but not so much that it can't burn (there can be too much oxygen for a fire)
  3. There needs to be a source of ignition - a match, a flame, a spark.
If any one of these things is not in place, "something sets on fire" cannot happen. This means that there is an "AND" relationship between the top event and events 1 to 3: "Something sets on fire" if event 1 and event 2 and event 3 happen.
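
For those who think in code, the two relationships map straight onto Python's any() and all(). The event names below are shortened versions of the lists above, and the function names are just for illustration.

```python
def os_fails_to_appear(events):
    """OR relationship: the top event happens if any one of the events happens."""
    return any(events.values())

def something_sets_on_fire(fuel, oxygen, ignition):
    """AND relationship: the top event happens only if all three events happen."""
    return all([fuel, oxygen, ignition])

computer_events = {
    "monitor not turned on": False,
    "monitor not plugged in": False,
    "computer not plugged in": True,    # one cause is enough
    "monitor failed": False,
    "computer failed": False,
    "connection failed": False,
    "power supply failed": False,
}

print(os_fails_to_appear(computer_events))
print(something_sets_on_fire(fuel=True, oxygen=True, ignition=False))
print(something_sets_on_fire(fuel=True, oxygen=True, ignition=True))
```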

Usually using just these two relationships between events, we can break down a system's failure into the failures of smaller components of the system. If we were to do this using just words, as I have done so far, then this would become quite unwieldy. Thankfully, though, we have methods of presenting information on the failure of a system much more concisely. One of the most popular of these is a fault tree.

A fault tree is just an expression of the sort of logic that I have shown. It is made up of shapes which represent events and gates. An AND gate is shown by the first of the following symbols, while an OR gate is shown by the second of them:

Note the flat line at the bottom of the AND gate, and the curved one for the OR gate. The single line sticking out of the top of each gate links to the single higher-level event (such as the top events I mentioned earlier), while the several lines coming out of the bottom are for each of the inputting, lower-level events which are the causes of the top event.

The events themselves come in three types:
  • Top event - the overall problem which we are trying to solve. Examples include "car fails", "computer fails", "building collapses", and so on. There is only ever one of these in any fault tree. Shown as a rectangle with a description of the event inside.
  • Basic event - when we have broken the top event down into the combinations of smaller and smaller failures, and we reach the lowest level to which we wish to go, those at the lowest level are known as basic events. These are typically failures of small components. For the computer example, consider examples such as "processor fails", "memory chip fails", and so on. Note that these examples will themselves have smaller and smaller causes, such as overheating and so on, but we are not so interested in them as, for the common home user, once they know that a memory stick or a processor has failed, they will simply seek to replace it, without being too bothered about the nature or the cause of the failure. Shown as a circle with a description of the event inside.
  • Intermediate event - any event which combines basic or other intermediate events but is not the top event. Shown as a rectangle with a description of the event inside.
Using just these five simple symbols, we have a remarkably massive ability to explain the failure of a large system in terms of the small things which ultimately cause it. An example fault tree is shown below for the computer example:

(Click to embiggen it)

The triangle underneath "computer fails" is a symbol to indicate that the event there has lower causes, and they will be put on the fault tree, but I haven't got around to it yet/couldn't be arsed.
Apologies if the fault tree doesn't show up as well as I was hoping, but blame for that.

To finish up, then, you've been shown how reliability engineers commonly use logical methods to break down a big problem into smaller and smaller problems, linked by OR and AND relationships. These are commonly displayed on Fault Trees, which have been explained to you (and if you don't understand them, it's your fault, not mine). If you want to read more, do some searching on google and read some articles. I'm not a library. Here's a fairly crap wikipedia article which doesn't explain them terribly well for the layman.

The next blog will explain some mathsy stuff about probabilities. What has been shown here is a qualitative method - we find the causes of a problem, but without assigning likelihoods to any of them. Fault trees provide a nifty quantitative method of finding the probability of the top event by using those of the basic events. Tune in whenever I write it for more interesting information on my current research and job!
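
As a small taste of the quantitative side, here is a sketch of how a fault tree might be evaluated in Python. It assumes that all basic events are independent and that no basic event appears more than once in the tree, which real fault trees often violate; the tree layout and the probabilities are made up for the example.

```python
import math

def probability(node):
    """Return the probability of a fault tree node.

    A node is either a number (the probability of a basic event) or a
    pair of the form (gate, children), where gate is "AND" or "OR".
    """
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [probability(child) for child in children]
    if gate == "AND":
        # All inputs must happen together.
        return math.prod(probs)
    if gate == "OR":
        # One minus the chance that none of the inputs happens.
        return 1.0 - math.prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate: {gate}")

# Hypothetical tree: the top event occurs if either of two basic events
# occurs, or if two further basic events occur together.
tree = ("OR", [0.01, 0.02, ("AND", [0.1, 0.2])])
print(round(probability(tree), 6))
```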

Friday, 23 January 2009

32 - My PhD - Part 1: Reliability

I've not posted in ages. Sorry. Been stupidly busy with my PhD.

Because I get asked so much about it, I thought I'd try to provide here (in as many posts as it takes) an explanation of exactly what the hell I'm up to. And besides, it's my blog and I can do what I like. Plus, I have to try to prove that I'm not still obsessing about gays.

So, first things first: Reliability. This is the core, the centre, of my work and its application is quite important. Reliability is a property of an item, such as (drawing inspiration from those things immediately around me) a phone, computer, mug, and so on. If properly defined and investigated, it allows us engineers to assess how likely the phone, computer, oil tanker, spaceship, etc. is to fail after a given amount of time.

Now, for commercial companies that sell phones, computers, etc., it can be a part of the marketing: "Reliable" is a very good selling point for such things. For instance, if you knew that one specific computer in a shop was 99% likely to still be fully operational in five years' time, you'd probably be tempted to buy it over the other ones.

Hopefully you can see that reliability is just a probability that changes over time: while my computer has a 100% chance of working when I buy it, that probability reduces over time, so it could be 95% likely to work in a year's time, 90% in two years, and so on.
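
One common way of writing down a probability that falls away over time is an exponential curve. The sketch below assumes that shape purely for illustration (real components can follow quite different distributions) and picks the failure rate so that the chance of still working after one year is 95%.

```python
import math

# Assumed constant failure rate per year, chosen so that R(1) = 0.95.
FAILURE_RATE = -math.log(0.95)

def reliability(years):
    """Probability that the computer is still working after `years` years,
    under the assumed exponential model."""
    return math.exp(-FAILURE_RATE * years)

for t in (0, 1, 2, 5):
    print(t, round(reliability(t), 4))
```

Under this assumption the two-year figure comes out a shade over 90%, in line with the rough numbers above.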

If the computer does fail, then we can take it to a shop and get it fixed. If we consider the life of a computer (assuming that I don't just get shot of it after the third time of failure!), then it is a cycle of working - failed - working - failed and so on. The proportion of the time that it spends in the "working" state over its lifetime is known as its availability. The proportion of the time it spends in the "failed" state over its lifetime is called - unsurprisingly - its unavailability.
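
As a worked example, availability can be computed from a simple log of how long the computer spent in each state. The log below is invented, with times in days, and the function name is only illustrative.

```python
# Hypothetical life history of one computer: (state, duration in days).
history = [
    ("working", 400),
    ("failed", 10),
    ("working", 300),
    ("failed", 5),
    ("working", 285),
]

def availability(log):
    """Proportion of the total time spent in the "working" state."""
    total = sum(duration for _, duration in log)
    working = sum(duration for state, duration in log if state == "working")
    return working / total

a = availability(history)
print(a, 1 - a)   # availability and unavailability
```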

You might think that all this is rather trivial and unimportant. Think, though, for a second, about computers, iPods, cameras, which are expensive items for the consumer to buy. They need confidence that the item they buy won't stop working within a month, which is why laws are in place to protect the consumer from this, and companies offer extended warranties. For spaceships, space agencies need confidence that the phenomenally expensive and complex systems they are sending into space won't just pack up before the mission has even got off the ground (fnar fnar). Commercial aircraft operators need to know how to maintain their aircraft so that they won't have 250 deaths splashed across the papers the next day, with all the accompanying bad publicity, wailing relatives and compensation claims. Military aircraft operators need to know how to maintain their aircraft so they will have 250 deaths splashed across the papers the next day. Reliability is very, very important in today's world.

If you still don't believe me, consider the incidents where things have gone wrong - Piper Alpha, Chernobyl for starters, but on smaller scales, the IBM Deathstar, summer Tube breakdowns, various rail accidents. Failures, and systems with high failure rates (that is, the expected number of failures in a given period of time), can have devastating consequences.

Finding out, then, the probability of a system failing after a given period of time is critical. If the reliability is of a good enough standard, then the system can be sold, bought, used, etc. with a high degree of confidence. Similarly, if a company proves that it has done everything reasonably in its power to reduce failures and the effects of those failures, then it cannot be prosecuted if things go wrong (although you might argue that if things go badly wrong, then it probably wasn't doing everything properly anyway).

"Risk" is a figure which combines both the probability of an incident happening, and the consequences of that incident. These two figures are quantified in some way and then multiplied together to give a value for risk. For instance, most problems associated with gasholders (terrible article alert) have a moderately high probability but would not cause any fatalities. The big problem (the whole damn thing blowing up, for instance) has a very low probability (it has never happened in the UK, in fact) but has high consequences. As long as one figure is low when the other is high, the risk of anything will be quite low. Risk assessment, then, is a case of establishing likely values for the probability of an incident, and the consequences of it should it happen.

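The multiplication behind a risk figure can be sketched in a couple of lines of Python. The incidents, probabilities and consequence scores below are entirely made up (they are not real gasholder data), and scoring consequences on a simple numerical scale is an assumption of the example.

```python
# (probability per year, consequence score) for two hypothetical incidents.
incidents = {
    "minor leak": (0.1, 1),
    "whole thing blows up": (0.000001, 10000),
}

def risk(probability, consequence):
    """Risk is the probability of the incident multiplied by its consequence."""
    return probability * consequence

for name, (p, c) in incidents.items():
    print(name, risk(p, c))
```

Both figures come out small, one because the consequence is low and the other because the probability is.
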
Finding out the various reliability figures for a system is not easy, though. It is always possible to find out the values through testing. Usually, though, for things like computers, trains, spaceships, you can't just make a thousand units and test them all to destruction, and do statistics to find the answer. It's just far too expensive in terms of time, resources and money. Because of this, we have to resort to less interesting measures of estimating system reliability, which is what I will share with you in the next blog.