
3 EXPERT SYSTEMS

These have already been introduced in the lectures on the practical and you have built a small expert system – so this introduction is brief.

3.1 The Expert

An expert system aims to capture the knowledge of an expert on some subject and to store it as a set of predicates (called facts):

A is true
B is true, etc.

and implications (called rules):

A → C

or fuzzy implications:

A → C with probability p

(if A is true then C has probability p of being true).

For example, A might be a symptom (high temperature) and C might be a disease (flu).


An expert system usually starts with a team of a specialist (the domain expert) and a computer scientist, both highly motivated.

Facts and rules, based (usually) on experience, are given by the expert and built into a computer system which is successful for some demonstrations. For real problems the system is soon found to be less adequate, and a confidence problem arises.



If this is overcome, a process of tutorial interaction with the expert, studying more and more special cases, will refine the system.

It is very difficult to get the specialist to give up enough time. Special software can be used to reduce the expert's time and make the process acceptable (e.g. the Teiresias system), but such tools are not always successful; so there are few large working expert systems, although there are many small ones. Those that are successful are very highly specialised and have limited objectives.

It is rare to have more than one expert - they never agree!!


3.2 Inference engine

Expert systems are usually based on "rules of thumb" which are of the form:

if <antecedent> then <consequence> has probability p

or if <fact> then <hypothesis> has certainty c

(or similar terminology)

There is a non-trivial number of these rules, more than a novice in the subject would know.

In recent systems (at least) these rules are separate from the program, and treated as static knowledge for the system.


The program that applies these rules is often called an "inference engine" or inference machine.

(The term "knowledge" refers to facts and rules together.)

There are two main approaches in inference engines:

(a) Backward chaining

(b) Forward chaining

(a) Backward chaining. This helps in making deductions about the truth of a hypothesis or goal, e.g. "Has this patient measles?"

Example: Consider a simple set of rules.

Rules

1A B  C

2C D  E

3F  E

4F B  G

5G E  H

6B E  G


Variable facts: A true, B true and D true.

Let us assume we want to know if hypothesis G is true.

The inference engine first looks at the rules which might infer the truth of G i.e. 4 & 6.

Using rule 4 first:

G ← B & F   (writing the normal rule back to front)

But B is true, so

G ← F

It does not know F, so it next tries rule 6:

G ← B & E
  ← E         since B is true
  ← C & D     by rule 2
  ← C         since D is true
  ← A & B     by rule 1
  ← True      since A and B are true

So C is true, E is true and G is true (result).

If it did not know A, then since no rules infer A it would ask "Is A true?"

If the response is no, it would go on to ask about C, E and F in turn, i.e.

"Is C true?" and so on,

and G is found true if any answer is yes.
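This goal-driven search is easy to sketch in Python. The rule set and facts are those of the example; the USER_ANSWERS dictionary merely stands in for questions put to the user, and all names are illustrative rather than taken from any real shell:

    RULES = [(["A", "B"], "C"), (["C", "D"], "E"), (["F"], "E"),
             (["F", "B"], "G"), (["G", "E"], "H"), (["B", "E"], "G")]
    FACTS = {"A", "B", "D"}            # the variable facts above
    USER_ANSWERS = {"F": False}        # stands in for asking "Is F true?" at the terminal

    def backward_chain(goal, facts):
        """Goal-driven search: can 'goal' be established from the facts and rules?"""
        if goal in facts:
            return True
        rules_for_goal = [r for r in RULES if r[1] == goal]
        for antecedents, _ in rules_for_goal:
            if all(backward_chain(a, facts) for a in antecedents):
                facts.add(goal)        # remember the deduction for later goals
                return True
        if not rules_for_goal:         # nothing infers this fact, so ask the user
            return USER_ANSWERS.get(goal, False)
        return False

    print(backward_chain("G", set(FACTS)))   # True, via rules 6, 2 and 1 (rule 4 fails on F)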

(b) Forward chaining. The system deduces all the facts and new rules it can from the knowledge base, e.g. what diseases can this patient have? Let us assume A true, B true and D true are part of that knowledge base.


Rule 1: A & B → C
So C = True, since A and B are true.

Rule 2: C & D → E
So E = True, since C and D are true.

Rule 3: F → E
Nothing more can be deduced.

Rule 4: F & B → G
So F → G, since B is true (new rule T1).

Rule 5: G & E → H
So G → H, since E is true (new rule T2).
So F → H, by syllogism using rule T1 (new rule T3).

Rule 6: B & E → G
So G = True, since B and E are true.
So H = True (by rule T2).


This is not efficient for solving hypothesis-based problems. It is most useful when probability is added to the rules and there is no single hypothesis, for example finding the possible diseases which a patient might have, with a probability for each.

One difficulty is that it may generate very large numbers of rules - not easy to store or index. In theory it can discover new rules or facts not previously expected, but in practice very few new rules have been discovered by this method.
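The same rule set run data-driven is a shorter sketch (again illustrative Python; it deduces new facts only and does not attempt to derive new rules such as T1 to T3 above, which is how most simple forward chainers behave):

    RULES = [(["A", "B"], "C"), (["C", "D"], "E"), (["F"], "E"),
             (["F", "B"], "G"), (["G", "E"], "H"), (["B", "E"], "G")]

    def forward_chain(facts):
        """Data-driven search: fire every applicable rule until no new fact appears."""
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for antecedents, consequent in RULES:
                if consequent not in facts and all(a in facts for a in antecedents):
                    facts.add(consequent)
                    changed = True
        return facts

    print(sorted(forward_chain({"A", "B", "D"})))   # ['A', 'B', 'C', 'D', 'E', 'G', 'H']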

3.3 Explanation

To improve the man-machine interface, facilities are normally included to help explain to the user what the computer is doing. E.g. in the modified example (a) above, the user might ask "Why?" in response to a question:

System: Is A true?   [asked in the form of an English statement, e.g. "Has the patient a headache?"]

User: Why?


Then the system would explain, giving all facts and rules relevant to its question: e.g.

The question was asked because, if A were true, then since B is also true, by rule 1

A & B → C

C would be true; but D is true, and rule 2 is

C & D → E

so E would be true; but B is true, and rule 6 is

B & E → G

so G would be true, which is what you asked about.

All of this would be expressed in short English statements, stored with A, B, etc. in a file.

Then it asks the question again: Is A true?


How? If the user is given a final conclusion, e.g.

G is true

or more likely in practice

G has a high probability (0.8) of being true

then the user might ask "How?" and the system would give all the previous rules and facts used to obtain its conclusion.
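One simple way to support "How?" is to have the chainer record the rule firings that establish the conclusion; a minimal, illustrative sketch using the rule set of 3.2:

    RULES = [(["A", "B"], "C"), (["C", "D"], "E"), (["F"], "E"),
             (["F", "B"], "G"), (["G", "E"], "H"), (["B", "E"], "G")]

    def explain_how(goal, facts):
        """Backward-chain to 'goal', returning the list of rule firings used (or None)."""
        if goal in facts:
            return []                                 # a given fact needs no explanation
        for antecedents, consequent in RULES:
            if consequent == goal:
                steps = []
                for a in antecedents:
                    sub = explain_how(a, facts)
                    if sub is None:
                        break                         # this rule cannot be satisfied
                    steps += sub
                else:
                    return steps + [" & ".join(antecedents) + " -> " + consequent]
        return None                                   # goal cannot be established

    print(explain_how("G", {"A", "B", "D"}))
    # ['A & B -> C', 'C & D -> E', 'B & E -> G']

The reply to "How?" is then just this stored chain, printed as English statements.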


3.3.3 Search Tree or Graph

Solving a problem by backward or forward chaining can be illustrated as a search in a tree or graph representing the problem. For example, the goal-driven problem in 3.2 has the following tree:

From the top to find G there are two possible rules 4 or 6.

Rule 4 requires B (which is true) and F (which is unknown and no rules imply F). So Rule 4 leads nowhere.


Rule 6 requires B (true) and E, which becomes the root of its own subtree. The process is continued recursively, in this case to 5 levels, to reach a solution.

Note: this is an AND-OR tree, with both AND and OR branching. Most trees or graphs are simpler, with only OR branches (e.g. the graph of the Towers of Hanoi problem, figure 3.5 in the text, and the trees or graphs for games like chess).

Note also that the QPS example by Dr Loughlin is based on an AND-OR tree.

For more discussion on this topic see your textbook, Chapter 3.


3.4 Probability (or Certainty)

In practice, probability is nearly always an essential part of rules, and sometimes of facts, e.g.

if <f has probability Pf> then <c has probability Pc>, where Pc depends on Pf.

We examine a few of the problems:

(a) when f is made up of several facts and

(b) when c has a prior probability.

3.4.1 Combining fuzzy facts (symptoms)

After combination we can sometimes assume P(combined facts) = Pc.

(i) Consider f to be made up of two facts:

f = (f1 and f2)

Then Pf = Pf1·Pf2   (if f1 and f2 are independent of one another, i.e. no correlation)

Pf = Pf1 = Pf2   (perfect correlation)


Now usually there is some correlation, and often it is not known, but we can state limits (for positive correlation):

Pf1·Pf2 ≤ Pf ≤ min(Pf1, Pf2)

The lower limit holds because f is more likely if there is some correlation; the upper limit because P(f1 and f2) ≤ both Pf1 and Pf2.

[See Venn diagram below]

[Figure: Venn diagram]


The fuzzy-logic value min(Pf1, Pf2) is often used as the probability Pf (e.g. in PROSPECTOR, an expert system used in geology), because Pf is usually closer to Pf1 or Pf2 than it is to their product. (It is also used in PROSPECTOR because if there is any possibility of a "find" the exploration company wants to explore it further.)
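These limits are trivial to compute; a small illustrative sketch in Python:

    def p_and_bounds(p1, p2):
        """Limits on P(f1 and f2) when the (positive) correlation is unknown."""
        return p1 * p2, min(p1, p2)      # (no correlation, perfect correlation)

    print(p_and_bounds(0.7, 0.6))        # (0.42, 0.6); a PROSPECTOR-style system would take 0.6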

______

(ii) If f is (f1 or f2):

(a) If there is no correlation

Viewed in terms of a Venn diagram:

P(f1 or f2) = Pf1 + Pf2 − (overlap of f1 and f2)

The overlap is P(f1 and f2) = Pf1·Pf2 if there is no correlation,

so P(f1 or f2) = Pf1 + Pf2 − Pf1·Pf2

Proof in terms of logic:

Now f1 or f2 = ¬(¬(f1 or f2))
             = ¬(¬f1 and ¬f2)   (de Morgan)

So P(f1 or f2) = 1 − P(¬f1 and ¬f2)
               = 1 − P(¬f1)·P(¬f2)   if no correlation
               = 1 − (1 − Pf1)(1 − Pf2)
               = Pf1 + Pf2 − Pf1·Pf2

(b) If there is perfect correlation

Pf = Pf1 = Pf2

(c) Some correlation

P(f1 or f2) ≥ either Pf1 or Pf2   (see Venn diagram)
            ≥ max(Pf1, Pf2)

So Pf1 + Pf2 − Pf1·Pf2 ≥ Pf ≥ max(Pf1, Pf2)

because f is less likely if there is some correlation.

The value max(Pf1, Pf2) is often used when the correlation is unknown. The left-hand side is used to maximise the probability (as in PROSPECTOR).
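The corresponding limits for "or", again as a small illustrative sketch:

    def p_or_bounds(p1, p2):
        """Limits on P(f1 or f2) when the (positive) correlation is unknown."""
        return max(p1, p2), p1 + p2 - p1 * p2    # (perfect correlation, no correlation)

    print(p_or_bounds(0.7, 0.6))                 # (0.7, 0.88)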


(iii) Three facts A and B and C, with probabilities Pa, Pb and Pc respectively.

Let Y = A and B and C
Let X = A and B
so  Y = X and C

Then Pa·Pb ≤ Px ≤ min(Pa, Pb)
and  Px·Pc ≤ Py ≤ min(Pc, Px)

In the second inequality we use the lower bound of Px on the LHS and its upper bound on the RHS, so

Pa·Pb·Pc ≤ Py ≤ min(Pc, min(Pa, Pb))
Pa·Pb·Pc ≤ Py ≤ min(Pa, Pb, Pc)   (note the symmetry)

(iv) A or B or C

Let Y = (A or B) or C and let X = A or B.

So  max(Pa, Pb) ≤ Px ≤ Pa + Pb − Pa·Pb
and max(Px, Pc) ≤ Py ≤ Px + Pc − Px·Pc

so max(max(Pa, Pb), Pc) ≤ Py ≤ Pa + Pb + Pc − Pa·Pb − Pc·Pa − Pc·Pb + Pa·Pb·Pc

so max(Pa, Pb, Pc) ≤ Py ≤ ......   (note the symmetry)
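Because the two-fact limits can be applied pairwise, the three-fact (and n-fact) results above follow by folding, as this illustrative sketch shows:

    def and_bounds(ps):
        """Limits on P(f1 and ... and fn) with unknown positive correlation."""
        lower, upper = ps[0], ps[0]
        for p in ps[1:]:
            lower, upper = lower * p, min(upper, p)      # fold the two-fact limits
        return lower, upper

    def or_bounds(ps):
        """Limits on P(f1 or ... or fn) with unknown positive correlation."""
        lower, upper = ps[0], ps[0]
        for p in ps[1:]:
            lower, upper = max(lower, p), upper + p - upper * p
        return lower, upper

    print(and_bounds([0.9, 0.8, 0.7]))   # (0.504, 0.7)  = (Pa·Pb·Pc, min(Pa, Pb, Pc))
    print(or_bounds([0.9, 0.8, 0.7]))    # (0.9, 0.994)  = (max(Pa, Pb, Pc), the sum above)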

3.4.2 Prior probability

Let us assume fact f1 (like an epidemic of flu) implies consequence c (a patient having flu) with probability Pc; so P(f1 → c) = Pc.

Let this be the prior probability of c, called the "a priori" probability of c, before other facts (symptoms) are considered.

Note: the common notation for P(f1 → c) is P(c|f1).

Consider another independent fact f2 (a symptom).

Now the probability of c and f2 both being true is

P(c and f2) = Pf2·P(c|f2)
            = Pc·P(f2|c)   (by interchanging c and f2)


so P(c|f2) = Pc·P(f2|c) / Pf2      (Bayes' Theorem)

Now the likelihood ratio

LS = P(f2|c) / P(f2|¬c)

is the likelihood that the fact (symptom) f2 is due to consequence c (e.g. flu) compared to not-c (e.g. not-flu). This can often be estimated by an expert.

So P(c|f2) = LS·Pc / (LS·Pc + (1 − Pc))

Now P(c and f2) = P(c|f2)·Pf2
                = LS·Pc·Pf2 / (LS·Pc + (1 − Pc))


[Figure illustrating the variation of P(c|f2) with LS]
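A few values of the update formula above show the behaviour the figure illustrates (this assumes the reconstruction P(c|f2) = LS·Pc / (LS·Pc + 1 − Pc); the function name is illustrative):

    def posterior_given_present(prior, ls):
        """P(c|f2) = LS*Pc / (LS*Pc + (1 - Pc)): the positive-evidence update."""
        return ls * prior / (ls * prior + (1.0 - prior))

    for ls in (0.1, 1, 2, 10, 100):
        print(ls, round(posterior_given_present(0.1, ls), 3))
    # 0.1 -> 0.011, 1 -> 0.1 (prior unchanged), 2 -> 0.182, 10 -> 0.526, 100 -> 0.917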

Another similar formula can now be derived for further positive facts f3, f4 etc.

______

We also need to take account of the effect on c of a fact (symptom) not occurring (not f3), i.e. P(¬f3 → c); this alters the prior probability in a negative manner.

We can develop the mathematics exactly as before, with f2 replaced by ¬f3 and Pc replaced by a new prior probability

Pc' = P(c|f2)·Pf2

which equals the probability of f2 and c if f2 has probability Pf2.

Let LN = P(¬f3|c) / P(¬f3|¬c), the negative likelihood,


e.g. the likelihood of finding no mineral salt if there is a workable seam, compared with the same finding when there is no workable seam.

We find

P(c'|¬f3) = LN·Pc' / (LN·Pc' + (1 − Pc'))

and P(c' and ¬f3) = P(c'|¬f3)·P(¬f3)


3.4.3 Prospecting for a seam of mercury

Assume Pc = 0.1 to start (the prior probability of, e.g., a workable seam of mercury nearby).

Assume f2 is the presence of abundant mercury sulphate and f3 is the presence of abundant mercury sulphide.

Let Pf2 = 0.76 and P(¬f3) = 0.95 (i.e. a lot of sulphate, little sign of sulphide).

Also LS(f2) = 10; LN(¬f3) = 0.2 (given by experts).

Then P(c|f2) = LS·Pc / (LS·Pc + (1 − Pc)) = 1.0/1.9 ≈ 0.53, so the new prior is Pc' = P(c|f2)·Pf2 = 0.53 × 0.76 ≈ 0.40.

Then P(c'|¬f3) = LN·Pc' / (LN·Pc' + (1 − Pc')) = 0.08/0.68 ≈ 0.118, so P(c' and ¬f3) = P(c'|¬f3)·P(¬f3) ≈ 0.118 × 0.95 ≈ 0.11. Ans.
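A quick numerical check of this answer, using the update and weighting steps of 3.4.2 (the code and names are illustrative):

    def update(prior, likelihood_ratio):
        """Bayes update of a prior probability by a likelihood ratio (LS or LN)."""
        return likelihood_ratio * prior / (likelihood_ratio * prior + (1.0 - prior))

    pc = 0.1                              # prior: workable seam nearby
    pc1 = update(pc, 10) * 0.76           # sulphate evidence: LS = 10, Pf2 = 0.76
    pc2 = update(pc1, 0.2) * 0.95         # no sulphide:       LN = 0.2, P(not f3) = 0.95
    print(round(pc1, 2), round(pc2, 2))   # 0.4 0.11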


3.4.4 Certainty

Because people usually find the concept of probability difficult, most expert systems use "certainty" or a similar measure, varying from −1 to +1, or −100 to +100, or −5 to +5 [used by PROSPECTOR], or a similar range. These map onto probability as in the figure.

Human perception is either a little over-optimistic or a little over-pessimistic.


3.4.5 Errors

So there are many sources of error:

1. The probabilities Pf, Pc or the equivalent "certainties" are imprecise.

2. So are the likelihoods LS, LN.

3. The amount of correlation is unknown, so the formulas are imprecise.

4. Errors accumulate as more rules are used in a solution. After about 4 rules the "conclusion" may become meaningless.


3.4.6 Combinations and Complexity

One approach to this problem is to ask the expert for the probability of all combinations.

However, if there are n facts, the number of combinations is

n + C(n,2) + C(n,3) + C(n,4) + ...   (singles + pairs + 3 facts + 4 facts + ...)
≈ 2^n

which is very large. We say that this has a complexity of order 2^n, i.e. O(2^n),

e.g. n = 400 gives 2^n ≈ 3 × 10^120 (!!) cases.

The combinations of 3 facts alone have a complexity O(n^3), so even combinations of 3 facts are impractical.

Instead, rules of thumb are used as in 3.4.1 (i) and (ii), or independence is assumed (even when it is not correct). Sometimes this gives rise to impossible results (probability > 1).
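A quick illustration of these counts (standard Python library only):

    from math import comb

    n = 400
    print(f"{2 ** n:.1e}")      # about 2.6e+120 combinations of every size
    print(f"{comb(n, 3):.1e}")  # about 1.1e+07 combinations of exactly 3 facts, i.e. O(n^3)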


3.5 Some Expert Systems

Lists of expert systems are given in most textbooks. Note the number built at Universities - but this is changing rapidly.

The most successful systems are:

(a) DENDRAL (the first system), for deducing molecular structures from spectrograph data.

(b) PROSPECTOR, a mineral-deposit expert system, where knowledge is very imprecise and probability is therefore important.

(c) MYCIN, for medical diagnosis, based on about 450 rules for bacterial infections of the blood.

Its rules are simple, as in sections 3.2 to 3.4. It uses backward chaining. It uses a highly stunted version of English to express facts and rules. For probability it uses ad hoc rules which work fairly well in practice. (See any textbook for details; not covered on this course.)


EMYCIN (Empty MYCIN): the logical structure of MYCIN stripped of the rules and facts pertinent to blood infections. This is an Expert System Shell, the first such shell.

(d) XCON, used by DEC to decide on balanced configurations for VAX computers.


3.6 Constraints on Expert Systems (see 5.8 also)

(a) Acquisition of knowledge - finding the expert's time.

(b) Explanation - making the system friendly both to users and to the expert.

(c) Validation - a long process of checking and correcting.

(d) Special rules which are domain specific - successful systems contain a lot of special code.

(e) Probability - the ad hoc rules used to deal with correlation are domain specific.

(f) Domain boundaries - always imprecise, always needing extension (how in practice can we deal with 450,000 rules rather than 450?).


(g) Inadequate languages:
LISP - out of date, slow, poor interface.
PROLOG - lacking numerical facilities.
Object-oriented languages - not yet developed.
Expert system shells - very limited and slow.

(h) Time - knowledge changes with time.

(i) Image representation - e.g. a hook: an image is better than words, but how do we process images?

(j) Learning - progress on self-learning systems (e.g. neural nets) is promising but slow, and they are proving difficult as they develop.

(k) Complexity - complexity increases very quickly with the number of rules.

(l) Size - large systems become unstable and unpredictable.


3.10 Common Alternative Notation for Bayes' Law

Symptom f observed

We know (using the notation of Chapter 3) that

P(c|f) = P(f|c)·P(c) / [P(f|c)·P(c) + P(f|¬c)·P(¬c)]      (1)

Write Py = P(f|c), the probability of symptom f if disease c is present (y),
and  Pn = P(f|¬c), the probability of symptom f if disease c is not present (n).

Also let P = Pc, the prior probability of c,
and P' = P(c|f), the probability of c if f is then observed.

Then (1) becomes

P' = Py·P / (Py·P + Pn·(1 − P))      (2)


Symptom f not observed

If the symptom is not observed (¬f), then (1) becomes

P(c|¬f) = P(¬f|c)·P / [P(¬f|c)·P + P(¬f|¬c)·(1 − P)]      (3)

Then, if P' = P(c|¬f), since

P(¬f|c) = 1 − P(f|c) = 1 − Py
and P(¬f|¬c) = 1 − P(f|¬c) = 1 − Pn,

equation (3) becomes

P' = (1 − Py)·P / [(1 − Py)·P + (1 − Pn)·(1 − P)]      (4)
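Both forms in one short sketch (P, Py, Pn as defined above; names illustrative):

    def posterior(p, py, pn, observed=True):
        """Equation (2) if the symptom is observed, equation (4) if it is not."""
        if not observed:
            py, pn = 1.0 - py, 1.0 - pn      # substitute 1 - Py and 1 - Pn
        return py * p / (py * p + pn * (1.0 - p))

    print(round(posterior(0.1, 0.9, 0.2, observed=True), 3))    # 0.333: seeing f raises the prior 0.1
    print(round(posterior(0.1, 0.9, 0.2, observed=False), 3))   # 0.014: not seeing f lowers it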


3.7 Practical

The lectures on the practical are part of this course.

3.8 Quantitative Problem Solving

(separate lecture given on this).

3.9 Reading

See Textbook