veryard projects - innovation for demanding change

abstraction

modelling by taking away

Abstraction makes a model more powerful and broad by distancing it a little from the specific business situation we started with. Abstraction clears away some of the specifics, and allows us to see the structure. If abstraction is taken to the extreme, no specifics are left at all. Except for skilled mathematicians, who are trained to understand highly abstract structures with no direct relationship to the real world, most people find such a model incomprehensible. Thus abstraction should be practised in moderation, leaving a sufficient amount of specifics for the model to remain meaningful.

On this page we discuss three modes of abstraction: aggregation, classification and generalization.

“Bruhl found some languages full of detail
Words that half mimic action; but
generalization is beyond them, a white dog is
not, let us say, a dog like a black dog.”

[Pound, Cantos XXVIII]

“A wit has said that one might divide mankind into officers, serving maids and chimney sweeps. To my mind this remark is not only witty but profound, and it would require a great speculative talent to devise a better classification. When a classification does not ideally exhaust its object, a haphazard classification is altogether preferable, because it sets imagination in motion.” [Kierkegaard]

Abstraction by Aggregation

Aggregation is the putting together of different things, to form a coherent whole. Thus, instead of talking about BUILDING and STREET and TOWN and COUNTY and POSTCODE, these may all be lumped together as ADDRESS. Or instead of talking about a CPU and a keyboard and a disk drive and a monitor, these may be bundled together into COMPUTER.
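As a minimal sketch of the ADDRESS example, the components can be bundled into a single aggregate type. The class name, the `one_line` method and the sample values are illustrative assumptions, not part of any particular model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Aggregate that bundles the separate address components into one unit."""
    building: str
    street: str
    town: str
    county: str
    postcode: str

    def one_line(self) -> str:
        # A decision made at the aggregate level (here, printing a label)
        # does not need to handle the components one by one.
        return ", ".join(
            [self.building, self.street, self.town, self.county, self.postcode]
        )


# Hypothetical occurrence of ADDRESS, for illustration only.
home = Address("12", "High Street", "Ealing", "Middlesex", "W5 1AA")
label = home.one_line()
```

The components remain available individually (for example `home.town`), so the aggregate hides detail without destroying it.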

People may be aggregated into teams or departments; products and services may be aggregated into compound products (thus, for example, when you buy a hi-fi, a one-year guarantee and repair service may be bundled in as part of the product price).

This form of abstraction is useful when decisions are made at the level of the aggregate or compound. However, sometimes what is required is the opposite of aggregation: the analysis of information down to data atoms. A data atom is the smallest unit of information, free of interpretation or ambiguity, that cannot be derived from any other information.

Information needs are usually compound rather than atomic. Complex information can be built up from data atoms, and conversely compound information can be decomposed into data atoms.

It is often asserted that a computerized information system should capture information at the atomic level, and then provide various levels of summary and aggregation, depending on the level of management interested, or on the purpose for which the information is required. This is indeed an attractive approach, because the structure of the atomic data is likely to be more stable than the structure of the day-to-day information needs compounded from it, but it is not always practicable. Sometimes the information does not exist in atomic form, and it would be a burden to the business to create or collect it.
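The idea of capturing atoms once and summarizing them at several levels can be sketched as follows. The record layout (region, product, amount), the figures and the `summarize` helper are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical atomic records: (region, product, amount).
atoms = [
    ("North", "widget", 120),
    ("North", "gadget", 80),
    ("South", "widget", 200),
    ("South", "widget", 50),
]


def summarize(records, key_fields):
    """Sum the amounts, grouped by the chosen positions of each record."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[i] for i in key_fields)
        totals[key] += record[2]
    return dict(totals)


# The same atoms serve different levels of summary.
by_region = summarize(atoms, key_fields=(0,))
by_region_product = summarize(atoms, key_fields=(0, 1))
grand_total = summarize(atoms, key_fields=())
```

The atomic list stays the same while the summaries vary with the grouping key, which is the stability argument made above.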

Consider the Post Office. The atomic entity is a single letter being posted. It is almost certainly impractical to capture data on each individual letter. However, information is required about the throughput and bottlenecks of letter handling. There are various ways of providing this information. Perhaps the letters are batched into bundles, and data are captured for each bundle. Or perhaps instead of tracking every single letter, a random sample of letters is selected and tracked in detail. Thus the simple and atomic entity type LETTER (with millions of occurrences per day) is not modelled; instead the model includes complex entity types such as LETTER BUNDLE or LETTER TRACKING SAMPLE (with far fewer occurrences per day, allowing monitoring and control processes to be carried out effectively).
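The two alternatives to tracking every letter might be sketched like this. The function names, the bundle size, the sample size and the use of integer letter identifiers are assumptions for illustration; a real sorting operation would define bundles and samples by its own procedures.

```python
import random


def bundle(letter_ids, bundle_size):
    """Group letters into LETTER BUNDLE occurrences of at most bundle_size."""
    return [
        letter_ids[i:i + bundle_size]
        for i in range(0, len(letter_ids), bundle_size)
    ]


def tracking_sample(letter_ids, sample_size, seed=0):
    """Pick a LETTER TRACKING SAMPLE: a random subset to be tracked in detail."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(letter_ids, sample_size)


letters = list(range(1000))           # stand-in for one batch of posted letters
bundles = bundle(letters, 100)        # 10 bundles instead of 1000 letters
sample = tracking_sample(letters, 10) # 10 tracked letters instead of 1000
```

Both `bundle_size` and `sample_size` are procedural choices rather than facts about letters, which is why entity types built on them are less stable than LETTER itself.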

The trouble with such aggregated or sampled entity types is that they are arbitrary. Whereas the entity type LETTER would provide stability to the information model, because it is fundamental to the business of the Post Office, the compound entity types are not stable. The Post Office could want to change its bundling mechanism, or its sampling mechanism, and thereby invalidate the definition of the entity type in question. Therefore a system designed on the basis of LETTER BUNDLE or LETTER TRACKING SAMPLE is less flexible, more vulnerable to changes in the business procedures, than a system designed on the basis of LETTER.

An alternative is to design a computerized information system to break the data down into data atoms. This is already done with some marketing systems, where you start with the total sales figures, and then use statistics from market research to break these figures down. Clearly such a breakdown will be an approximation, but perhaps other situations could be conceived in which the breakdown would be wholly accurate.
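A breakdown of this kind amounts to apportioning a known total in proportion to estimated shares. The sketch below assumes hypothetical customer segments and survey weights; the result is only as good as the shares it starts from.

```python
def break_down(total, shares):
    """Apportion a total across categories in proportion to estimated shares."""
    share_sum = sum(shares.values())
    return {name: total * share / share_sum for name, share in shares.items()}


# Hypothetical market-research weights per customer segment.
survey_shares = {"under_30": 2, "30_to_60": 5, "over_60": 3}

# Known total sales figure, broken down into estimated segment figures.
estimate = break_down(10_000, survey_shares)
```

The estimated figures add back up to the original total, but each individual figure remains an approximation.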

Abstraction by Classification

One of the ways we simplify and make sense of the world is by dividing people and things into classes. This reduces the amount of information we have to collect, maintain and consider. Classification of some sort is a necessary fact of life. We want to be able to discriminate between capable and incapable, safe and dangerous, polite and rude, even perhaps good and evil. But sometimes classification is arbitrary; and there may be as many classifications as there are interested parties.

Classification of physical objects is a useful form of abstraction. British Rail may classify its buildings into Stations, Offices, Workshops, and so on. This could result in a classifying entity type BUILDING TYPE, related to the entity type BUILDING. This would for example enable policy decisions (such as frequency of repainting) to be made once for each type of building, instead of once for each building, thus reducing the number of decisions that have to be made.
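The BUILDING TYPE example could be sketched as below, with the policy held once on the classifying entity and looked up from each building. The attribute `repaint_interval_years`, the interval values and the building names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildingType:
    """Classifying entity type: one occurrence per class of building."""
    name: str
    repaint_interval_years: int  # hypothetical policy attribute


@dataclass(frozen=True)
class Building:
    """Each building refers to its type rather than carrying its own policy."""
    name: str
    building_type: BuildingType

    def repaint_interval_years(self) -> int:
        # The decision is made once per type and inherited by every building.
        return self.building_type.repaint_interval_years


STATION = BuildingType("Station", repaint_interval_years=5)
WORKSHOP = BuildingType("Workshop", repaint_interval_years=10)

buildings = [
    Building("Station A", STATION),
    Building("Station B", STATION),
    Building("Workshop C", WORKSHOP),
]
intervals = {b.name: b.repaint_interval_years() for b in buildings}
```

Three buildings are covered by two policy decisions; with thousands of buildings the saving is correspondingly larger.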

Classification of people appears to offer similar benefits. If a teacher assumes that all eight-year-old boys are the same, if a recruitment officer assumes that all black women engineering graduates are the same, or if an advertising draftsman assumes that all consumers of chocolate are the same, this saves the trouble of considering each individual separately.

But the classification of people is ethically problematic.

Abstraction by Generalization

Generalization is the putting together of similar things, by selectively ignoring their differences. For example, photocopiers are not the same as computers, but a model might usefully lump them together as OFFICE EQUIPMENT ITEM.
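One way to picture "selectively ignoring their differences" is to project differing records onto the attributes they share. The field names and sample values below are assumptions for illustration.

```python
# Attributes assumed to be common to every OFFICE EQUIPMENT ITEM.
COMMON_FIELDS = ("serial_number", "location", "purchase_cost")


def generalize(record, kind):
    """Keep only the shared attributes, remembering what kind of item it was."""
    item = {field: record[field] for field in COMMON_FIELDS}
    item["kind"] = kind
    return item


# Hypothetical specific records with their own distinguishing attributes.
photocopier = {"serial_number": "P-001", "location": "Floor 2",
               "purchase_cost": 3000, "pages_per_minute": 40}
computer = {"serial_number": "C-042", "location": "Floor 3",
            "purchase_cost": 1200, "ram_gb": 16}

office_equipment = [
    generalize(photocopier, "photocopier"),
    generalize(computer, "computer"),
]
```

After generalization both items have the same shape, so they can be listed, costed and located together; the price is that `pages_per_minute` and `ram_gb` are no longer visible at this level.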

Generalization is a useful way of reducing the number of classes or types in a model. Generalization is unavoidable in building an information model, since without any generalization at all, each type would only have one occurrence.

The key question is not whether to generalize at all, but how much to generalize, and where to stop generalizing.

Lumpers and Splitters

Some people see the similarities between things more easily than they see the differences, thus they want to lump the objects being modelled into relatively few classes, to gain generality.

Others tend to see the differences more readily than the similarities, so they want to split the model, to divide the objects between a larger number of more narrowly defined classes, to gain precision.

For example, according to the lumpers, a subcontractor is basically the same as any other supplier, and therefore belongs in the same class, while the splitters would probably argue that there are significant differences between different groups of suppliers, justifying separate classes in the model.

A lumper is always keen to generalize, and produces models with a small number of broadly defined objects. A splitter is reluctant to generalize, and produces models with a large number of narrowly defined objects.

A lumper tends to learn too quickly, because having learned a solution to one problem, s/he wants to apply the same solution to all kinds of problems. A splitter tends to learn slowly, because s/he finds it hard to copy solutions from one problem to another.

Some lumpers attempt to justify their position by elevating generalization to an ideological principle: wherever it is possible to define a common type, it is necessary to do so, and definitions should always be reused and generalized whenever possible. This offends against pragmatism, since the costs of generalization are sometimes greater than the benefits.

Thus some care is needed in interpreting any guidance that methodologies may provide in such matters. If the developers of a particular methodology have concentrated on the dangers of splitteracy, then lumpers may be encouraged to get as generalized as they can. And if the methodology is full of warnings against excessive generalization, then splitters may feel justified in taking their approach to the extreme. If a methodology warns you against any extreme, that is no reason to go to the opposite extreme.

On many projects, there is an unavoidable tension between lumpers and splitters, a trade-off between generality and precision. There may be frequent arguments along such lines, although no single argument should be allowed to take much time. Such tension is healthy, because it prevents the model becoming either too detailed and specific, or too vague and universalized. It follows that a modelling team should balance lumpers with splitters. As a consultant, the author has had to play the lumper role on some occasions and the splitter role on other occasions, according to the temperaments of the other team members, to achieve the right balance.

In some cases, if the splitters disagree among themselves, then the lumper will win by default. For example, two splitters may both feel that there is an important difference between HOTEL and MOTEL. But when they cannot agree exactly what the difference is, then the lumper steps in and creates a single combined entity type. This is safer than basing a model structure on a disputed distinction, and still allows each splitter to create any partitions or subtypes that can be clearly defined, and whose relevance can be demonstrated.

The mechanism of class inheritance or entity subtypes can sometimes be used to maintain a compromise, since it allows these two opposite approaches to be (partially) reconciled. The similarities can be analysed and documented at the supertype (superclass) level, and the differences at the subtype (subclass) level. Thus both lumper and splitter may be satisfied. (But even this mechanism does not always succeed in bringing peace to a strife-torn modelling team.)
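Taking the supplier and subcontractor example from earlier, a sketch of this compromise might look as follows. The attributes `payment_terms_days` and `site_access_required`, and the sample names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Supplier:
    """Supertype: what the lumper says all suppliers have in common."""
    name: str
    payment_terms_days: int

    def describe(self) -> str:
        return f"{self.name} (payment in {self.payment_terms_days} days)"


@dataclass
class Subcontractor(Supplier):
    """Subtype: the difference the splitter wants recorded (assumed attribute)."""
    site_access_required: bool = False


stationer = Supplier("Acme Stationery", 30)
roofer = Subcontractor("Apex Roofing", 14, site_access_required=True)

# Lumper's view: both can be handled uniformly as suppliers.
summaries = [s.describe() for s in (stationer, roofer)]

# Splitter's view: the subtype still carries its distinguishing detail.
needs_access = [s.name for s in (stationer, roofer)
                if isinstance(s, Subcontractor) and s.site_access_required]
```

Code written against `Supplier` works for both occurrences, while code that cares about the distinction can still test for the subtype.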

Another possibility is to offer the lumper and splitter two different models, with a clearly defined mapping between them. This is appropriate where the lumper and splitter are associated with different business areas. For example, the accounts model of the business tends to be ‘lumpy’, with all sorts of things lumped into general entity types such as ASSET, whereas other areas of the business want to view different kinds of asset entirely separately. So we may have different data structures in the two different models, at different levels of abstraction, but constrained by some definite mapping between the two.
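A definite mapping between a detailed model and a lumpy accounts model can be sketched as a simple lookup. The detailed entity type names, the book values and the function `to_accounts_view` are assumptions for illustration.

```python
# Hypothetical mapping from the splitter's entity types to the accounts model.
DETAILED_TO_ACCOUNTS = {
    "VEHICLE": "ASSET",
    "BUILDING": "ASSET",
    "MACHINE": "ASSET",
}


def to_accounts_view(detailed_records):
    """Translate (entity_type, book_value) pairs into accounts-level totals."""
    totals = {}
    for entity_type, book_value in detailed_records:
        general_type = DETAILED_TO_ACCOUNTS[entity_type]  # KeyError if unmapped
        totals[general_type] = totals.get(general_type, 0) + book_value
    return totals


# Detailed (splitter) view of the same business facts.
detailed = [("VEHICLE", 20_000), ("BUILDING", 500_000), ("MACHINE", 30_000)]

# Lumpy (accounts) view derived through the mapping.
accounts = to_accounts_view(detailed)
```

Raising an error for an unmapped type is a deliberate choice in this sketch: it keeps the two models constrained by the mapping instead of letting them drift apart silently.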

Finally, let us consider an analogy. Two artists sit down to sketch a scene. One tries to capture everything in a few freehand strokes of charcoal. The other uses a fine pencil to try to capture the scene in meticulous detail. Which is the better artist? Is this a sensible question? The first artist is a lumper, the second is a splitter. But if we magnify a portion of the second artist’s sketch, it looks as rough as that of the first. Even the splitter cannot go on splitting to infinity, and for that matter, even the lumper cannot go on lumping for ever.
