MATHEMATICAL MODEL FOR THE
CONSTRUCTION OF CLADOGRAMS.
DRAFT COPY ONLY.
3/15/2008.
G. William Moore, MD, PhD.
From the Pathology and Laboratory Medicine Service,
Veterans Affairs Maryland Health Care System, Baltimore, Maryland;
Department of Pathology, The Johns Hopkins Medical Institutions,
Baltimore, Maryland; and
Department of Pathology, University of Maryland
School of Medicine, Baltimore, Maryland.
http://www.gwmoore.org/mathclad.htm
Preliminaries:
http://www.gwmoore.org/mathcl00.pdf
Chapter 1:
http://www.gwmoore.org/mathcl01.pdf
Chapter 2:
http://www.gwmoore.org/mathcl02.pdf
Chapter 3:
http://www.gwmoore.org/mathcl03.pdf
Chapter 4:
http://www.gwmoore.org/mathcl04.pdf
Chapter 5:
http://www.gwmoore.org/mathcl05.pdf
Chapter 6:
http://www.gwmoore.org/mathcl06.pdf
Chapter 7:
http://www.gwmoore.org/mathcl07.pdf
Chapter 8:
http://www.gwmoore.org/mathcl08.pdf
Chapter 9:
http://www.gwmoore.org/mathcl09.pdf
Chapter 10:
http://www.gwmoore.org/mathcl10.pdf
Send comments and correspondence to: Dr. G. William Moore,
George.Moore4@va.gov
See also:
http://www.netautopsy.org/gwmcv.htm
United States Government Work, uncopyrighted, public-domain,
supported by National Institutes of Health Predoctoral Traineeship,
1967-1971, North Carolina State University at Raleigh, NC. This document
does not necessarily represent the views or policies of any United States
Government agency. This document is provided "as is", without warranty
of any kind, express or implied, including but not limited to the warranties
of merchantability, fitness for a particular purpose and noninfringement.
In no event shall the authors be liable for any claim, damages or other
liability, whether in an action of contract, tort or otherwise, arising from,
out of, or in connection with the document or the use or other dealings
made with the document. Published as:
Moore GW.
A Mathematical Model for the Construction of Cladograms.
North Carolina State University at Raleigh, NC.
Institute of Statistics. Mimeograph Series No. 731 (1971).
Under the direction of H. R. Van der Vaart.
ABSTRACT.
A dendrogram is any tree-like diagram for representing relationships
among sets of organisms. A phenogram is any dendrogram which has been
calculated from initial data according to a computer algorithm.
Such an algorithm is called a phenogram algorithm. A cladogram is any
dendrogram whose progressively thicker branches correspond to more ancient
common ancestors. There are currently over a dozen different phenogram
algorithms in the literature, which may give different dendrograms for the
same input data. This dissertation addresses itself to the problem of which
phenogram algorithm, if any, yields the true cladogram as its result.
The biological phenomenon of cladogram formation in nature is stated
in terms of the rigorous, axiomatic foundations of Williams' biocosm.
Simple models for ancestry, pedigrees, reproductive isolation,
and tree-like diagrams are developed for the statement of the problem.
Sources of input data, especially data from molecular characters,
are discussed. Three evolutionary hypotheses for the long-term behavior
of these characters are presented, and seven phenogram algorithms
are evaluated in terms of these evolutionary hypotheses.
The unweighted-pair-group phenogram algorithm of Sokal and Michener
is found to require the weakest evolutionary hypothesis.
A MATHEMATICAL MODEL
FOR THE CONSTRUCTION OF CLADOGRAMS.
by
George William Moore.
A thesis submitted to the Graduate Faculty of
North Carolina State University at Raleigh
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
BIOMATHEMATICS PROGRAM
RALEIGH
1971
APPROVED BY
__________________________________ __________________________________
__________________________________ __________________________________
__________________________________
Chairman of the Advisory Committee
BIOGRAPHY.
The author was born in 1945, in Detroit, Michigan. He graduated
from Highland Park High School, Highland Park, Michigan, in 1963.
He entered the University of Michigan in September, 1963,
and received a Bachelor of Science degree with high honors
in Cellular Biology in June, 1967.
While an undergraduate, he was elected for membership
in the national honor societies of Phi Kappa Phi and Phi Beta Kappa.
In September, 1967, he was awarded a National Institutes of Health
Predoctoral Traineeship, and entered the Graduate School of North Carolina
State University. During his graduate career, the author worked as a teaching
assistant in the Departments of Zoology and Mathematics.
ACKNOWLEDGMENTS.
I wish to express my appreciation to Dr. Mary B. Williams for her guidance
and helpful suggestions during the course of this investigation and in the
preparation of this manuscript. I also wish to acknowledge the assistance
given me by the other members of my graduate committee:
Dr. H. R. van der Vaart, Dr. Henry Schaffer, Dr. Harvey Charlton, and
Dr. Donald Huisingh.
TABLE OF CONTENTS.
Abstract.
Title page.
Biography.
Acknowledgments.
LIST OF TABLES.
LIST OF FIGURES.
1. INTRODUCTION.
1.1 Purpose of this Dissertation.
1.2 How to Read This Dissertation.
1.3 Chapter Organization.
2. MATHEMATICAL MODELS IN BIOLOGY.
2.1 Structural Versus Empirical Models.
2.2 The Axiomatic Method in Biology.
2.3 Biological Versus Mathematical Definitions.
2.4 Model Theory in Biology.
2.5 Flow of Reasoning in This Dissertation.
2.6 Informal Set Theory.
2.7 Set Relationships.
2.8 Special Sets.
2.9 A Precaution.
2.10 Set Operations.
2.11 Mathematical Relations.
3. PRINCIPLES OF GENEALOGY.
3.1 First Principles.
3.2 Ancestor-or-Equal-to
3.3 Pedigrees.
3.4 Common Pedigrees.
3.5 The Monoparental Biocosm.
4. OPERATIONAL TAXONOMIC UNITS AND ISOLATION.
4.1 Operational Taxonomic Units.
4.2 The Superpartition.
4.3 Isolation.
4.4 Sampling from Isolated Sets.
4.5 Cladistic Partitions.
5. DENDROGRAMS AND CLADOGRAMS.
5.1 The Dendrogram.
5.2 Cardinality of a Binary Dendrogram.
5.3 The Dendrogram as a Monoparental Biocosm.
5.4 The Cladogram.
6. SOURCES OF INPUT DATA.
6.1 The Phenetic Matrix.
6.2 Comparison of Character States.
6.3 Direct Techniques for Obtaining a Phenetic Matrix.
6.4 The Immunodiffusion Technique.
6.5 Solution for the Phenetic Matrix.
6.6 Inversion of the Primary Solution Matrix.
7. DIVERGENT EVOLUTION IN MOLECULAR CHARACTERS.
7.1 Trouble with Genetic Load.
7.2 Low Selection at the DNA Level.
7.3 Low Selection at the Protein Level.
7.4 Low Selection and Divergent Evolution.
7.5 The Parsimony Hypothesis.
7.6 Divergence Hypothesis.
7.7 The Superphenetic Matrix.
7.8 Formulas for a Superphenetic Matrix.
7.9 Nesting Levels in a Dendrogram.
7.10 Weighted Average Superphenetic Matrix.
7.11 Choice of Cladogram Junctures.
7.12 Divergence-in-Mean.
7.13 Divergence and Uniform Evolution.
8. THE STRUCTURE OF PHENOGRAM ALGORITHMS.
8.1 Iteration Structures.
8.2 Refinements of a Partition.
8.3 Agglomerative and Divisive Iteration Structures.
8.4 Phenogram Algorithms Generate Dendrograms.
8.5 Cladogram Functions.
9. EVALUATION OF PHENOGRAM ALGORITHMS.
10. LIST OF REFERENCES.
11. APPENDICES.
12. ANNOTATIONS, EXPLANATORY NOTES.
1. INTRODUCTION.
1.1 Purpose of this Dissertation.
In 1963, Sokal and Sneath published their controversial text,
Principles of Numerical Taxonomy (PNT).
On pages 189-194, PNT discusses the construction of tree-like diagrams
for representing taxonomic relationships among sets of organisms,
or taxonomic units. A dendrogram is any tree-like diagram.
A phenogram is any dendrogram which has been calculated from data
about the taxonomic units according to a computer algorithm.
PNT presents six different phenogram algorithms, or methods
for computing a dendrogram from input data. The authors then evaluate
these methods on the basis of their intuition and experience.
Their evaluation reflects a rather prevalent attitude among biologists
who use computers that if one runs enough data, one is able to judge
the method.
For a biologist who understands his material and has a good feeling
for what the computer does, this attitude is by and large correct.
Trouble starts when this biologist confronts a new data source.
Since he doesn't really understand what makes the technique "work",
he may waste a great deal of time getting a "feel" for his new material.
Often this is a consequence not of carelessness on the biologist's part,
but because biological principles themselves are logically "mushy"
in a few key places. This dissertation employs a manner of inquiry
which overcomes the uncertainties in the intuitive approach:
it derives phenogram algorithms from rigorously stated biological
first principles, using the methods of ordinary mathematical proof.
A cladogram is a dendrogram whose progressively thicker
branches correspond to progressively more ancient common ancestors.
It is distinguished from a phenogram in that a phenogram is the result
of a well-defined sequence of calculations, whereas a cladogram is a fact
of nature. The cladogram problem is the problem of specifying a process
of calculation by which input data about taxonomic units can be used
to infer a branching sequence of progressively more ancient common ancestors.
For a biologist, the cladogram problem consists of two major questions:
what kind of data to obtain and which phenogram algorithm to feed
these data into. The purpose of this dissertation is to explore these
two questions in detail. A model for cladogram formation in nature
is set up using the "biocosm" formalism of
Williams (1970); a formalism
for computer phenogram algorithms is set up using the ordinary tools
of set theory. Then the evolutionary model is used to rank the performance
of seven different phenogram algorithms. I conclude that
Sokal and Michener's (1958) "unweighted
pair-group method" is the best choice
for information about molecular characters summarized in matrix form.
1.2 How to Read this Dissertation.
This dissertation uses methods of ordinary mathematical reasoning
to evaluate the ability of seven phenogram algorithms to infer the true
cladogram subject to three evolutionary hypotheses. Any biologist who has
read Sokal and Sneath's (1963) PNT and has
a general interest in numerical taxonomy might find this dissertation
of sufficient interest to skim, skipping all formalized definitions,
theorems, and proofs.
The mathematical portion of this dissertation is much more tedious
than the descriptive text, and is not worth the expenditure of effort
for most biological readers. Even a reader who intends to study
the mathematical results might want to skip the proofs and the lemmas
(specialized, nonintuitive theorems introduced solely to facilitate
later proofs). A high school course in plane geometry, some college-level
algebra, and a good deal of patience are the only prerequisites.
Background in elementary set theory would be helpful, but not essential.
All the mathematical and set theory terminology is explained
in this dissertation, and collected for reference in
Appendix 11.1.
Each mathematical statement in this dissertation is prefaced by an informal
resume of its contents and purpose, as an aid to the reader.
Finally, proofs are provided for all but the most obvious theorems,
to verify the validity of each theorem.
1.3 Chapter Organization
This dissertation progresses in three major phases. The first phase
develops enough mathematical machinery for the next two phases.
The second phase discusses which data are appropriate for the solution
of the cladogram problem. The third phase develops the seven phenogram
algorithms of this dissertation and evaluates their performance
with respect to three different evolutionary hypotheses.
Chapters two through five constitute the first phase of this dissertation. Chapter two develops the general area of mathematics which will be employed in this dissertation: the theory of sets and relations. Chapter three develops well-known concepts of genealogy
in a notation appropriate for subsequent chapters of this dissertation.
Chapter four develops the concepts of Operational Taxonomic Units (OTUs)
and isolation between OTUs. Chapter five develops the concepts
of dendrogram and cladogram.
Chapters six and seven constitute the second phase of this dissertation.
Chapter six discusses several sources of molecular data commonly used
for the cladogram problem, and shows how these sources can be reduced
to a phenetic matrix. Chapter seven suggests that molecular data
are an appropriate source for phenetic matrixes, because of the low level
of selection ("non-Darwinian evolution") on individual molecular characters.
Chapter seven suggests three evolutionary hypotheses for the evolution
of molecular characters, and demonstrates some simple consequences
of these hypotheses for use in later chapters.
Chapters eight and nine constitute the third phase of this dissertation.
Chapter eight develops a mathematical machinery for phenogram algorithms,
and chapter nine uses this machinery to evaluate each of seven phenogram
algorithms with respect to the three evolutionary hypotheses described
in chapter seven. Chapter nine is followed by the list of references
and three appendices.
Appendix 11.1 is a glossary of all primitives and defined terms employed
in this dissertation. Appendix 11.2 is a chart which details the flow
of reasoning employed in this dissertation. Appendix 11.3 is a computer
flowchart for the unweighted pair-group (Sokal and Michener, 1958)
phenogram algorithm.
2. MATHEMATICAL MODELS IN BIOLOGY
2.1 Structural versus Empirical Models
Biology is a young science in its use of mathematical models.
Few areas of biology depend upon mathematics as more than
a superficial aid for summarizing observations. Genetics
is one of the few major areas of biology some of whose first principles
are described by a genuine mathematical model. Most mathematical models
in biology do an incomplete job. They are either
(i) empirical models, or
(ii) structural models
but rarely both.
An empirical model is a mathematical model which has been constructed
largely to fit a particular class of data or observations.
Characteristically, this kind of model is used for fitting curves
through biological data under assumptions such as "linearity",
"normal distribution", etc., without much biological basis
for knowing whether these conditions indeed hold. An empirical model
often solves for population or environmental parameters which
have little or no basic biological meaning. Often, an empirical model
gives the experimenter a false sense of meaning, when he has in fact
done little more than translated one pile of numbers into another,
smaller pile.
A structural model falls at the other extreme. This kind of model starts
from "obvious" first principles and works forward toward observations.
Unfortunately, there are many perfectly plausible candidates
for the status of "first principles" which are false, and because
the formulation of a structural model is often so remote
from actual observations, it is very difficult to separate the plausible
and true statements from the plausible and false statements. Consequently,
a combination of quite plausible basic assumptions may lead to ridiculous
conclusions. Another common shortcoming among structural models is that
they often lead to conclusions which either cannot be observed
experimentally or are of no interest to the biologist.
Biology is still very much an experimental science, and most good biologists
reserve their credence for results closely supported by good data,
the more the better. There is a widespread failure among biologists
to recognize that a good mathematical model should not only fit observations,
but also arise from truly biological first principles. In this dissertation,
I develop Williams' (1970) structural model, and demonstrate how this
structural model leads to a well-known empirical model which is already
in use among practicing biologists. The axioms for
Williams' (1970)
model are quite simple and straightforward. Stated informally:
no organism is a parent of itself, and ancestry is unidirectional.
From these beginnings plus the axioms of set theory, I construct
a system for evaluating seven phenogram algorithms, some of which
are in current use by numerical taxonomists.
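As an informal illustration of the two axioms just stated, the following
Python sketch checks a small parent relation. The function names and the
example pairs are hypothetical, and the reading of "ancestry is
unidirectional" as "no organism is its own ancestor" is an assumption made
for this sketch only; the formal axioms are given in chapter three.

```python
def no_self_parent(parent_pairs):
    """Axiom I (informal): no organism is a parent of itself."""
    return all(p != c for p, c in parent_pairs)


def ancestry_is_unidirectional(parent_pairs):
    """Assumed reading of Axiom II: no organism is its own ancestor,
    i.e. the parent relation contains no cycle."""
    children = {}
    for p, c in parent_pairs:
        children.setdefault(p, set()).add(c)

    def descendants(start):
        # Collect everything reachable from `start` along parent -> child.
        seen, stack = set(), list(children.get(start, ()))
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(children.get(x, ()))
        return seen

    return all(p not in descendants(p) for p in children)


# Hypothetical example data: (parent, child) pairs.
pedigree = {("Abraham", "Isaac"), ("Isaac", "Jacob"), ("Isaac", "Esau")}
cyclic = pedigree | {("Jacob", "Abraham")}

print(no_self_parent(pedigree), ancestry_is_unidirectional(pedigree))
print(ancestry_is_unidirectional(cyclic))
```

The second relation is rejected because following parent-to-child links
from Abraham eventually leads back to Abraham.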
The achievement of this dissertation is that it starts from
purely structural principles and derives computing algorithms
which are already known to work. Until recent years,
the construction of phenograms from biological data
has depended almost exclusively upon empirical models.
A bewildering variety of phenogram algorithms confronts the
interested biologist (
Sørenson, 1948;
Sneath, 1957;
Sokal and Michener, 1958;
Rogers and Tanimoto, 1960;
Lockhart and Hartman, 1963;
Edwards and Cavalli-Sforza, 1964;
Camin and Sokal, 1965;
Fitch and Margoliash, 1967;
Moore et al., 1969;
Dayhoff, 1969;
Kluge and Farris, 1969),
but there are virtually no guidelines as to the conditions
under which these algorithms can be trusted to give meaningful results
beyond the intuitive judgment of the investigator. There is even
a widespread feeling that just because "the computer does it",
the result is somehow more objective than in conventional biological studies
(Sokal and Sneath, 1963, p. 49;
Rogers et al., 1967).
The first attempts to understand the conditions
under which phenogram algorithms yield a cladogram
were computer simulations
(Camin and Sokal, 1965;
Farris, 1970).
Computer simulations employ "made-up" data consistent with known properties,
and evaluate a phenogram algorithm on the basis of its performance
with these made-up data.
This is a step ahead of the experiential approach of
Sokal and Sneath (1963, pp. 189-194),
who base their judgment on data from the real, biological world,
with unknown properties. Computer simulation can be used to reject
an algorithm if the algorithm fails to reconstruct the original cladogram.
But computer simulation can never be used to accept an algorithm,
because it is always possible that the algorithm would fail
for an untried data set, even though it had been successful
on all data sets tried theretofore. The only way to show that
an algorithm always works is by mathematical proof.
In 1968, Estabrook presented a method for restricting the choice
of dendrograms to a small subset which must include the true
cladogram (see also Hendrickson, 1968). Estabrook solved the cladogram
problem subject to the "maximum parsimony" hypothesis (see Chapter Seven)
and the following additional restrictions:
(i) the (discrete) character states for each OTU are known;
(ii) the ancestral sequence of character states is known,
and is irreversible; and
(iii) the ultimate ancestral state happened only once.
Estabrook's work marks a major turning point in the literature
of numerical taxonomy, because it employs ordinary methods
of mathematical proof (comparable to the methods of this dissertation),
rather than the "heuristic" (i.e., inspirational, but not necessarily
reliable) approaches of other authors. There are several shortcomings
in Estabrook's paper. First, the restrictions placed upon the
true cladogram are not satisfied in most taxonomic investigations;
ordinarily, the investigator does not know the ancestral sequence
of character states, and is not even sure that this sequence is irreversible.
Second, Estabrook does not demonstrate the uniqueness of his result;
it is perfectly possible that there are two or more distinct dendrograms
satisfying Estabrook's conditions for a particular data set -- the true
cladogram and other, quite different dendrograms.
For reasons detailed in chapter seven, this dissertation does not employ
a maximum parsimony hypothesis of evolution; rather, it develops
three "divergence" hypotheses which all correspond to the intuitive idea
that a more ancient ancestral separation for a pair of OTUs results
in a greater dissimilarity value for that pair of OTUs.
This dissertation demonstrates that any dendrogram generated
by the appropriate phenogram algorithm is a true cladogram,
and that this cladogram is unique. The phenogram algorithm
which emerges as the best (that is, the unweighted pair-group method of
Sokal and Michener, 1958) is guaranteed to arrive at the correct solution
after a relatively small, finite number of steps.
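For orientation, the general idea of an unweighted pair-group computation
can be sketched in Python as follows. This is the common textbook
formulation, not a transcription of the flowchart in Appendix 11.3; the
function name, the data layout, and the dissimilarity values are
assumptions chosen for illustration. The example values follow the
divergence idea: pairs with a more ancient separation are given a larger
dissimilarity.

```python
from itertools import combinations


def upgma(labels, dist):
    """Textbook unweighted pair-group clustering (a sketch).

    labels: list of OTU names.
    dist:   dict mapping frozenset({a, b}) -> dissimilarity of OTUs a and b.
    Returns a nested tuple giving the order in which clusters were joined.
    """
    # Each cluster is (tree, members); start with one cluster per OTU.
    clusters = [(name, (name,)) for name in labels]

    def average(c1, c2):
        # Unweighted: every OTU pair across the two clusters counts equally.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2), key=lambda pr: average(*pr))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Hypothetical dissimilarities for four OTUs: A and B separated most
# recently, C earlier, D earliest.
d = {
    frozenset({"A", "B"}): 2.0,
    frozenset({"A", "C"}): 6.0,
    frozenset({"B", "C"}): 6.0,
    frozenset({"A", "D"}): 10.0,
    frozenset({"B", "D"}): 10.0,
    frozenset({"C", "D"}): 10.0,
}
tree = upgma(["A", "B", "C", "D"], d)
print(tree)
```

In this sketch the loop joins two clusters per pass, so n OTUs are fully
joined after n - 1 passes.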
2.2. The Axiomatic Method in Biology.
The basic mathematical tool which is employed in this dissertation
is the method of axiomatics. Axiomatics is different from conventional
mathematical derivations in the biological literature in that it is
entirely self-contained. Except for the axioms of set theory
(which could be included if necessary), the only assumptions
in this dissertation are summarized as two, simple axioms.
All other statements in this dissertation have been proved
in terms of these two axioms using only the methods of ordinary
mathematical proof. Similarly, the only undefined terms,
or primitives, in this dissertation are "the set of all organisms"
and the relation "is a parent of". All other terms in this dissertation
have been defined in terms of these two primitives using only
the grammar of logic. The strict, deductive structure of an
axiomatic theory allows one to move easily from a theorem
to the prior statements which were used to prove it (and similarly
from a defined term to the prior terms by which it was defined).
Since every theorem is proved in terms of prior theorems
and the unproved axioms, and similarly since every term is defined
in terms of prior terms and the undefined primitives, an axiomatic theory
is noncircular.
The methods of this dissertation are those of
naive axiomatics,
that is, the standards of logical statement and proof employed
by most mathematicians most of the time. Naive axiomatics
is contrasted to the more cumbersome
formalized axiomatics
of
Woodger's (1937) The Axiomatic Method in Biology.
2.3. Biological versus Mathematical Definitions.
It is important for the biological reader to distinguish between
his intuitive concept of definition and the mathematical concept
of definition. In biology, a new term is defined in terms
of previously defined terms and concepts in general use.
A biological definition is rarely required to correspond
exactly
to what the author really means. The usual practice is for the definition
to "come close" to what the author really means. Then the author proceeds
to enumerate the "exceptions" to the definition. In mathematics,
a definition means exactly what it says and says exactly what it means.
The term which is being defined and the phrase which is used to define
it are completely interchangeable in any statement subsequent
to the definition. This is often not possible for biological definitions.
Because of the presence of exceptions in biological definitions,
arbitrary substitution of the term-being-defined and the defining-terms
may readily lead to a logical absurdity.
Woodger (1952, pp. 219-252),
for example, has taken two "respectable" biological definitions of species,
translated them into logical terms, and proved they lead to an absurdity.
The reason he can do this is that the biological definitions he uses
were never really meant to be
exactly
true, merely
approximately
true.
Definitions in this dissertation are mathematical definitions.
The biological reader should be continually on the alert
for this distinction. Each defined term in this dissertation
means nothing more nor less than the sequence of terms of which
the definition consists. I have tried to define terms which correspond
to intuitive biological notions, but in cases where I have failed
to capture the intuitive biology, the mathematical content
of the definition takes precedence. In chapter four, for example,
I define "is isolated from" to correspond roughly to the
biologist's concept of reproductive isolation.
Because of shortcomings in the biological notion, I have found
it necessary to adopt a more stringent definition of my own.
This mathematical definition may or may not correspond to a particular
biologist's notion of isolation, but when I employ the term,
"is isolated from" in subsequent discussion, I make reference
to my definition only. If a biologist "disagrees" with my concept
of "is isolated from" (i.e., does not feel that it corresponds
to his intuitive notion of isolation), then any theorem which
I prove using "is isolated from" may be false if he substitutes
his intuitive notion of isolation for my notion of "is isolated from";
my theorems are guaranteed to be true only for the concept as it is defined.
2.4. Model Theory in Biology.
Model theory (Robinson, 1965) is a branch
of mathematics concerned with the correspondence of mathematical models
to the "real world". A model theory theory is an axiomatic theory,
such as will be developed in this dissertation; biologists ordinarily call
this a "model". A model theory model is some system
in the real world to which the model theory theory corresponds;
biologists ordinarily call this simply the "real world".
Whenever we construct a model theory theory corresponding
to some biological process, we would like this model theory theory
to have a model theory model in the real, biological world.
The problem in actually executing this plan is that there are
differing views as to what the real, biological world actually is:
(i)
Some statements about the biological world are agreed upon by all
because they are tautologous. For example, Axiom I (chapter three)
states that "no organism is a parent of itself". All biologists agree
to this (so long as "organism" and "is a parent of" are understood
in the conventional sense) because it is inherent in what we mean by
the term "parent".
(ii)
Some statements about the biological world are true
because they can be proven by experiment. For example, the statement
"some viruses lack DNA" can be proven true by finding a single virus
which lacks DNA: say, the Rous sarcoma virus (an RNA virus).
(iii)
Unfortunately, many biological statements are neither inherently
tautologous nor verifiable by experiment. For example, the statement
"all viruses have either DNA or RNA or both" is accepted by most biologists,
but it cannot be proved until every single virus particle (including every
foreseeable particle which might be called a "virus") is analyzed.
(iv)
A final shortcoming of "biological reality" as biologists think
of it is that many important techniques required for biology
are not really part of the traditional subject matter of biology.
For example, a statement of the form "if x is true and y is true,
then x is true" is not really a part of biology, although any biologist
would accept it. This statement belongs to logic. In order to achieve
anything more than the most trivial conclusions, we need not only logic,
but also some of its more sophisticated derivatives (set theory,
real number theory, linear algebra).
This dissertation studiously avoids
unprovable biological statements, such as those of type
(iii),
but since you can't get something for nothing, the dissertation
doesn't give you any very useful conclusions which apply
unconditionally.
For example, this
dissertation turns out to be fairly useless to any biologist not willing
to accept some form of divergent evolution. All of the theorems in
Chapter nine
have the form (roughly speaking):
"if evolution is reasonably divergent, then such-and-such algorithm
gives a true cladogram". The biologist who is willing to commit
him/herself to one of the hypotheses of divergent evolution
developed in this dissertation can use the theorems proved
in Chapter nine;
the biologist who doesn't accept any of these hypotheses
of divergence will find a chapter full of theorems which (though true)
don't apply to his/her version of biological reality.
The model theory model for the axiomatic theory of this dissertation
is any biological world in which
Axiom I
and Axiom II
and the axioms of set theory are true.
Axioms I and II should be obvious to any biologist;
the axioms of set theory should be acceptable to biologists.
The only "doubtful" axioms are ones which apply to infinite sets
(conspicuously, the
Zorn Lemma,
chapter three);
since no statement in this dissertation requires infinite sets
(although many permit it), and since the finite counterparts
of the infinite-set-axioms are undisputed, the set theory part
of the real world should be acceptable to biologists. The most
serious hazard in the use of set theory is that of misunderstanding
(review Section 2.3;
see Section 2.9, "A Precaution").
Suppose a theorem turns out to be false in the real world.
According to model theory, the only way a theorem can be false
in the real world is if one or more of the axioms used to prove
that theorem is false (assuming that the proofs are free of
logical errors). The deductive structure of an axiomatic theory
permits one to locate a false axiom or collection of axioms
simply by working back from a false theorem which was proved
from that axiom or axioms. The beauty of an axiomatic theory
is that even if some theorem turns out to be false, then
it is possible to rescue value from the axiomatic theory
by locating the false axiom and changing it.
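The "working back" described above can be pictured as a search through a
table recording which statements each theorem was proved from. The Python
sketch below is only an illustration: the function name and the theorem
labels are placeholders, not the actual theorems of this dissertation
(for the real flow of reasoning, see Appendix 11.2).

```python
def axioms_behind(statement, proved_from):
    """Return the axioms on which `statement` ultimately rests.

    proved_from maps each theorem to the statements used in its proof;
    anything that is not a key of proved_from is treated as an axiom.
    """
    found, stack, seen = set(), [statement], set()
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if s in proved_from:
            # A theorem: keep working back through its premises.
            stack.extend(proved_from[s])
        else:
            # Not proved from anything: an axiom.
            found.add(s)
    return found


# Hypothetical dependency table (placeholder names).
proved_from = {
    "Theorem 1": ["Axiom I"],
    "Theorem 2": ["Axiom I", "Axiom II"],
    "Theorem 3": ["Theorem 1", "Theorem 2"],
}
print(sorted(axioms_behind("Theorem 1", proved_from)))
print(sorted(axioms_behind("Theorem 3", proved_from)))
```

If "Theorem 1" in this toy table were found false in the real world, the
search would point to "Axiom I" alone; a false "Theorem 3" would leave
both axioms as candidates.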
2.5. Flow of Reasoning in this Dissertation.
Theorems are proved in this dissertation for three purposes:
(i)
to indicate the correspondence of the axiomatic theory to known situations
in the real world;
(ii)
to prove the formal validity of the several computer phenogram algorithms
under appropriate hypotheses of divergent evolution (the ultimate purpose
of this dissertation); or
(iii)
to act as stepping stones toward the proof of theorems with purposes
(i)
or
(ii).
A great many theorems fall into class
(iii),
and thus are difficult to justify or explain intuitively at the time
they are being proved. I have endeavored to place such theorems
where they fit best, but often this is a rather lame effort.
The least intuitive theorems are called "lemmas", and are proved
immediately before they are used with a remark to the reader
that their content may be obscure.
Appendix 11.2
provides a map
of the entire flow of reasoning employed in this dissertation.
2.6. Informal Set Theory.
What is set theory? It is just that--the theory of "sets" or "bunches"
or "collections" of objects, no matter what the objects may be.
A bunch of grapes is a set. A pile of leaves is a set.
A collection of coins is a set. However, there is no need for things
to be organized into physically contiguous bunches to be considered a set.
We can talk about the set of all automobiles in the United States,
or the set of all red automobiles, or the set of all red 1959 Chevrolets.
It is not likely that any of these sets will ever be assembled
in a single, neatly-stacked bunch, yet we can still talk about them
as "sets" in abstract discussions. We can even talk about sets
like the "set of all married bachelors" or the "set of all female widowers".
These sets are a little bit different than the previous ones, because
they don't contain any members at all. Such sets are called "the empty set"
or "the null set", and are just as valid for us to talk about
as sets which contain very many members.
A convenient way to represent a set is as a line enclosure
about the members of the set. Thus, the "set of the first
two sons of Abraham" (see Figure 2.1) might be represented
as follows:

Figure 2.0.

Figure 2.1.
These enclosure diagrams are called Euler-Venn diagrams.
(Strictly speaking, we have not literally disinterred
Isaac and Ishmael and drawn a line about them; instead,
we have used their names. We shall employ this convention
throughout the dissertation.) On the printed line,
it is not very convenient to draw these enclosures,
so we use "curly bracket" notation. The set of the first two sons
of Abraham is the set, {Isaac,Ishmael}. Two conventions
should be kept in mind when dealing with set notation:
the order in which the members are listed doesn't matter;
and repeated members don't count. Thus, the set of the
first two sons of Abraham might equally well be represented as:
{Ishmael,Isaac}
or: {Ishmael,Isaac,Isaac,Ishmael,Ishmael,Isaac}
and it would still be set theoretically equivalent to our original notation.
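As an illustration only (Python is not part of the original text), the built-in set type of that language follows the same two conventions. The variable names below are chosen for this sketch.

```python
# Three ways of writing the set of the first two sons of Abraham.
original = {"Isaac", "Ishmael"}
reordered = {"Ishmael", "Isaac"}
repeated = {"Ishmael", "Isaac", "Isaac", "Ishmael", "Ishmael", "Isaac"}

# Order of listing does not matter, and repeated members do not count.
all_equal = (original == reordered) and (reordered == repeated)
print(all_equal)       # True
print(len(repeated))   # 2
```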
For sets having lots of members, it isn't even very convenient to list
all the members. For example, the "set of all living human beings"
would be much too cumbersome to write out every time we needed it,
even if we could get a list of all those three billion names.
Therefore, we use a notation called "definition by abstraction".
The set of all living human beings is written:
{x : x is a living human being}
and is read "the set of all x such that x is a living human being".
The letter x is called a "dummy variable",
because it has no significance other than as an internal defining device.
We could equally well use a building block:
{□ : □ is a living human being}
or a Chinese character:
{中 : 中 is a living human being}
and we would still be talking about exactly the same set.
Definition by abstraction is especially convenient for infinite sets,
where the curly bracket notation would be actually impossible.
For example:
{x : x is an even number}
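Definition by abstraction corresponds loosely to a "set comprehension" in a programming language. The following Python sketch is an illustration only. Because a program cannot list an infinite set, it assumes a finite range of candidates (0 through 9), and it represents the full set of even numbers by a membership test, here called is_even, a name invented for this sketch.

```python
# Assumed finite stand-in for "all numbers"; the true set is infinite.
candidates = range(10)

# The dummy variable may be renamed without changing the set.
evens_x = {x for x in candidates if x % 2 == 0}
evens_y = {y for y in candidates if y % 2 == 0}
print(evens_x == evens_y)   # True
print(sorted(evens_x))      # [0, 2, 4, 6, 8]

def is_even(n):
    """Membership test standing in for the infinite set {x : x is an even number}."""
    return n % 2 == 0

print(is_even(1234))        # True
```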
Finally, we can always give a name to particular sets in order
to avoid having continually to rewrite entire sets, either
in curly-bracket or definition-by-abstraction notation.
For example, we set up the following shorthand for use in this chapter:
U = {x : x is Abraham or a descendant of Abraham}
A = {Abraham}
B = {Isaac,Jacob,Esau}
C = {x : x is Isaac or a son of Isaac}
D = {x : x is a descendant of Abraham}
E = {Abraham,Isaac,Jacob}
2.7 Set Relationships.
Figure 2.2 shows each of the sets U, A, B, C, D, and E
in the Euler-Venn notation.
These sets are not unrelated to one another:
two of them are exactly the same,
and almost all of them share one or more elements in common.
Set U is called the "universe",
because it contains all the members which we will be dealing
with in our discussion of sets A, B, C, D, and E.

Figure 2.2. Sets: Abraham and his descendants.
Set U is the set of Abraham and all his descendants.
Sets A, B, C, D, and E are subsets of U.
Dots (...) indicate additional members which are not named explicitly.
The simplest operation in set theory is the membership operation.
We say, for example, that "Isaac is a member of set B".
The notation for this operation is:
Isaac ∈ B.
The opposite of membership is nonmembership.
We say "Abraham is not a member of the set B",
and employ the notation:
Abraham ~∈ B.
Two sets are equal if they have exactly the same members.
For example, set B equals set C, denoted:
B = C,
because every member of set B is a member of set C,
and every member of set C is a member of set B.
One set is a subset of another
if every member of the first is also a member of the second.
For example, set B is a subset of set D, denoted:
B ⊆ D,
because every member of set B is also a member of set D.
It is also true that:
B ⊆ C,
because every member of set B is also a member of set C.
The relationship between sets B and D is a special subset relation,
namely, "is a proper subset of".
We say that set B is a proper subset of set D, denoted:
B ⊂ D,
because every member of set B is also a member of set D,
and there is at least one member of set D which is not a member of set B.
Clearly, it is not true that
B ⊂ C;
every member of set B is also a member of set C,
but there is no member of set C which is not also a member of set B.
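The relationships described in this section can be checked mechanically. The Python sketch below is an illustration only and rests on a loud assumption: the universe is cut down to the five members named in the text, whereas the real set U contains further, unnamed members. The father links in PARENT are inferred from the sets already defined (Isaac and Ishmael as sons of Abraham; Jacob and Esau as the sons of Isaac, which is what B = C requires). The names PARENT and descends_from_abraham are invented for this sketch.

```python
# Assumed toy universe: only the five members named in the text.
PARENT = {"Isaac": "Abraham", "Ishmael": "Abraham",
          "Jacob": "Isaac", "Esau": "Isaac"}

def descends_from_abraham(x):
    """Walk up the parent links; True if Abraham is reached."""
    while x in PARENT:
        x = PARENT[x]
        if x == "Abraham":
            return True
    return False

U = {"Abraham"} | set(PARENT)
A = {"Abraham"}
B = {"Isaac", "Jacob", "Esau"}
C = {x for x in U if x == "Isaac" or PARENT.get(x) == "Isaac"}
D = {x for x in U if descends_from_abraham(x)}
E = {"Abraham", "Isaac", "Jacob"}

print("Isaac" in B)        # True:  Isaac ∈ B
print("Abraham" not in B)  # True:  Abraham ~∈ B
print(B == C)              # True:  B = C
print(B <= D)              # True:  B ⊆ D
print(B <= C)              # True:  B ⊆ C
print(B < D)               # True:  B ⊂ D (Ishmael is in D but not in B)
print(B < C)               # False: B and C are equal
print(all(S <= U for S in (A, B, C, D, E)))  # True: all are subsets of U
```

In Python, <= tests "is a subset of" and < tests "is a proper subset of", which mirrors the distinction between ⊆ and ⊂ drawn above.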
Throughout this dissertation, we shall use N() to denote the
cardinality, or number of members, of the set in parentheses.
For example:
N(A) = 1
N(B) = 3
N(Ø) = 0.
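As a small illustration (not part of the original text), the cardinality notation N() behaves like Python's len applied to a set. The function name N is borrowed from the text; EMPTY is a name chosen for this sketch.

```python
def N(s):
    """Cardinality: the number of members of the set s."""
    return len(s)

A = {"Abraham"}
B = {"Isaac", "Jacob", "Esau"}
EMPTY = set()   # stands for Ø

print(N(A), N(B), N(EMPTY))   # 1 3 0
print(N({"Isaac", "Isaac"}))  # 1, since repeated members don't count
```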
2.8 Special Sets.
The set A is given a special name, because it has only one member:
it is called a singleton. Any member of the universe can be made
into a singleton simply by enclosing it in curly brackets.
Here are some of the singletons which can be created
from our universe, U:
{Abraham}
{Isaac}
{Ishmael}
{Jacob}
{Esau}
A singleton is not the same thing as the single member it contains.
For example, it is not true that Abraham is the same as {Abraham}.
In fact, each of the following singletons is different from the others:
{Abraham}
{{Abraham}}
{{{Abraham}}}
{{{{Abraham}}}}
The first set is the "set of Abraham";
the second set is the "set of the set of Abraham"; etc.
One of the most important sets in set theory has no members at all.
It is called the empty set or the null set,
and is denoted by either one of the following notations:
{} or Ø
In set theory, all null sets are equal (because they all have
exactly the same members, namely, no members at all).
Some examples of the null set are:
{x : x is a married bachelor}
{x : x is a female widower}
{x : x is a man who walked on the moon in 1900 AD}
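The two kinds of special set described in this section, singletons and the null set, can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the original text: frozenset is used because Python only allows immutable sets to be members of other sets, and the records in people are hypothetical sample data invented for the sketch, as are the helper names is_bachelor and is_widower.

```python
from collections import namedtuple

# A singleton is not its member; nesting gives a different set each time.
abraham = "Abraham"
s1 = frozenset([abraham])   # {Abraham}
s2 = frozenset([s1])        # {{Abraham}}
s3 = frozenset([s2])        # {{{Abraham}}}
levels = [abraham, s1, s2, s3]
all_distinct = all(levels[i] != levels[j]
                   for i in range(len(levels))
                   for j in range(i + 1, len(levels)))
print(all_distinct)                   # True
print(abraham in s1, abraham in s2)   # True False

# Hypothetical sample records, invented for this sketch.
Person = namedtuple("Person", "name sex married widowed")
people = [
    Person("P1", "male", True, False),
    Person("P2", "male", False, False),
    Person("P3", "female", False, True),
    Person("P4", "male", False, True),
]

def is_bachelor(p):
    return p.sex == "male" and not p.married

def is_widower(p):
    return p.sex == "male" and p.widowed

married_bachelors = {p for p in people if p.married and is_bachelor(p)}
female_widowers = {p for p in people if p.sex == "female" and is_widower(p)}

# Both definitions are contradictory, so both sets are the same null set.
print(married_bachelors == female_widowers == set())   # True
```

One caution for readers who try this: in Python the literal {} denotes an empty dictionary rather than the empty set, so set() or frozenset() plays the role of Ø.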
2.9 A Precaution.
One of the hazards of the English language is a very imprecise
little word: the word "in".
This word is probably responsible for the infinite frustrations
of the beginning student of set theory, and may even be responsible
for the inability of taxonomists to apply
2.10 Set Operations.
2.11 Mathematical Relations.
3. PRINCIPLES OF GENEALOGY.
3.1 First Principles.
3.2 Ancestor-or-Equal-to
3.3 Pedigrees.
3.4 Common Pedigrees.
3.5 The Monoparental Biocosm.
4. OPERATIONAL TAXONOMIC UNITS AND ISOLATION.
4.1 Operational Taxonomic Units.
4.2 The Superpartition.
4.3 Isolation.
4.4 Sampling from Isolated Sets.
4.5 Cladistic Partitions.
5. DENDROGRAMS AND CLADOGRAMS.
5.1 The Dendrogram.
5.2 Cardinality of a Binary Dendrogram.
5.3 The Dendrogram as a Monoparental Biocosm.
5.4 The Cladogram.
6. SOURCES OF INPUT DATA.
6.1 The Phenetic Matrix.
6.2 Comparison of Character States.
6.3 Direct Techniques for Obtaining a Phenetic Matrix.
6.4 The Immunodiffusion Technique.
6.5 Solution for the Phenetic Matrix.
6.6 Inversion of the Primary Solution Matrix.
7. DIVERGENT EVOLUTION IN MOLECULAR CHARACTERS.
7.1 Trouble with Genetic Load.
7.2 Low Selection at the DNA Level.
7.3 Low Selection at the Protein Level.
7.4 Low Selection and Divergent Evolution.
7.5 The Parsimony Hypothesis.
7.6 Divergence Hypothesis.
7.7 The Superphenetic Matrix.
7.8 Formulas for a Superphenetic Matrix.
7.9 Nesting Levels in a Dendrogram.
7.10 Weighted Average Superphenetic Matrix.
7.11 Choice of Cladogram Junctures.
7.12 Divergence-in-Mean.
7.13 Divergence and Uniform Evolution.
8. THE STRUCTURE OF PHENOGRAM ALGORITHMS.
8.1 Iteration Structures.
8.2 Refinements of a Partition.
8.3 Agglomerative and Divisive Iteration Structures.
8.4 Phenogram Algorithms Generate Dendrograms.
8.5 Cladogram Functions.
9. EVALUATION OF PHENOGRAM ALGORITHMS.
10. LIST OF REFERENCES.
Gregg JR.
The Language of Taxonomy.
New York: Columbia University Press. 1954.
Gregg JR.
Finite Linnaean structures.
Bull Math Biophys. 1967;29:191-206.
1. Arnheim N, Prager EM, Wilson AC.
Immunological prediction of sequence differences among proteins.
Chemical comparison of chicken, quail, and pheasant lysozymes.
J Biol Chem. 1969 Apr 25;244(8):2085-2094.
PMID: 4889463.
2. Barnabas J, Goodman M, Moore GW.
Evolution of hemoglobin in primates and other therian mammals.
Comp Biochem Physiol B. 1971 Jul 15;39(3):455-482.
PMID: 5001181.
3. Bastian H.
And Then Came Man.
New York: Viking Press. 1964.
4. Bott R, Mayberry JP.
Matrices and Trees.
In: Morgenstern O (ed). Economic Activity Analysis.
New York: John Wiley and Sons, Inc. 1954.
5. Britten RJ, Kohne DE.
Repeated sequences in DNA.
Hundreds of thousands of copies of DNA sequences
have been incorporated into the genomes of higher organisms.
Science. 1968 Aug 9;161(841):529-540.
PMID: 4874239.
6. Brues AM.
7. Buck RC, Hull DL.
8. Burnet FM.
A certain symmetry: histocompatibility antigens
compared with immunocyte receptors.
Nature. 1970 Apr 11;226(5241):123-126.
PMID: 5461774.
11. APPENDICES.
11. ADDITIONAL READINGS.
ANNOTATIONS. EXPLANATORY NOTES.
1/12/2008, by G. William Moore, MD, PhD
This webpage is an attempted reconstruction of my PhD thesis.
I have corrected some obvious typographical errors, and I hope not introduced
too many new ones. I employ boldface for emphasis, rather than
underline. I have changed slash-through (not available in Unicode)
to ~ (tilde) for negations. Where possible, I have added PubMed
hyperlinks to references. The explanatory notes provided herewith include
etymologic explanations, additional references, and other notes showing
progress in this and related health-science areas over the past
three decades.
Hebrew: abrhm: Abraham.
Hebrew: yxahk: Isaac.
Hebrew: uxa: Esau.
Hebrew:
Hagar.
Ishmael.
Sarah.
Rebekah.
Keturah.
Zimran.
Jokshan.
Medan.
Midian.
Ishbak.
Shuah.
Nebaioth.
Kedar.
Abdeel.
Mibsam.
Mishma.
Dumah.
Massa.
Hadad.
Tema.
Jetur.
Naphish.
Kedema.
Greek: κλαδακι: kladaki: sprig, twig.
Greek: κλαδεμα: kladema: pruning.
Greek: δενδρο, δεντρο: dendro, dentro: tree.
Greek: φαινομαι: fainomai: to appear like, to be visible.
algorithm: named after al-Khwarizmi,
mediaeval Persian mathematician.
Greek: γραμμα: gramma: letter.
Greek: γραφη: grafy: writing, scripture.
Greek: αγια γραφη: Agia Grafy: Holy Scripture.
Greek: δια: dia: by, for.
6.6 Inversion of the Primary Solution Matrix.
Using the Leontief Matrix. V. Leontief, Nobel Laureate in Economics, 197x.
7.11 Choice of Cladogram Junctures.
Greek: κλαδακι: kladaki: sprig, twig.
Dr. Mary B. Williams.
Dr. H. R. van der Vaart [d. 2002].
Dr. Henry Schaffer.
Dr. Harvey Charlton.
Dr. Donald Huisingh.
Prof. N. Rashevsky [d. 198x].
Last updated, 3/15/2008, by G. William Moore, MD, PhD