MATHEMATICAL MODEL FOR THE
CONSTRUCTION OF CLADOGRAMS.
DRAFT COPY ONLY.
3/15/2008.
G. William Moore, MD, PhD.


From the Pathology and Laboratory Medicine Service, Veterans Affairs Maryland Health Care System, Baltimore, Maryland;
Department of Pathology, The Johns Hopkins Medical Institutions, Baltimore, Maryland; and
Department of Pathology, University of Maryland School of Medicine, Baltimore, Maryland.

http://www.gwmoore.org/mathclad.htm
Preliminaries: http://www.gwmoore.org/mathcl00.pdf
Chapter 1: http://www.gwmoore.org/mathcl01.pdf
Chapter 2: http://www.gwmoore.org/mathcl02.pdf
Chapter 3: http://www.gwmoore.org/mathcl03.pdf
Chapter 4: http://www.gwmoore.org/mathcl04.pdf
Chapter 5: http://www.gwmoore.org/mathcl05.pdf
Chapter 6: http://www.gwmoore.org/mathcl06.pdf
Chapter 7: http://www.gwmoore.org/mathcl07.pdf
Chapter 8: http://www.gwmoore.org/mathcl08.pdf
Chapter 9: http://www.gwmoore.org/mathcl09.pdf
Chapter 10: http://www.gwmoore.org/mathcl10.pdf


Send comments and correspondence to: Dr. G. William Moore, George.Moore4@va.gov
See also: http://www.netautopsy.org/gwmcv.htm

United States Government Work, uncopyrighted, public-domain, supported by National Institutes of Health Predoctoral Traineeship, 1967-1971, North Carolina State University at Raleigh, NC. This document does not necessarily represent the views or policies of any United States Government agency. This document is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of, or in connection with the document or the use or other dealings made with the document. Published as:

Moore GW.
A Mathematical Model for the Construction of Cladograms.
North Carolina State University at Raleigh, NC. Institute of Statistics. Mimeograph Series No. 731 (1971). Under the direction of H. R. Van der Vaart.

ABSTRACT.



A dendrogram is any tree-like diagram for representing relationships among sets of organisms. A phenogram is any dendrogram which has been calculated from initial data according to a computer algorithm. Such an algorithm is called a phenogram algorithm. A cladogram is any dendrogram whose progressively thicker branches correspond to more ancient common ancestors. There are currently over a dozen different phenogram algorithms in the literature, which may give different dendrograms for the same input data. This dissertation addresses itself to the problem of which phenogram algorithm, if any, yields the true cladogram as its result.

The biological phenomenon of cladogram formation in nature is stated in terms of the rigorous, axiomatic foundations of Williams' biocosm. Simple models for ancestry, pedigrees, reproductive isolation, and tree-like diagrams are developed for the statement of the problem. Sources of input data, especially data from molecular characters, are discussed. Three evolutionary hypotheses for the long-term behavior of these characters are presented, and seven phenogram algorithms are evaluated in terms of these evolutionary hypotheses. The unweighted-pair-group phenogram algorithm of Sokal and Michener is found to require the weakest evolutionary hypothesis.

A MATHEMATICAL MODEL
FOR THE CONSTRUCTION OF CLADOGRAMS.
by
George William Moore.
A thesis submitted to the Graduate Faculty of
North Carolina State University at Raleigh
in partial fulfillment of the
requirements for the Degree of
Doctor of Philosophy
BIOMATHEMATICS PROGRAM
RALEIGH
1971
APPROVED BY
__________________________________ __________________________________
__________________________________ __________________________________
__________________________________
Chairman of the Advisory Committee



BIOGRAPHY.


The author was born in 1945, in Detroit, Michigan. He graduated from Highland Park High School, Highland Park, Michigan, in 1963.

He entered the University of Michigan in September, 1963, and received a Bachelor of Science degree with high honors in Cellular Biology in June, 1967. While an undergraduate, he was elected for membership in the national honor societies of Phi Kappa Phi and Phi Beta Kappa.

      In September, 1967, he was awarded a National Institutes of Health Predoctoral Traineeship, and entered the Graduate School of North Carolina State University. During his graduate career, the author worked as a teaching assistant in the Departments of Zoology and Mathematics.

ACKNOWLEDGMENTS.


      I wish to express my appreciation to Dr. Mary B. Williams for her guidance and helpful suggestions during the course of this investigation and in the preparation of this manuscript. I also wish to acknowledge the assistance given me by the other members of my graduate committee: Dr. H. R. van der Vaart, Dr. Henry Schaffer, Dr. Harvey Charlton, and Dr. Donald Huisingh.

TABLE OF CONTENTS.


Abstract.
Title page.
Biography.
Acknowledgments.
LIST OF TABLES.
LIST OF FIGURES.


1. INTRODUCTION.

1.1 Purpose of this Dissertation.

1.2 How to Read This Dissertation.

1.3 Chapter Organization.



2. MATHEMATICAL MODELS IN BIOLOGY.

2.1 Structural Versus Empirical Models.

2.2 The Axiomatic Method in Biology.

2.3 Biological Versus Mathematical Definitions.

2.4 Model Theory in Biology.

2.5 Flow of Reasoning in This Dissertation.

2.6 Informal Set Theory.

2.7 Set Relationships.

2.8 Special Sets.

2.9 A Precaution.

2.10 Set Operations.

2.11 Mathematical Relations.



3. PRINCIPLES OF GENEALOGY.


3.1 First Principles.


3.2 Ancestor-or-Equal-to


3.3 Pedigrees.


3.4 Common Pedigrees.


3.5 The Monoparental Biocosm.



4. OPERATIONAL TAXONOMIC UNITS AND ISOLATION.

4.1 Operational Taxonomic Units.

4.2 The Superpartition.

4.3 Isolation.

4.4 Sampling from Isolated Sets.

4.5 Cladistic Partitions.



5. DENDROGRAMS AND CLADOGRAMS.

5.1 The Dendrogram.

5.2 Cardinality of a Binary Dendrogram.

5.3 The Dendrogram as a Monoparental Biocosm.

5.4 The Cladogram.



6. SOURCES OF INPUT DATA.

6.1 The Phenetic Matrix.
6.2 Comparison of Character States.
6.3 Direct Techniques for Obtaining a Phenetic Matrix.
6.4 The Immunodiffusion Technique.
6.5 Solution for the Phenetic Matrix.
6.6 Inversion of the Primary Solution Matrix.


7. DIVERGENT EVOLUTION IN MOLECULAR CHARACTERS.

7.1 Trouble with Genetic Load.
7.2 Low Selection at the DNA Level.
7.3 Low Selection at the Protein Level.
7.4 Low Selection and Divergent Evolution.
7.5 The Parsimony Hypothesis.
7.6 Divergence Hypothesis.
7.7 The Superphenetic Matrix.
7.8 Formulas for a Superphenetic Matrix.
7.9 Nesting Levels in a Dendrogram.
7.10 Weighted Average Superphenetic Matrix.
7.11 Choice of Cladogram Junctures.
7.12 Divergence-in-Mean.
7.13 Divergence and Uniform Evolution.


8. THE STRUCTURE OF PHENOGRAM ALGORITHMS.

8.1 Iteration Structures.
8.2 Refinements of a Partition.
8.3 Agglomerative and Divisive Iteration Structures.
8.4 Phenogram Algorithms Generate Dendrograms.
8.5 Cladogram Functions.


9. EVALUATION OF PHENOGRAM ALGORITHMS.



10. LIST OF REFERENCES.



11. APPENDICES.



12. ANNOTATIONS, EXPLANATORY NOTES.


1. INTRODUCTION.


1.1 Purpose of this Dissertation.



In 1963, Sokal and Sneath published their controversial text, Principles of Numerical Taxonomy (PNT). On pages 189-194, PNT discusses the construction of tree-like diagrams for representing taxonomic relationships among sets of organisms, or taxonomic units. A dendrogram is any tree-like diagram. A phenogram is any dendrogram which has been calculated from data about the taxonomic units according to a computer algorithm. PNT presents six different phenogram algorithms, or methods for computing a dendrogram from input data. The authors then evaluate these methods on the basis of their intuition and experience. Their evaluation reflects a rather prevalent attitude among biologists who use computers: that if one runs enough data, one is able to judge the method.

For a biologist who understands his material and has a good feeling for what the computer does, this attitude is by and large correct. Trouble starts when this biologist confronts a new data source. Since he doesn't really understand what makes the technique "work", he may waste a great deal of time getting a "feel" for his new material. Often this is a consequence not of carelessness on the biologist's part, but of the fact that biological principles themselves are logically "mushy" in a few key places. This dissertation employs a manner of inquiry which overcomes the uncertainties in the intuitive approach: it derives phenogram algorithms from rigorously stated biological first principles, using the methods of ordinary mathematical proof.

A cladogram is a dendrogram whose progressively thicker branches correspond to progressively more ancient common ancestors. It is distinguished from a phenogram in that a phenogram is the result of a well-defined sequence of calculations, whereas a cladogram is a fact of nature. The cladogram problem is the problem of specifying a process of calculation by which input data about taxonomic units can be used to infer a branching sequence of progressively more ancient common ancestors. For a biologist, the cladogram problem consists of two major questions: what kind of data to obtain and which phenogram algorithm to feed these data into. The purpose of this dissertation is to explore these two questions in detail. A model for cladogram formation in nature is set up using the "biocosm" formalism of Williams (1970); a formalism for computer phenogram algorithms is set up using the ordinary tools of set theory. Then the evolutionary model is used to rank the performance of seven different phenogram algorithms. I conclude that Sokal and Michener's (1958) "unweighted pair-group method" is the best choice for information about molecular characters summarized in matrix form.

1.2 How to Read this Dissertation.



This dissertation uses methods of ordinary mathematical reasoning to evaluate the ability of seven phenogram algorithms to infer the true cladogram subject to three evolutionary hypotheses. Any biologist who has read Sokal and Sneath's (1963) PNT and has a general interest in numerical taxonomy might find this dissertation of sufficient interest to skim, skipping all formalized definitions, theorems, and proofs.

The mathematical portion of this dissertation is much more tedious than the descriptive text, and is not worth the expenditure of effort for most biological readers. Even a reader who intends to study the mathematical results might want to skip the proofs and the lemmas (specialized, nonintuitive theorems introduced solely to facilitate later proofs). A high school course in plane geometry, some college-level algebra, and a good deal of patience are the only prerequisites. Background in elementary set theory would be helpful, but not essential. All the mathematical and set theory terminology is explained in this dissertation, and collected for reference in Appendix 11.1. Each mathematical statement in this dissertation is prefaced by an informal resume of its contents and purpose, as an aid to the reader.

Finally, proofs are provided for all but the most obvious theorems, to verify the validity of each theorem.



1.3 Chapter Organization



This dissertation progresses in three major phases. The first phase develops enough mathematical machinery for the next two phases. The second phase discusses which data are appropriate for the solution of the cladogram problem. The third phase develops the seven phenogram algorithms of this dissertation and evaluates their performance with respect to three different evolutionary hypotheses.

Chapters two through five constitute the first phase of this dissertation. Chapter two develops the general area of mathematics which will be employed in this dissertation: the theory of sets and relations. Chapter three develops well-known concepts of genealogy in a notation appropriate for subsequent chapters of this dissertation. Chapter four develops the concepts of Operational Taxonomic Units (OTUs) and isolation between OTUs. Chapter five develops the concepts of dendrogram and cladogram.

Chapters six and seven constitute the second phase of this dissertation. Chapter six discusses several sources of molecular data commonly used for the cladogram problem, and shows how these sources can be reduced to a phenetic matrix. Chapter seven suggests that molecular data are an appropriate source for phenetic matrixes, because of the low level of selection ("non-Darwinian evolution") on individual molecular characters. It then proposes three evolutionary hypotheses for the evolution of molecular characters, and demonstrates some simple consequences of these hypotheses for use in later chapters.

Chapters eight and nine constitute the third phase of this dissertation. Chapter eight develops a mathematical machinery for phenogram algorithms, and chapter nine uses this machinery to evaluate each of seven phenogram algorithms with respect to the three evolutionary hypotheses described in chapter seven. Chapter nine is followed by the list of references and three appendices.

Appendix 11.1 is a glossary of all primitives and defined terms employed in this dissertation. Appendix 11.2 is a chart which details the flow of reasoning employed in this dissertation. Appendix 11.3 is a computer flowchart for the unweighted pair-group (Sokal and Michener, 1958) phenogram algorithm.

2. MATHEMATICAL MODELS IN BIOLOGY


2.1 Structural versus Empirical Models



Biology is a young science in its use of mathematical models. Few areas of biology depend upon mathematics as more than a superficial aid for summarizing observations. Genetics is one of the few major areas of biology some of whose first principles are described by a genuine mathematical model. Most mathematical models in biology do an incomplete job. They are either
(i) empirical models, or
(ii) structural models
but rarely both.

An empirical model is a mathematical model which has been constructed largely to fit a particular class of data or observations. Characteristically, this kind of model is used for fitting curves through biological data under assumptions such as "linearity", "normal distribution", etc., without much biological basis for knowing whether these conditions indeed hold. An empirical model often solves for population or environmental parameters which have little or no basic biological meaning. Often, an empirical model gives the experimenter a false sense of meaning, when he has in fact done little more than translate one pile of numbers into another, smaller pile.

A structural model falls at the other extreme. This kind of model starts from "obvious" first principles and works forward toward observations. Unfortunately, there are many perfectly plausible candidates for the status of "first principles" which are false, and because the formulation of a structural model is often so remote from actual observations, it is very difficult to separate the plausible and true statements from the plausible and false statements. Consequently, a combination of quite plausible basic assumptions may lead to ridiculous conclusions. Another common shortcoming among structural models is that they often lead to conclusions which either cannot be observed experimentally or are of no interest to the biologist.

Biology is still very much an experimental science, and most good biologists reserve their credence for results closely supported by good data, the more the better. There is a widespread failure among biologists to recognize that a good mathematical model should not only fit observations, but also arise from truly biological first principles. In this dissertation, I develop Williams' (1970) structural model, and demonstrate how this structural model leads to a well-known empirical model which is already in use among practicing biologists. The axioms for Williams' (1970) model are quite simple and straightforward. Stated informally: no organism is a parent of itself, and ancestry is unidirectional. From these beginnings plus the axioms of set theory, I construct a system for evaluating seven phenogram algorithms, some of which are in current use by numerical taxonomists.
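The two informal axioms can be checked mechanically on a small parent relation. The sketch below is illustrative only: the organism labels and the (parent, child) pairs are invented, the function names are hypothetical, and reading "ancestry is unidirectional" as "no organism appears among its own ancestors" is an assumption made here rather than the formal wording developed in chapter three.

```python
def ancestors(parent_of, x):
    """Collect every organism reachable from x by following parent links upward."""
    found = set()
    frontier = {p for (p, child) in parent_of if child == x}
    while frontier:
        found |= frontier
        # Step one generation further back, skipping organisms already seen.
        frontier = {p for (p, child) in parent_of if child in frontier} - found
    return found


def no_self_parent(parent_of):
    # Informal first axiom: no organism is a parent of itself.
    return all(p != child for (p, child) in parent_of)


def ancestry_is_unidirectional(parent_of):
    # Assumed reading of the second axiom: no organism is its own ancestor.
    organisms = {x for pair in parent_of for x in pair}
    return all(x not in ancestors(parent_of, x) for x in organisms)


# Invented relations, written as (parent, child) pairs.
valid = {("p", "q"), ("q", "r")}
loop = {("p", "q"), ("q", "p")}      # p and q are each other's parent
selfish = {("p", "p")}               # p is its own parent

print(no_self_parent(valid), ancestry_is_unidirectional(valid))
print(no_self_parent(loop), ancestry_is_unidirectional(loop))
print(no_self_parent(selfish))
```

The `loop` relation shows why a second axiom is needed at all: it contains no self-parent, yet each organism turns up among its own ancestors.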

The achievement of this dissertation is that it starts from purely structural principles and derives computing algorithms which are already known to work. Until recent years, the construction of phenograms from biological data has depended almost exclusively upon empirical models. A bewildering variety of phenogram algorithms confronts the interested biologist (Sørenson, 1948; Sneath, 1957; Sokal and Michener, 1958; Rogers and Tanimoto, 1960; Lockhart and Hartman, 1963; Edwards and Cavalli-Sforza, 1964; Camin and Sokal, 1965; Fitch and Margoliash, 1967; Moore et al., 1969; Dayhoff, 1969; Kluge and Farris, 1969), but there are virtually no guidelines as to the conditions under which these algorithms can be trusted to give meaningful results beyond the intuitive judgment of the investigator. There is even a widespread feeling that just because "the computer does it", the result is somehow more objective than in conventional biological studies (Sokal and Sneath, 1963, p. 49; Rogers et al., 1967).

The first attempts to understand the conditions under which phenogram algorithms yield a cladogram were computer simulations (Camin and Sokal, 1965; Farris, 1970). Computer simulations employ "made-up" data consistent with known properties, and evaluate a phenogram algorithm on the basis of its performance with these made-up data. This is a step ahead of the experiential approach of Sokal and Sneath (1963, pp. 189-194), who base their judgment on data from the real, biological world, with unknown properties. Computer simulation can be used to reject an algorithm if the algorithm fails to reconstruct the original cladogram. But computer simulation can never be used to accept an algorithm, because it is always possible that the algorithm would fail for an untried data set, even though it had been successful on all data sets tried theretofore. The only way to show that an algorithm always works is by mathematical proof.

In 1968, Estabrook presented a method for restricting the choice of dendrograms to a small subset which must include the true cladogram (see also Hendrickson, 1968). Estabrook solved the cladogram problem subject to the "maximum parsimony" hypothesis (see chapter seven) and the following additional restrictions:
(i) the (discrete) character states for each OTU are known;

(ii) the ancestral sequence of character states is known, and is irreversible; and

(iii) the ultimate ancestral state happened only once.
Estabrook's work marks a major turning point in the literature of numerical taxonomy, because it employs ordinary methods of mathematical proof (comparable to the methods of this dissertation), rather than the "heuristic" (i.e., inspirational, but not necessarily reliable) approaches of other authors. There are several shortcomings in Estabrook's paper. First, the restrictions placed upon the true cladogram are not satisfied in most taxonomic investigations; ordinarily, the investigator does not know the ancestral sequence of character states, and is not even sure that this sequence is irreversible. Second, Estabrook does not demonstrate the uniqueness of his result; it is perfectly possible that there are two or more distinct dendrograms satisfying Estabrook's conditions for a particular data set: the true cladogram and other, quite different dendrograms.

For reasons detailed in chapter seven, this dissertation does not employ a maximum parsimony hypothesis of evolution; rather, it develops three "divergence" hypotheses which all correspond to the intuitive idea that a more ancient ancestral separation for a pair of OTUs results in a greater dissimilarity value for that pair of OTUs. This dissertation demonstrates that any dendrogram generated by the appropriate phenogram algorithm is a true cladogram, and that this cladogram is unique. The phenogram algorithm which emerges as the best (that is, the unweighted pair-group method of Sokal and Michener (1958)) is guaranteed to arrive at the correct solution after a relatively small, finite number of steps.
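The pair-group idea can be sketched in a few lines. What follows is a minimal illustration of average-linkage merging as the unweighted pair-group method is commonly described, not a transcription of the flowchart in Appendix 11.3. The function name `upgma`, the nested-tuple output, the OTU labels, and the dissimilarity values are all invented for the example; the values are chosen so that a more ancient separation gives a larger dissimilarity.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple dendrogram by repeated average-linkage merging.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the dissimilarity of OTUs a and b.
    """
    # Each cluster is (tree, members): tree is a label or a pair of subtrees.
    clusters = [(name, (name,)) for name in names]

    def average(c1, c2):
        # Unweighted mean over every OTU pair that straddles the two clusters.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Invented dissimilarities: a and b separated most recently, d most anciently.
names = ["a", "b", "c", "d"]
dist = {
    frozenset({"a", "b"}): 2,
    frozenset({"a", "c"}): 10, frozenset({"b", "c"}): 10,
    frozenset({"a", "d"}): 20, frozenset({"b", "d"}): 20, frozenset({"c", "d"}): 20,
}
tree = upgma(names, dist)
print(tree)  # ('d', ('c', ('a', 'b')))
```

Each pass through the loop removes one cluster, so a data set of n OTUs is finished after n - 1 merges, which is one way to read "a relatively small, finite number of steps".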

2.2. The Axiomatic Method in Biology.



The basic mathematical tool which is employed in this dissertation is the method of axiomatics. Axiomatics is different from conventional mathematical derivations in the biological literature in that it is entirely self-contained. Except for the axioms of set theory (which could be included if necessary), the only assumptions in this dissertation are summarized as two simple axioms. All other statements in this dissertation have been proved in terms of these two axioms using only the methods of ordinary mathematical proof. Similarly, the only undefined terms, or primitives, in this dissertation are "the set of all organisms" and the relation "is a parent of". All other terms in this dissertation have been defined in terms of these two primitives using only the grammar of logic. The strict, deductive structure of an axiomatic theory allows one to move easily from a theorem to the prior statements which were used to prove it (and similarly from a defined term to the prior terms by which it was defined). Since every theorem is proved in terms of prior theorems and the unproved axioms, and similarly since every term is defined in terms of prior terms and the undefined primitives, an axiomatic theory is noncircular.
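Noncircularity can be pictured as a property of a dependency table: the terms can be listed so that each one appears only after the terms used to define it. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependency tables are hypothetical and serve only to illustrate the idea; they are not the actual definitions of this dissertation, and `definition_order` is an invented helper name.

```python
from graphlib import CycleError, TopologicalSorter


def definition_order(depends_on):
    """Return the terms ordered so each follows the terms that define it,
    or None when the definitions are circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# Hypothetical table: each term maps to the prior terms used to define it.
noncircular = {
    "set of all organisms": [],                 # primitive
    "is a parent of": [],                       # primitive
    "ancestor": ["is a parent of"],
    "pedigree": ["ancestor", "set of all organisms"],
}
circular = {"ancestor": ["pedigree"], "pedigree": ["ancestor"]}

order = definition_order(noncircular)
print(order is not None)            # True: a valid ordering exists
print(definition_order(circular))   # None: the two terms define each other
```

The same table, read backward, is what lets one trace a theorem or term to the prior statements it rests on.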

The methods of this dissertation are those of naive axiomatics, that is, the standards of logical statement and proof employed by most mathematicians most of the time. Naive axiomatics is contrasted to the more cumbersome formalized axiomatics of Woodger's (1937) The Axiomatic Method in Biology.

2.3. Biological versus Mathematical Definitions.



It is important for the biological reader to distinguish between his intuitive concept of definition and the mathematical concept of definition. In biology, a new term is defined in terms of previously defined terms and concepts in general use. A biological definition is rarely required to correspond exactly to what the author really means. The usual practice is for the definition to "come close" to what the author really means. Then the author proceeds to enumerate the "exceptions" to the definition. In mathematics, a definition means exactly what it says and says exactly what it means. The term which is being defined and the phrase which is used to define it are completely interchangeable in any statement subsequent to the definition. This is often not possible for biological definitions. Because of the presence of exceptions in biological definitions, arbitrary substitution of the term-being-defined and the defining-terms may readily lead to a logical absurdity. Woodger (1952, pp. 219-252), for example, has taken two "respectable" biological definitions of species, translated them into logical terms, and proved they lead to an absurdity. The reason he can do this is that the biological definitions he uses were never really meant to be exactly true, merely approximately true.

Definitions in this dissertation are mathematical definitions. The biological reader should be continually on the alert for this distinction. Each defined term in this dissertation means nothing more nor less than the sequence of terms of which the definition consists. I have tried to define terms which correspond to intuitive biological notions, but in cases where I have failed to capture the intuitive biology, the mathematical content of the definition takes precedence. In chapter four, for example, I define "is isolated from" to correspond roughly to the biologist's concept of reproductive isolation. Because of shortcomings in the biological notion, I have found it necessary to adopt a more stringent definition of my own. This mathematical definition may or may not correspond to a particular biologist's notion of isolation, but when I employ the term "is isolated from" in subsequent discussion, I make reference to my definition only. If a biologist "disagrees" with my concept of "is isolated from" (i.e., does not feel that it corresponds to his intuitive notion of isolation), then any theorem which I prove using "is isolated from" may be false if he substitutes his intuitive notion of isolation for my notion of "is isolated from"; my theorems are guaranteed to be true only for the concept as it is defined.

2.4. Model Theory in Biology.



Model theory (Robinson, 1965) is a branch of mathematics concerned with the correspondence of mathematical models to the "real world". A model theory theory is an axiomatic theory, such as will be developed in this dissertation; biologists ordinarily call this a "model". A model theory model is some system in the real world to which the model theory theory corresponds; biologists ordinarily call this simply the "real world". Whenever we construct a model theory theory corresponding to some biological process, we would like this model theory theory to have a model theory model in the real, biological world.

The problem in actually executing this plan is that there are differing views as to what the real, biological world actually is:
(i) Some statements about the biological world are agreed upon by all because they are tautologous. For example, Axiom I (chapter three) states that "no organism is a parent of itself". All biologists agree to this (so long as "organism" and "is a parent of" are understood in the conventional sense) because it is inherent in what we mean by the term "parent".

(ii) Some statements about the biological world are true because they can be proven by experiment. For example, the statement "some viruses lack DNA" can be proven true by finding a single virus which lacks DNA: say, the Rous sarcoma virus (an RNA virus).

(iii) Unfortunately, many biological statements are neither inherently tautologous nor verifiable by experiment. For example, the statement "all viruses have either DNA or RNA or both" is accepted by most biologists, but it cannot be proved until every single virus particle (including every foreseeable particle which might be called a "virus") is analyzed.

(iv) A final shortcoming of "biological reality" as biologists think of it is that many important techniques required for biology are not really part of the traditional subject matter of biology. For example, a statement of the form "if x is true and y is true, then x is true" is not really a part of biology, although any biologist would accept it. This statement belongs to logic. In order to achieve anything more than the most trivial conclusions, we need not only logic, but also some of its more sophisticated derivatives (set theory, real number theory, linear algebra).
This dissertation studiously avoids unprovable biological statements, such as those of type (iii), but since you can't get something for nothing, the dissertation doesn't give you any very useful conclusions which apply unconditionally. For example, this dissertation turns out to be fairly useless to any biologist not willing to accept some form of divergent evolution. All of the theorems in Chapter nine have the form (roughly speaking): "if evolution is reasonably divergent, then such-and-such algorithm gives a true cladogram". The biologist who is willing to commit him/herself to one of the hypotheses of divergent evolution developed in this dissertation can use the theorems proved in Chapter nine; the biologist who doesn't accept any of these hypotheses of divergence will find a chapter full of theorems which (though true) don't apply to his/her version of biological reality.

The model theory model for the axiomatic theory of this dissertation is any biological world in which Axiom I and Axiom II and the axioms of set theory are true. Axioms I and II should be obvious to any biologist; the axioms of set theory should be acceptable to biologists. The only "doubtful" axioms are ones which apply to infinite sets (conspicuously, the Zorn Lemma, chapter three); since no statement in this dissertation requires infinite sets (although many permit it), and since the finite counterparts of the infinite-set-axioms are undisputed, the set theory part of the real world should be acceptable to biologists. The most serious hazard in the use of set theory is that of misunderstanding (review Section 2.3; see Section 2.9, "A Precaution").

Suppose a theorem turns out to be false in the real world. According to model theory, the only way a theorem can be false in the real world is if one or more of the axioms used to prove that theorem is false (assuming that the proofs are free of logical errors). The deductive structure of an axiomatic theory permits one to locate a false axiom or collection of axioms simply by working back from a false theorem which was proved from that axiom or axioms. The beauty of an axiomatic theory is that even if some theorem turns out to be false, then it is possible to rescue value from the axiomatic theory by locating the false axiom and changing it.

2.5. Flow of Reasoning in this Dissertation



Theorems are proved in this dissertation for three purposes: (i) to indicate the correspondence of the axiomatic theory to known situations in the real world; (ii) to prove the formal validity of the several computer phenogram algorithms under appropriate hypotheses of divergent evolution (the ultimate purpose of this dissertation); or (iii) to act as stepping stones toward the proof of theorems with purposes (i) or (ii). A great many theorems fall into class (iii), and thus are difficult to justify or explain intuitively at the time they are being proved. I have endeavored to place such theorems where they fit best, but often this is a rather lame effort. The least intuitive theorems are called "lemmas", and are proved immediately before they are used with a remark to the reader that their content may be obscure. Appendix 11.2 provides a map of the entire flow of reasoning employed in this dissertation.

2.6. Informal Set Theory.



What is set theory? It is just that: the theory of "sets" or "bunches" or "collections" of objects, no matter what the objects may be. A bunch of grapes is a set. A pile of leaves is a set. A collection of coins is a set. However, there is no need for things to be organized into physically contiguous bunches to be considered a set. We can talk about the set of all automobiles in the United States, or the set of all red automobiles, or the set of all red 1959 Chevrolets. It is not likely that any of these sets will ever be assembled in a single, neatly-stacked bunch, yet we can still talk about them as "sets" in abstract discussions. We can even talk about sets like the "set of all married bachelors" or the "set of all female widowers". These sets are a little bit different from the previous ones, because they don't contain any members at all. Such sets are called "the empty set" or "the null set", and are just as valid for us to talk about as sets which contain very many members.

A convenient way to represent a set is as a line enclosure about the members of the set. Thus, the "set of the first two sons of Abraham" (see Figure 2.1) might be represented as follows:


Figure 2.0.





Figure 2.1.




These enclosure diagrams are called Euler-Venn diagrams. (Strictly speaking, we have not literally disinterred Isaac and Ishmael and drawn a line about them; instead, we have used their names. We shall employ this convention throughout the dissertation.) On the printed line, it is not very convenient to draw these enclosures, so we use "curly bracket" notation. The set of the first two sons of Abraham is the set, {Isaac,Ishmael}. Two conventions should be kept in mind when dealing with set notation: the order in which the members are listed doesn't matter; and repeated members don't count. Thus, the set of the first two sons of Abraham might equally well be represented as:
{Ishmael,Isaac}
or: {Ishmael,Isaac,Isaac,Ishmael,Ishmael,Isaac}
and it would still be set theoretically equivalent to our original notation. For sets having lots of members, it isn't even very convenient to list all the members. For example, the "set of all living human beings" would be much too cumbersome to write out every time we needed it, even if we could get a list of all those three billion names. Therefore, we use a notation called "definition by abstraction". The set of all living human beings is written:
{x : x is a living human being}
This is read "the set of all x such that x is a living human being". The letter x is called a "dummy variable", because it has no significance other than as an internal defining device. We could equally well use a building block:
{□ : □ is a living human being}
or a Chinese character:
{中 : 中 is a living human being}
and we would still be talking about exactly the same set. Definition by abstraction is especially convenient for infinite sets, where the curly bracket notation would be actually impossible. For example:
{x : x is an even number}
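
The two conventions of curly-bracket notation, and the idea of definition by abstraction, can be tried out with the built-in set type of a programming language. The following is a minimal sketch in Python, which is not part of the original dissertation. The finite range 0 to 19 is an assumption made only so that the abstraction can be evaluated, since an infinite set such as the even numbers cannot be listed in full.

```python
# Curly-bracket notation: order and repetition of members are ignored.
first_two_sons = {"Isaac", "Ishmael"}
reordered = {"Ishmael", "Isaac"}
repeated = {"Ishmael", "Isaac", "Isaac", "Ishmael", "Ishmael", "Isaac"}

print(first_two_sons == reordered)  # True
print(first_two_sons == repeated)   # True

# Definition by abstraction: {x : x is an even number}, restricted here
# to an assumed finite range so that the set can actually be built.
evens = {x for x in range(20) if x % 2 == 0}

# The dummy variable may be renamed without changing the set.
evens_renamed = {y for y in range(20) if y % 2 == 0}
print(evens == evens_renamed)  # True
```

The comprehension plays the role of the defining condition, and the variable inside it has no meaning outside the braces, just as with the dummy variable x.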


      Finally, we can always give a name to particular sets in order to avoid having continually to rewrite entire sets, either in curly-bracket or definition-by-abstraction notation. For example, we set up the following shorthand for use in this chapter:
U = {x : x is Abraham or a descendant of Abraham}
A={Abraham}
B={Isaac,Jacob,Esau}
C = {x : x is Isaac or a son of Isaac}
D = {x : x is a descendant of Abraham}
E = {Abraham,Isaac,Jacob}


2.7. Set Relationships.



Figure 2.2 shows each of the sets U, A, B, C, D, and E, in the Euler-Venn notation. These sets are not unrelated to one another: two of them are exactly the same, and almost all of them share one or more elements in common. Set U is called the "universe", because it contains all the members which we will be dealing with in our discussion of sets A, B, C, D, and E.


Figure 2.2.

Figure 2.2. Sets: Abraham and his descendants. Set U is the set of Abraham and all his descendants. Sets A, B, C, D, and E, are subsets of U. Dots ( ... ) indicate additional members which are not named explicitly.

The simplest operation in set theory is the membership operation. We say, for example, that "Isaac is a member of set B". The notation for this operation is:
Isaac ∈ B.
The opposite of membership is nonmembership. We say "Abraham is not a member of the set B", and employ the notation:
Abraham ~∈ B.
Two sets are equal if they have exactly the same members. For example, set B equals set C, denoted:
B = C,
because every member of set B is a member of set C, and every member of set C is a member of set B. One set is a subset of another if every member of the first is also a member of the second. For example, set B is a subset of set D, denoted:
B ⊆ D,
because every member of set B is also a member of set D. It is also true that:
B ⊆ C,
because every member of set B is also a member of set C. The relationship between sets B and D is a special subset relation, namely, "is a proper subset of". We say that set B is a proper subset of set D, denoted:
B ⊂ D,
because every member of set B is also a member of set D, and there is at least one member of set D which is not a member of set B. Clearly, it is not true that B ⊂ C; every member of set B is also a member of set C, but there is no member of set C which is not also a member of set B.
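
The relationships described above can be checked mechanically. The following Python sketch is an illustration only and is not part of the original dissertation. It assumes a truncated universe of five named persons in place of the full set U, and it assumes that the sons of Isaac are Jacob and Esau, which is what the equality B = C requires.

```python
# Assumed, truncated universe: only five named members of the family.
U = {"Abraham", "Ishmael", "Isaac", "Jacob", "Esau"}
sons_of_isaac = {"Jacob", "Esau"}  # assumption used to evaluate set C

A = {"Abraham"}
B = {"Isaac", "Jacob", "Esau"}
C = {x for x in U if x == "Isaac" or x in sons_of_isaac}
D = {x for x in U if x != "Abraham"}  # descendants of Abraham within U
E = {"Abraham", "Isaac", "Jacob"}

print("Isaac" in B)        # membership: True
print("Abraham" not in B)  # nonmembership: True
print(B == C)              # equality: True
print(B <= D)              # subset: True
print(B <= C)              # subset: True
print(B < D)               # proper subset: True
print(B < C)               # proper subset: False
```

In Python, `<=` tests "is a subset of" and `<` tests "is a proper subset of", so the last two lines reproduce the distinction between B ⊂ D and B ⊂ C drawn in the text.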

      Throughout this dissertation, we shall use N() to denote the cardinality, or number of members, of the set in parentheses. For example:
N(A) = 1
N(B) = 3
N(Ø) = 0.
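
As a small illustration, not taken from the original dissertation, the cardinality function N() corresponds to counting the distinct members of a set. In this Python sketch the function name N is borrowed from the text, and the built-in `len` does the counting.

```python
def N(s):
    """Cardinality: the number of distinct members of the set s."""
    return len(s)

A = {"Abraham"}
B = {"Isaac", "Jacob", "Esau"}
empty = set()  # the empty set

print(N(A), N(B), N(empty))  # 1 3 0
```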


2.8. Special Sets.



The set A is given a special name, because it has only one member: it is called a singleton. Any member of the universe can be made into a singleton simply by enclosing it in curly brackets. Here are some of the singletons which can be created from our universe, U:
{Abraham}
{Isaac}
{Ishmael}
{Jacob}
{Esau}
A singleton is not the same thing as the single member it contains. For example, it is not true that Abraham is the same as {Abraham}. In fact, each of the following singletons is different from the others:
{Abraham}
{{Abraham}}
{{{Abraham}}}
{{{{Abraham}}}}
The first set is the "set of Abraham"; the second set is the "set of the set of Abraham"; etc.
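
The distinction between a member and the singleton containing it can be made concrete. The following Python sketch is an illustration only and is not part of the original dissertation. It uses `frozenset` rather than `set`, because Python requires the members of a set to be immutable, and a set of sets therefore has to be built from frozen sets.

```python
abraham = "Abraham"
s1 = frozenset([abraham])  # {Abraham}
s2 = frozenset([s1])       # {{Abraham}}
s3 = frozenset([s2])       # {{{Abraham}}}

print(abraham == s1)  # False: a member is not its singleton
print(s1 == s2)       # False: the nested singletons differ
print(len(s1), len(s2), len(s3))     # 1 1 1
print(abraham in s1, abraham in s2)  # True False
```

Each of the three sets has exactly one member, yet Abraham is a member only of the first; the sole member of the second is itself a set.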

      One of the most important sets in set theory has no members at all. It is called the empty set or the null set, and is denoted by either one of the following notations:
{} or Ø
In set theory, all null sets are equal (because they all have exactly the same members, namely, no members at all). Some examples of the null set are:
{x : x is a married bachelor}
{x : x is a female widower}
{x : x is a man who walked on the moon in 1900 AD}
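
That all null sets are equal can also be seen by building two of them from different, contradictory conditions. The following Python sketch is an illustration only and is not part of the original dissertation. The four records and the helper functions `is_bachelor` and `is_widower` are hypothetical, introduced here only to give the conditions something to range over.

```python
# Hypothetical records: (name, sex, marital status).
people = [
    ("P1", "male", "married"),
    ("P2", "male", "single"),
    ("P3", "female", "widowed"),
    ("P4", "male", "widowed"),
]

def is_bachelor(p):
    """A bachelor is taken here to be an unmarried (single) man."""
    return p[1] == "male" and p[2] == "single"

def is_widower(p):
    """A widower is taken here to be a widowed man."""
    return p[1] == "male" and p[2] == "widowed"

married_bachelors = {p for p in people if p[2] == "married" and is_bachelor(p)}
female_widowers = {p for p in people if p[1] == "female" and is_widower(p)}

# Both conditions are contradictory, so both sets have no members,
# and two sets with no members are equal.
print(married_bachelors == female_widowers)  # True
print(married_bachelors == set())            # True
print(len(female_widowers))                  # 0
```

Note that Python writes the empty set as `set()`, because the literal `{}` denotes an empty dictionary in that language rather than the empty set.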





2.9. A Precaution.



One of the hazards of the English language is a very imprecise little word: the word "in". This word is probably responsible for the infinite frustrations of the beginning student of set theory, and may even be responsible for the inability of taxonomists to apply

2.10. Set Operations.





2.11. Mathematical Relations.





3. PRINCIPLES OF GENEALOGY.


     

3.1 First Principles.


3.2 Ancestor-or-Equal-to


3.3 Pedigrees.


3.4 Common Pedigrees.


3.5 The Monoparental Biocosm.


4. OPERATIONAL TAXONOMIC UNITS AND ISOLATION.


     

4.1 Operational Taxonomic Units.


4.2 The Superpartition.


4.3 Isolation.


4.4 Sampling from Isolated Sets.


4.5 Cladistic Partitions.


5. DENDROGRAMS AND CLADOGRAMS.


     

5.1 The Dendrogram.


5.2 Cardinality of a Binary Dendrogram.


5.3 The Dendrogram as a Monoparental Biocosm.


5.4 The Cladogram.


6. SOURCES OF INPUT DATA.
6.1 The Phenetic Matrix. 6.2 Comparison of Character States. 6.3 Direct Techniques for Obtaining a Phenetic Matrix. 6.4 The Immunodiffusion Technique. 6.5 Solution for the Phenetic Matrix. 6.6 Inversion of the Primary Solution Matrix.

7. DIVERGENT EVOLUTION IN MOLECULAR CHARACTERS.
7.1 Trouble with Genetic Load. 7.2 Low Selection at the DNA Level. 7.3 Low Selection at the Protein Level. 7.4 Low Selection and Divergent Evolution. 7.5 The Parsimony Hypothesis. 7.6 Divergence Hypothesis. 7.7 The Superphenetic Matrix. 7.8 Formulas for a Superphenetic Matrix. 7.9 Nesting Levels in a Dendrogram. 7.10 Weighted Average Superphenetic Matrix. 7.11 Choice of Cladogram Junctures. 7.12 Divergence-in-Mean. 7.13 Divergence and Uniform Evolution.

8. THE STRUCTURE OF PHENOGRAM ALGORITHMS.
8.1 Iteration Structures. 8.2 Refinements of a Partition. 8.3 Agglomerative and Divisive Iteration Structures. 8.4 Phenogram Algorithms Generate Dendrograms. 8.5 Cladogram Functions.

9. EVALUATION OF PHENOGRAM ALGORITHMS.


10. LIST OF REFERENCES.


Gregg JR.
The Language of Taxonomy.
New York: Columbia University Press. 1954.

Gregg JR.
Finite Linnaean structures.
Bull Math Biophys. 1967;29:191-206.

1. Arnheim N, Prager EM, Wilson AC.
Immunological prediction of sequence differences among proteins. Chemical comparison of chicken, quail, and pheasant lysozymes.
J Biol Chem. 1969 Apr 25;244(8):2085-2094.
PMID: 4889463.

2. Barnabas J, Goodman M, Moore GW.
Evolution of hemoglobin in primates and other therian mammals.
Comp Biochem Physiol B. 1971 Jul 15;39(3):455-482.
PMID: 5001181.

3. Bastian H.
And Then Came Man.
New York: Viking Press. 1964.

4. Bott R, Mayberry JP.
Matrices and Trees.
In: Morgenstern O (ed). Economic Activity Analysis.
New York: John Wiley and Sons, Inc. 1954.

5. Britten RJ, Kohne DE.
Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms.
Science. 1968 Aug 9;161(841):529-540.
PMID: 4874239.

6. Brues AM.

7. Buck RC, Hull DL.

8. Burnet FM.
A certain symmetry: histocompatibility antigens compared with immunocyte receptors.
Nature. 1970 Apr 11;226(5241):123-126.
PMID: 5461774.

11. APPENDICES.


11. ADDITIONAL READINGS.


ANNOTATIONS. EXPLANATORY NOTES.
1/12/2008, by G. William Moore, MD, PhD


This webpage is an attempted reconstruction of my PhD thesis. I have corrected some obvious typographical errors, and I hope not introduced too many new ones. I employ boldface for emphasis, rather than underline. I have changed slash-through (not available in Unicode) to ~ (tilde) for negations. Where possible, I have added PubMed hyperlinks to references. The explanatory notes provided herewith include etymologic explanations, additional references, and other notes showing progress in this and related health-science areas over the past three decades.
Hebrew: abrhm: Abraham.
Hebrew: yxahk: Isaac.
Hebrew: uxa: Esau.
Hebrew: Hagar. Ishmael. Sarah. Rebekah. Keturah. Zimran. Jokshan. Medan. Midian. Ishbak. Shuah. Nebaioth. Kedar. Abdeel. Mibsam. Mishma. Dumah. Massa. Hadad. Tema. Jetur. Naphish. Kedema.

Greek: κλαδακι : kladaki: sprig, twig.
Greek: κλαδεμα : kladema: pruning.
Greek: δενδρο, δεντρο : dendro, dentro: tree.
Greek: φαινομαι : fainomai: to appear like, to be visible.
algorithm: named after al-Khwarizmi, mediaeval Arab mathematician.
Greek: γραμμα gramma: letter.
Greek: γραφη grafy: writing. scripture.
Greek: αγια γραφη : Agia Grafy: Holy Scripture.
Greek: δια : dia: by, for.
6.6 Inversion of the Primary Solution Matrix. Using the Leontief Matrix. V. Leontief, Nobel Laureate in Economics, 197x.
7.11 Choice of Cladogram Junctures.
Greek: κλαδακι: kladaki: sprig, twig.
Dr. Mary B. Williams.
Dr. H. R. van der Vaart [d. 2002].
Dr. Henry Schaffer.
Dr. Harvey Charlton.
Dr. Donald Huisingh.
Prof. N. Rashevsky [d. 198x].

Last updated, 3/15/2008, by G. William Moore, MD, PhD