WHAT IS PATHOLOGY INFORMATICS?


Copyright © 2001 G. William Moore, MD, PhD.
http://www.medparse.com/whatpinf.htm

G. William Moore, MD, PhD [1,2,3],

From: Pathology and Laboratory Medicine Service (113), Baltimore VA Maryland Health Care System [1], Baltimore, MD.
Department of Pathology, University of Maryland School of Medicine [2], Baltimore, MD.
Department of Pathology, The Johns Hopkins Medical Institutions [3], Baltimore, MD.



0. TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. AUTOPSY DATA.
4. SURGICAL PATHOLOGY AND CYTOPATHOLOGY DATA.
5. CLINICAL LABORATORY INFORMATION SYSTEMS.
6. TISSUE IMAGES.
7. INTERNET AUTOPSY REGISTERS.
8. STATISTICS, EPIDEMIOLOGY.
9. PATIENT IDENTIFICATION, ENCRYPTION, HIPAA.
10. INTERNET PROGRAMMING: HTML, PERL.
11. COMPUTATIONAL LINGUISTICS.
12. LOGIC, SET THEORY, FUZZY SETS.
13. COMPUTATIONAL COMPLEXITY.
14. GENOMICS, PROTEOMICS.
15. IMAGE ARCHIVING.
16. IMAGE ANALYSIS.
17. TISSUE MICROARRAYS.
18. CONTROLLED VOCABULARIES, SYNTAX.
19. CANONICAL FORMS.
20. MEDICAL ONTOLOGIES.
21. MEDICAL ETHICS.
22. DISCUSSION.
23. REFERENCES.
24. APPENDIX A. GLOSSARY.
25. APPENDIX B. INTRODUCTION TO NUMBER THEORY.
26. APPENDIX C. INTRODUCTION TO SET THEORY.



1. ABSTRACT.





2. INTRODUCTION.



      The PURPOSE OF THIS BOOK is to form a nucleus of exchange between pathologists and informaticians; to inform one another about the respective traditions and doctrines of each others' fields; and to point toward a synthesis of both fields into a genuine, self-sufficient intellectual entity of PATHOLOGY INFORMATICS

      PATHOLOGY is the study of the etiology and pathogenesis of disease. MEDICAL INFORMATICS is the study of the accumulation, archiving, and management of large amounts of human medical information. PATHOLOGY INFORMATICS is an application of the special insights and experiences of pathology to medical informatics.

      Traditionally regarded as a stepchild of medical informatics, it is becoming clear that there are special traditions and doctrines that are more-or-less unique to pathology. Pathologists have an experience and training distinct from their colleagues in other medical fields. Most practicing pathologists are generalists, to a greater extent than most internists or surgeons. Even specialists in pathology at large university centers often rotate through autopsy or general surgical pathology services. Thus pathologists have a sense of pathologic processes as they manifest themselves morphologically throughout the entire body.

      Most practicing pathologists serve as consultants, who do not actually meet the patient directly. Although this limitation is a hindrance in some respects, it has traditionally made the pathologist keenly aware of medical teamwork, and the accompanying need to unambiguously identify the patient, produce legible reports in a timely manner, and maintain patient confidentiality.

      These two features of pathology practice have an impact on pathology informatics.



3. AUTOPSY DATA.



      An AUTOPSY, or POSTMORTEM EXAMINATION, is a medical examination of the human body after death. There are four general purposes of the resulting AUTOPSY REPORT.
(1) Determine the immediate and underlying causes of death, and the pathophysiologic processes leading up to them;

(2) Serve as a teaching resource for medical students, hospital residents, and pathology residents;

(3) Serve as an adminstrative resource for quality control of untoward events and trends among hospital deaths;

(4) Serve as a research resource and tissue resource, particularly for large studies involving comparison with other autopsy reports.
Pathology informatics has several important roles in managing autopsy reports:
(1) Indexing autopsy cases for teaching and research. Measuring the quality of the index by RECALL and PRECISION, tools used in library science for assessing the recovery of publications.

(2) The autopsy is a snapshot of the patient at a single moment in time. It is a more complete examination than any medical survey made during the patient's life. However, there is no temporal component, except from prior records on the patient. This involves a unique reasoning process.

(3) Distribution of autopsy records is not subject to all the constraints placed on records among living patients. However, this scenario is rapidly changing (vide infra).

(4) Each autopsy has approximately 20 tissue blocks, in principle available for special stains and DNA investigations.




4. SURGICAL PATHOLOGY AND CYTOPATHOLOGY DATA.



     



5. CLINICAL LABORATORY INFORMATION SYSTEMS.



      Pathologists in the clinical laboratory work under two significant constraints: they process huge numbers of laboratory tests, often beyond a million per year; and they do not actually meet the patient directly, so that the laboratory specialist is keenly aware of medical teamwork, and the accompanying need to unambiguously identify the patient, produce legible reports in a timely manner, and maintain patient confidentiality.

      Much of clinical laboratory work is automated, and has served as a leader in computerization over the past two decades.



6. TISSUE IMAGES.



      The diagnostic interpretation of tissue images is the exclusive domain of the pathologist. These interpretations are based upon:
(1) what you see on the image;

(2) what you know about the patient's clinical history;

(3) what you know about the natural history of disease.


      It is possible to create electronic images of tissues, and to archive these images and distribute them over the internet for research and educational purposes.



7. INTERNET AUTOPSY REGISTERS.



      There are two public autopsy registers currently available on the internet: The Johns Hopkins Autopsy Resource; and the Goethe University Autopsy Register. There are numerous additional autopsy registers, available exclusively through password protection. These deidentified, public autopsy registers allow researchers to prototype autopsy questions, without having to obtain complex adminstrative permissions to examine patient records directly.

      THE JOHNS HOPKINS AUTOPSY RESOURCE (JHAR). Over the past three decades, there have been proposals in the pathology literature for inter-institutional sharing of pathology data (Carter et al, 1981; Peery, 1978; Wagner, 1996; Mullick, 1997; Moore et al, 1996). The Johns Hopkins Autopsy Resource (JHAR) is an Internet website, founded as an institutional database in 1980, and posted publicly in 1995, that lists over 50,000 autopsy facesheets, on patients born over a span of two centuries. An autopsy facesheet is the summary of final diagnoses, typically appearing as the first page in an autopsy report. The JHAR corresponds to an estimated one million tissue blocks, predominantly formalin-fixed and paraffin-embedded. Over 1300 publications in scholarly journals have resulted from the cases listed in the JHAR, and all citations, many with PubMed hyperlinks, are available on the website. Studies based upon data mining the JHAR include case reports (Hutchins et al, 1996), large autopsy case series (Arcidi et al, 1981; Vigorita et al, 1980; Moore and Hutchins, 1982), and even linguistic studies of medical text (Moore et al, 1988; Moore et al, 1989).

      In order to achieve patient confidentiality in the JHAR, each autopsy facesheet consists of a demographic line, followed by diagnoses. The only public demographic information provided is: age in decades, race, sex, decade of autopsy, and a key-number, which can be used to decrypt the patient identification, if necessary. Confidentiality is protected by the double-brokered encryption system of patient identifiers, which requires the participation of both the JHAR administrator and officials of the Department of Pathology of The Johns Hopkins Medical Institutions to re-identify the individual patients (Berman et al, 1996). As an additional security measure, the key-number may correspond to multiple patients, with the number-of-patients for a given key-number known only to the JHAR administration. Diagnoses in the original autopsy facesheet have been stripped of names of persons, locations, and institutions; and diagnoses have been autocoded into generic medical language as well as into UMLS codes in XML format. The only mechanism for obtaining additional information regarding an individual JHAR autopsy facesheet is to correspond with the database administrator, who forwards the correspondence to the appropriate official at The Johns Hopkins Medical Institutions. The Johns Hopkins Medical Institutions responds in accordance with policies set by its Institutional Review Board.

      THE GOETHE UNIVERSITY AUTOPSY REGISTER (GUAR).........



8. STATISTICS, EPIDEMIOLOGY.



      There are numerous special constraints on studying human pathology materials statistically. They include:
(1) Heterogeneity of data.

(2) Missing values due to technical or ethical constraints.

(3) Non-existence of quantitative, intellectual, or mathematical models for many pathophysiologic processes.

(4) Non-existence of settled statistical doctrines for managing inevitable missingvalues.

(5) Huge volume of heterogeneous data.

(6) Dynamic nature of data.

(7) Incomplete or imprecise data.

(8) Noisy data.

(9) Missing attribute values.
(a) One approach to remedy this problem is to substitute missing values with most likely values;

(b) Another approach is to replace the unknown value with all possible values for that attribute.

(c) Still another approach is intermediate: specify a likely range of values, or likely distribution of values.


(10) Redundant, insignificant data, or inconsistent data.

(11) Summarization.

(12) Clustering or segmentation.

(13) Regression models.

(14) Classification.

(15) Concept description.

(16) Dependency analysis. Determination of medical relationships among fields in a database.

(17) Link analysis, or associations.

(18) Time-Sequence analysis.

(19) Prediction. Predict a specific value for an attribute of the object.

(20) Exploratory data analysis. Graphical models are often used.


      The existing methods of statistics include:
(1) Sample experiment;
(2) Role of statistics in evaluating research results;
(3) Problem of variability;
(4) Population versus samples;
(5) Hypothesis testing;
(6) Decision theory;
(7) Analysis of variance;
(8) Contingency tables;
(9) Correlation;
(10) Multivariate analysis.


      The major concepts of statistics include: CENTRAL TENDENCY, DISPERSION, and HYPOTHESIS TESTING.
CENTRAL TENDENCY. (Arithmetic) mean, Median, Mode, Geometric mean, Harmonic mean.

DISPERSION. Range, Standard deviation, percentiles.

HYPOTHESIS TESTING.




9. PATIENT IDENTIFICATION, ENCRYPTION, HIPAA.



      There are two, distinct problems in keeping patient records confidential, so-called (inadvertent) COMPUTATIONAL DISCLOSURE. One is to encrypt specific patient identifiers, such as name, birthdate, home address, social security number, etc. The other is to conceal enough information in the patient's detailed medical record that the patient's identity may not be revealed BY INFERENCE. There are a flurry of new regulations guiding the use of patient records for exchange between institutions, or for research purposes. These regulations are a response to the rising traffic in electronic medical records, and the corresponding real likelihood of accidental disclosure. Examples abound in the popular media:
(1) Closure of Johns Hopkins research for a week after discovery of irregularities in patient consent forms, following the death of a healthy patient in a research study.

(2) A British nurse whose psychiatric history was disclosed in a journal article, through a detailed publication of her medical history.

(3) A banker who foreclosed on outstanding loans of all his customers whom he knew had cancer, based upon a list he obtained as a hospital trustee.


      Understanding DATA ENCRYPTION has experienced a new urgency among medical professionals, on the basis of several recent developments. First, there has been an explosive growth in the exchange of electronic medical information, often without careful consideration of the consequences to patient privacy.

      Second, the public's response to this fear of disclosure of their private medical information has been the introduction of new legislation attached to the HEALTH INFORMATION PORTABILITY AND ACCOUNTABILITY ACT (HIPAA) also known as the KENNEDY-KASSEBAUM BILL.

      It is clear that clinical-medical laboratory professionals must be aware of these advances, if for no other reason than to approve and contribute to the purchase of software and security systems.

      In CLASSICAL CRYPTOGRAPHY, there is a MESSAGE, a SENDER, and a RECEIVER. It is assumed that any communication between sender and receiver may be easily read by a hostile person, or ATTACKER. The usual objective of cryptography is to encode the sender's message in such a way that the attacker cannot understand it. In more complex cryptographic models, methods are used to authenticate the identity of the sender, to prevent the attacker from altering the message unbeknownst to the receiver, and to prevent the sender from later denying that he/she sent a particular message on a particular date and time. The following book is an absolutely fabulous introduction to cryptography, which can profitably be read by amateurs and professionals alike:
Schneier B.
Applied Cryptography, Second Edition. Protocols, Algorithms, and Source Code in C.
New York: John Wiley & Sons, 1996.
ISBN: 0-471-11709-9, 758 pages.


      ENCRYPTION. The initial message prepared by the sender is written as PLAINTEXT, which the sender converts into CIPHERTEXT before the message is transmitted. The process of converting plaintext into ciphertext is called ENCRYPTION. The encryption process requires an ENCRYPTION ALGORITHM and a KEY. the process of recovering plaintext from ciphertext is called DECRYPTION.

      In classical cryptography, the key is exchanged secretly between sender and receiver over secured communication, or through a trusted intermediary. The accepted view among professional cryptographers it that the encryption algorithm should be published, whereas the key must be kept secret. The purpose of publishing the encryption algorithm is to place it before the academic cryptography community, which will discover its flaws. Better that the flaws in the encryption algorithm be first discovered in academia than when the message is secretly decoded by the attacker.

      SAMPLE ENCRYPTION CALCULATION. Both the initial plaintext and the resulting ciphertext may contain words or numbers or both, but is ultimately convertible into a sequence of numerals, which can be processed by computer and distributed through public communications, including the Internet. For simplicity of discussion, we can speak of an initial plaintext expressed as a sequence of decimal numerals. For example, let the letters of the alphabet be represented as two-digit numbers from A=00 to Z=25 (ignore blank-spaces for now). Then the plaintext for THE QUICK BROWN FOX becomes numeralized as 19070416200802100117142213051423, as follows:
       THEQUICKBROWNFOX
       T  H  E  Q  U  I  C  K  B  R  O  W  N  F  O  X
      19 07 04 16 20 08 02 10 01 17 14 22 13 05 14 23
Analogously, we may form a simple key consisting, say, of the consecutive letters of the alphabet: ABCDEFGHIJKLMNOPQRSTUVWXYZABCD....
  ABCDEFGHIJKLMNOPQRS
  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U....
 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20....
A simple encryption algorithm might consist of adding the plaintext to the encryption-key, using MODULO-26 ARITHMETIC. That is, if the sum of any two numbers obtained by ordinary addition is 26 or greater, then you subtract 26 from the ordinary sum to obtain the modulo-26 sum. Thus, 05+12=17 by both ordinary and modulo-26 arithmetic, but 15+12=27 by ordinary arithmetic but 15+12=01 by modulo-26 arithmetic. Hence, the ciphertext for THEQUICKBROWNFOX is 19080619241308170901240725180212, as follows:
      19 07 04 16 20 08 02 10 01 17 14 22 13 05 14 23
 (+)  00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15   (modulo-26)
_____________________________________________________
      19 08 06 19 24 13 08 17 09 01 24 07 25 18 02 12
The ciphertext may then be decrypted by the receiver, using the decryption-key AZYXWVUTSRQPONMLKJIHGFEDCBAZYX... and modulo-26 arithmetic, as follows:
      19 08 06 19 24 13 08 17 09 01 24 07 25 18 02 12
 (+)  00 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11  (modulo-26)
_____________________________________________________
      19 07 04 16 20 08 02 10 01 17 14 22 13 05 14 23


      The ONE-TIME PAD is a nearly-perfect METHOD OF ENCRYPTION, invented in 1917 by Major Joseph Mauborgne and Gilbert Vernam. The method is also known as the SECRET LOOKUP TABLE among clinical laboratory specialists, and is the only method of encryption sanctioned by the U. S. HEALTH INFORMATION PORTABILITY AND ACCOUNTABILITY ACT (HIPAA) (2001, as amended). In this method, the sender and receiver agree upon a common secret text, which forms the key. Each key letter is used exactly once, and then discarded forever. Ideally, the key should be completely random. For example, if the receiver and sender employ the text for George Orwell's 1984 as their one-time pad, then the encryption-key becomes:
  ITWASABRIGHTCOLDDAYINAPRILANDTHECLOCKSWERESTRIKINGTHIRTEENWINSTON....
Analogously, the decryption-key becomes:
  SHEAIAZJSUTHYMPXXACSNALJSPANXHTWYPMYQIEWJW
Encryption of the plaintext, THEQUICKBROWNFOX, yields the ciphertext, BAAQMIDCJXFPPTZA, as follows:
       THEQUICKBROWNFOX
 (+)   ITWASABRIGHTCOLD
_______________________
       BAAQMIDCJXFPPTZA
that is:
      19 07 04 16 20 08 02 10 01 17 14 22 13 05 14 23  (plaintext)
 (+)  08 19 22 00 18 00 01 17 08 06 07 19 02 14 11 03  (encryption-key)
_____________________________________________________
      01 00 00 16 12 08 03 01 09 23 21 15 15 19 25 00  (ciphertext)
       B  A  A  Q  M  I  D  B  J  X  V  P  P  T  Z  A
Decryption of the ciphertext, BAAQMIDBJXVPPTZA, returns the plaintext, THEQUICKBROWNFOX, as follows:
       BAAQMIDBJXVPPTZA
 (+)   SHEAIAZJSUTHYMPX
_______________________
       THEQUICKBROWNFOX
that is:
      01 00 00 16 12 08 03 01 09 23 21 15 15 19 25 00  (ciphertext)
 (+)  18 07 04 00 08 00 25 09 18 20 19 07 24 12 15 23  (decryption-key)
_____________________________________________________
      19 07 04 16 20 08 02 10 01 17 14 22 13 05 14 23  (plaintext)
The next message between sender and receiver begins with the key,
 DAYINAPRILANDTHECLOCKSWERESTRIKINGTHIRTEENWINSTONSMITHHISCHINNUZZLED....
The one-time-pad has the advantage of simplicity and relative security, but it requires the maintenance of a large key, which sender and receiver must maintain in synchrony.

      The U. S. DATA ENCRYPTION STANDARD (USDES), equivalent to the DATA ENCRYPTION ALGORITHM (DEA) of the AMERICAN NATIONAL STANDARDS INSTITUTE (ANSI), and DEA-1 of the INTERNATIONAL STANDARDS ORGANIZATION (ISO), has been a worldwide standard for data encryption for over two decades. In the May 15, 1973, U. S. Federal Register, the U. S. NATIONAL INSTITUTE FOR STANDARDS AND TECHNOLOGY (NIST, formerly U. S. NATIONAL BUREAU OF STANDARDS), issued a public request for a data encryption algorithm, which ultimately evolved into USDES.

      In USDES, the plaintext is converted into a sequence of bits (0s and 1s), in blocks of 64 bits apiece, padded with trailing 0s in case the message is not an even multiple of 64. For example, in the Latin-1 character set for the AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE (ASCII), which is the International Standards Organization (ISO) standard ....., an 8-bit-sequence, in which 00100001 is A, 00100010 is B, 00100011 is C, 00010000 is blank-space, 00010001 is !, etc.

      In USDES, a 64-bit block, corresponding to plaintext, is entered into the algorithm, and a corresponding 64-bit block, corresponding to ciphertext, is returned by the algorithm. The same algorithm and key are used for both encryption and decryption, with minor changes. The key length is 56 bits. The key is expressed as a 64-bit number, but every eighth bit is an internal arithmetic check, and not used by the encryption algorithm. The algorithm is completely public. All security resides within the key.

      Until year 2000, security keys beyond 56 bits were illegal in the USA. An exporter of software for large security keys risked prosecution under the same laws as exporters of U. S. nuclear weapons. Philip K. Zimmermann, the author of PRETTY GOOD PRIVACY, fought a valiant legal battle to publicize the ridiculousness of this law, whereby U. S. commercial secrets could be easily broken into by our sophisticated foreign competitors, but not vice versa. Fortunately, this legal loophole has been corrected.

      After initial permutation of the plaintext, the block is broken into a left and right half, 32 bits apiece. Then there are 16 rounds of identical operations, in which data are combined within the key. The resulting right and left halves are joined, and a final permutation concludes the calculation. The patent belongs to IBM. Nowadays, various implementations of USDES are in widespread use, including as a built-in function in PERL. (However, I tried to use this on my free, public-domain version of PERL, and I got back a paranoid message. Apparently, the author of this version did not wish to risk imprisonment in U. S. Federal prisons.)

      In USDES, the plaintext is converted into a sequence of bits (0s and 1s), in blocks of 64 bits apiece, padded with trailing 0s in case the message is not an even multiple of 64. A 64-bit block of plaintext enters USDES, and a 64-bit block of ciphertext is returned by the algorithm. The same algorithm and key are used for both encryption and decryption, with minor changes. The key length is 56 bits. (The key is expressed as a 64-bit number, but every eighth bit is an internal arithmetic (parity) check, and not actually used by the encryption algorithm.) The algorithm is completely public. All security resides within the key.
      In USDES, the two fundamental component operations are PERMUTATION and EXCLUSIVE-OR. In a permutation step, the order of the bits in the 64-bit sequence is rearranged, an operation which is easily reversed. In an exclusive-or step, denoted ^ in PERL, each bit is subject to the operation x ^ y = z, where z=1 if x=1 or y=1 but not both; whereas z=0 if both (x=0 and y=0) or both (x=1 and y=1).
For example:
      1011001110000101010110110011100001010101101100111000010101010101
  ^   1111100001010001010111111000010100010101111110000101000101010101
______________________________________________________________________
      0100101111010100000001001011110101000000010010111101010000000000
The exclusive-or operation is likewise easily reversed. That is, if t is the plaintext, k is the key, c is the ciphertext, and t ^ k = c, then c ^ k = t. Both permutation and exclusive-or operations have the desirable mathematical properties that:
(1) the ciphertext is exactly the same size as the plaintext; and
(2) each of these operations is easily reversed.
The USDES consists of a complex sequence of permutations and exclusive-ors.

      BLOWFISH ENCRYPTION was invented by Bruce Schneier in 1993, who published the source code, and released the algorithm into the public domain. There are no restrictions on the use or distribution of Blowfish. After a four years of intensive testing in the public arena, Blowfish's only known fault is that there are some weak keys. Blowfish is present in dozens of commercial products including Symantec's Norton YOUR EYES ONLY and McAfee's PCCRYPTO. Symantec and McAfee are the two leading vendors of computer-virus protection software.

      DESIGN FEATURES OF BLOWFISH ENCRYPTION. The algorithm is USDES-like, but much faster and easier to understand and to implement on small computers. The algorithm is so simple that it could be explained to an undergraduate mathematics major. The algorithm can be written in PERL and/or JavaScript, both of which are relatively simple programming languages. The algorithm does not use large-integer arithmetic. Blowfish was constructed to meet the following design criteria.
1. Fast.
2. Compact.
3. Simple.
4. Variably Secure.
Blowfish is optimized for applications in which the key does not change often. It is significantly faster than USDES when implemented on a 32-bit microprocessor with large data caches, such as the Pentium.

      Blowfish is a 64-bit-block algorithm with a variable-length key. The algorithm consists to two parts:
1. KEY EXPANSION converts a key of up to 448 bytes into several subkey 32-bit arrays: the permutation-array, denoted Pi (i=1,...18); and the string-array, denoted Si,j (i=1,...4) and (j=0,...255).
2. DATA ENCRYPTION consists of a FEISTEL FUNCTION, F(), iterated for 16 rounds.
To encrypt, divide the initial plaintext into a left and right half, 32 bits apiece, denoted L0 and R0. Then for each i, i=1,...16:
let Li be set equal to Li-1 ^ Pi; let Ri be set equal to F(Li) ^ Pi; then swap Li and Ri.
Finally, swap Li and Ri (i.e., undo the 16th swap). Then:
Let Rfinal be set equal to R16^P17; Let Lfinal be set equal to L16^P18;
Finally, rejoin the right and left halves, denoted Lfinal and Rfinal.
      Decryption is the same as encryption, except that P1,... P18 are used in reverse order.
      Subkeys are calculated as follows:

1. Initialize P'1,... P'18, and then S'1,0,... S'4,255, in order, using the hexadecimal expansion of pi.
2. Obtain the secret-permutation array, P1,... P18, by letting Pi be set equal to P'i ^ Ki.
3.
4.


      ASYMMETRIC ENCRYPTION. In 1976, Diffie, Hellman, and Merkle introduced the paradigm of public-key or asymmetric encryption. In classical, symmetric encryption, the sender and receiver use essentially the same key, or paired keys which are obvious (symmetric) transformations of one another. In asymmetric encryption, the receiver creates a public-key and a private-key. The public-key may be distributed without restriction. The private-key is known only to the receiver, so that there are no problems incumbent upon giving a key to a possibly untrustworthy sender. The basic mathematical principle underlying asymmetric encryption is that it is an enormous computational task to factor a number which is the product of two large prime numbers.

      The most popular asymmetric encryption algorithm is the RSA public-key encryption algorithm (named after its authors: Ron Rivest, Adi Shamir, and Leonard Adleman). Prof. Rivest's security and cryptography webpage is available at URL:
http://theory.lcs.mit.edu/~rivest/crypto-security.html
Every significant public-key encryption algorithm has been patented. The U.S. patent for the RSA public-key encryption algorithm expired on September 20, 2000. However, you should not take this fact as a go-ahead to freely use the above methodology. Patent-holders are notoriously jealous of a successful invention, and it is likely that variants of the original idea have been repatented in another form that will get you in trouble if you tread upon it. The pharmaceutical industry is notorious for this sort of thing. You should first consult a patent attorney. Better yet, purchase a software license from a reputable vendor, and let the experts argue about who owns what.

      In asymmetric encryption, the public-key is the product of two, very large prime numbers, whereas the private-key is the factors themselves. The underlying principle of asymmetric encryption is the mathematical conjecture (i.e., unproved mathematical assertion) that it is an enormous computational task to factor a number which is the product of two large prime numbers. This conjecture has never been proved mathematically, but likewise in the past two decades of serious investigation by the world's best cryptanalysts, nobody has successfully challenged this mathematical assertion.

      SAMPLE CALCULATION FOR RSA ENCRYPTION. The RSA method is breathtakingly elegant in its simplicity. Beyond the need for obtaining large prime numbers and performing large-integer arithmetic, the concept is so simple that it can be explained to a bright college undergraduate in mathematics. The big problem with public use of RSA is that it is patented. The essential steps in asymmetric encryption by the RSA Method are:

PUBLIC KEY:
n = product of two prime numbers, p and q.
e is relatively prime to (p-1)*(q-1).

PRIVATE KEY:
d = (e-1) mod((p-1)(q-1)).

ENCRYPTION:
c = (te) mod n.

DECRYPTION:
t = (cd) mod n.
where n is the (public) product, e is the public (=encryption) key, d is the private (=decryption) key, t is the plaintext, and c is the ciphertext.

      The term x modulo n, or x mod n, denotes the (whole number) remainder of the division of x by n. Modulo arithmetic, or so-called 'clock arithmetic', is the mathematical method by which we determine, say, that five hours after ten o'clock, it is three o'clock. That is, the ordinary clock is a modulo-12 device, and [(5+10) mod 12] equals 3. Similarly, the second-hand and minute-hand on the clock are modulo-60 devides, and the military clock is a modulo-24 device. Modulo arithmetic has the fantastic advantage that integer arithemetic can be performed on huge integers with absolute accuracy, without having intermediate calculations exceed a predetermined size, namely, the square of the modulus. Modulo arithmetic is one of the pillars of modern cryptography.

      After determining prime numbers p and q, then calculating n, e, and d, one discards p,q. The receiver distributes numbers (n,e) publicly, whereas d is kept secret and known only to the receiver. The receiver needs numbers (n,d) to decrypt his messages.

      The paradigm of asymmetric encryption may be illustrated by a simple example that can be verified on a hand calculator. (Actually, the hand calculator is a bit tedious; it is probably faster to write a program in QBasic, Visual Basic, or PERL, if you know these languages.) In the example, let p=31 and q=37. These are not large prime numbers, but they serve as a didactic example. Then n= 31*37 = 1147.

      The next task is to select e, which must be relatively prime (i.e., not share a common factor larger than one) with ((p-1)*(q-1)) = 30*36 = 1080. For this simple example, one may simply try out all the possible values of e less than sqrt(1081) (the so-called SIEVE OF ERATOSTHENES). That is, one requires a value of e such that there is a whole-number d such that d * e = 1081. For really big primes, there are more efficient ways to obtain d,e. In the present example, e=23, d=47, and
d * e = 23 * 47 = 1081 = (1) mod 1080 = (1) mod ((p-1)*(q-1)).
That is, '1 mod 1080' denotes that the remainder of 1081 divided by 1080 is 1. Since d * e = 1 can be written equivalently as d = (e-1), we can assert that:
d = (23-1) mod(30*36) = 47.


      Let the plaintext message be the number t=13. Then we may encrypt the plaintext message, t, according to the formula in Table 1 as:
c = (t3) mod n = (1323) mod 1147 = 520.
That is:
t1 = 13, and (t1) mod 1147 = 13.
t2 = 169, and (t2) mod 1147 = 169.
t3 = 2197, so that (t3) mod 1147 = 1050.
The calculation may be continued on a small calculator by noting that:
[t4 mod 1147] equals [t * [(t3) mod 1147] mod 1147]
[t5 mod 1147] equals [t * [(t4) mod 1147] mod 1147]
....
c = [1323 mod 1147] equals [t * [(t22) mod 1147] mod 1147]
c = (1323) mod 1147 = 520.
In this manner, the whole-numbers in the intermediate calculations never exceed 11472 in size.

      When the receiver obtains the ciphertext, c=520, it may be decrypted by the formula (Table 1):
t = (cd) mod n = (52047) mod 1147 = 13.
Note that not even the sender can decrypt the initial message, t, after it has been encrypted into ciphertext, c.

      LEGAL STATUS OF ASYMMETRIC CRYPTOGRAPHY. At this time, every significant public-key encryption algorithm is patented, and several legal challenges to these patents to date have all been decided in favor of the patent holders. The U.S. patent for the RSA public-key encryption algorithm expired on September 20, 2000. However, you should not take this fact as a go-ahead to freely use the above methodology. Patent-holders are notoriously jealous of a successful invention, and it is likely that variants of the original idea have been repatented in another form that will get you in trouble if you tread upon it. The pharmaceutical industry is notorious for this sort of thing. You should first consult a patent attorney. Better yet, purchase a software license from a reputable vendor, and let the experts argue about who owns what.



10. INTERNET PROGRAMMING: HTML, PERL.



     



11. COMPUTATIONAL LINGUISTICS.



      Pathology reports in particular, and many parts of the medical record in general, have their most important information made and kept in the form of text documents. Although many pathology departments have maintained such electronic reports for decades, the decreased costs and ease of recovery of such records has stimulated the clinical specialties to maintain electronic records as well.

      The computerization of text medical records has only exacerbated our awareness of their flaws, as the records become available to more and more persons-observers on the medical team. The records are often poorly organized, with poor spelling and unintelligible abbreviations.

      Over the past forty years, the field of COMPUTATIONAL LINGUISTICS has addressed itself to understanding text documents in foreign languages. There are some quantitative studies with apparent application to pathology reports.

      COMPUTATIONAL LINGUISTICS. was invented as a modern intellectual discipline with the appearance of Noam Chomsky's thesis on the linguistics of Hebrew [Chomsky, 1949]. The central paradigm of computational linguistics is that all human languages may be represented as an interconnected collection of production rules [Bundy, 1997], in so-called Backus Naur Form (BNF), originally used by computer scientists to describe computer languages, such as ALGOL [Naur, 1960]. The development of this field over the past half-century has been dominated by Chomsky's intellectual leadership and academic trends at large universities with computational linguistics departments. The mainstream computational linguistics literature devotes an inordinate amount of attention to interpretation of peculiar natural-language sentences, that would appear strange even to a native speaker and incomprehensible to a non-native speaker (Newmeyer, 1996). By contrast, scant attention has been paid to the quantitative and statistical behavior of technically precise sentences, say, in surgical pathology free-text, which employ a restricted vocabulary, and are intended to be unambiguous in written form, even when read by an adversary (e.g., the proverbial plaintiff's pathologist in a malpractice action).

      The negative impact of these general trends in computational linguistics has been summarized by Nagao [1992], who has devoted his career to machine translation of narratives in electrical engineering from Japanese-to-English:
"Linguistic theories ... do not cover varieties of exceptional expressions which practical machine translation systems have to handle. A machine translation system, which is still imperfect and will never be completed, is exposed to very crude tests when the system construction reaches a certain stage. At that stage of development, the system is given a comparatively simple sentence for translation, with structures that can be analyzed by a grammar given to the system. After completion, people other than those who developed the system are asked to translate a variety of texts such as newspaper articles, science magazines, patent documents, contract documents, and commercial letters. Because the documents have not been adequately tested at the development stage, users are disappointed by the poor translation results produced by the system. Many of the failures of the system come from the fact that the dictionary and the grammar are not sufficient to accept such unexpected input sentences."
In our view, it would have been more fruitful for emerging technical fields such as pathology informatics if the computational linguists had devoted more attention to quantitative studies, and had attempted to delineate the quantitative behavior of domain-specific translations. It should not suffice to show that an unusual counterexample spoils a translation system; but rather that a frequent class of counterexamples spoils a translation system.

      DESCRIPTION OF THE ENCODER. The encoder is designed to embed standardized coding language, such as UMLS, into coherent free-text sections of surgical pathology reports extracted from production information systems. The extracted files will be presented to the encoder with XML tags in place denoting free text areas for processing. The encoder will add additional XML tags containing standardized identifiers for the concepts contained in the text. An XML-tagged file contains data in specific, tagged locations, so that the data can readily be recovered for indexing and for quantitative studies. For example, an XML-tagged record containing "adenocarcinoma would tie the UMLS concept unique identifier for adenocarcinoma (C0001418) to the word "adenocarcinoma" in the record. Such XML-tagged records, when compared to a gold standard, can readily be used to calculate false negative/false positive rates required for sensitivity/specificity analysis. This approach also has the advantage of maintaining the integrity of the original pathology report text: all metadata added to the report to create the tagged record is internal to XML tags. The original report text can be regenerated by stripping the tags from the record.

      The examples given here employ UMLS as the target language, but the principles should be the same for any coding language that is rich enough in pathology concepts and grammatical relationships to contain the essential ideas of surgical pathology. The XML tagging format shown above is compatible with multiple coding schemes because it allows specification of a coding scheme by name (scheme) and the value of the code for the appropriate concept (value). Although the encoder has been tested initially on the Johns Hopkins surgical pathology free-text corpus (Appendix 6), it is expected that the general principles will be applicable across institutions, to other medical free-text, and to query free-text as well. The query engine is then expected to match coded queries with coded pathology diagnoses.

      The encoder employs methods of machine translation (MT) and quantitative natural language processing (QNLP) (Nagao, 1992; Manning and Schuetze, 2000; Moore and Berman, 2000). In QNLP, it is acceptable to mistranslate a few sentences, but not too many sentences. The encoder has detailed bookkeeping mechanisms, which are required by the underlying QNLP methods, and which will eventually assist in quantitative performance studies. First, the encoder attempts to obtain a grammatical parse for each sentence in the surgical pathology text-corpus, or in a given free-text query. If the sentence passes the grammatical parser, then the encoder matches each word or term in the sentence to its corresponding UMLS code. A preliminary list of words, multiple-word terms (so-called collocations), their corresponding parts-of-speech and UMLS codes has been prepared for the high-frequency words and collocations in Johns Hopkins text-corpus (Appendix 6).

      In QNLP, a 'word' is operationally defined as a string of letters and numerals, bounded on either side by blanks or punctuation (Kucera and Francis, 1967). A list of words in descending order of word-frequency is called a "Zipf distribution." The most-frequent word has rank one; the second-most-frequent word has rank two, etc. According to Zipf's Law, rank, r, is inversely proportional to frequency, f (Zipf, 1949). Although Zipf's Law is inexact in detail (Mandelbrot, 1954; Moore et al, 1988), it encapsulates the general truth that a few hundred high-frequency words account for over half the word-occurrences in a large text-corpus. These high-frequency words are typically articles, conjunctions, complementizers, interrogatives, prepositions, pronouns, auxiliary verbs, etc., and serve as boundaries, or "barriers," on either side of medically-significant words, or "keywords" (Tersmette et al, 1988; Moore et al, 1989; Nelson et al, 1995; Wilbur, 2000). Punctuation marks are operationally defined as barrier words. For example, the collocation 'squamous cell carcinoma' is likely to occur in a free-text corpus, bounded on either side by barrier words. Thus, a list of barrier words can be used to discover nearly all the collocations in a surgical pathology or query-log free-text corpus.

      For the encoder, each word and each collocation has been assigned to a part-of-speech (possibly ambiguous), and to a UMLS code. In many instances, UMLS code-assignments have an exact character-for-character match in the official UMLS database (U. S. National Library of Medicine, 2000) or correspond to obvious synonyms. High-frequency terms are assigned manually by an experienced coder; mid-frequency and low-frequency terms are assigned either from previously published lists, or algorithmically. Each sentence-to-be-translated is pointed to its corresponding part-of-speech sequence. Then a series of formulas (production rules, expressed in so-called Backus-Naur-Form (BNF)) attempt to reduce stepwise the initial part-of-speech sequence down to a null-sentence. This is the same process by which a computer compiler is designed to interpret a syntactically correct computer program. (The essential difference between computer grammars and human grammars is that human grammars 'leak'.) Failure to reduce the part-of-speech sequence to a null-sentence suggests that the sentence is ungrammatical. Success supplies a formula that points the source sentence to positions for UMLS codes in an XML-tagged file.

      For example:
   [adenocarcinoma   of   colon   metastatic   to   lung]
   [       N                    P       N          A            P     N   ]
yields a sentence-pattern of [NPNAPN] , where A=adjective, N=noun, P=preposition. The encoder determines, either stepwise or in a single step, that [NPNAPN] is a grammatical sentence-pattern. For each grammatical sentence, the encoder assigns UMLS codes, for example:
  [ ADENOCARCINOMA     OF       COLON   METASTATIC     TO          LUNG   ]
  [    C0001418     C0332285  C0009368   C0027627    C0332286    C0024109 ]


      Finally, the encoder creates an XML tag sequence containing the codes:
  <code section scheme="UMLS">
    <c type="morph" value="C0001418>adenocarcinoma
      >c type="topo" value="C0009368">colon
        <c type="morph" value="C0027627">metastatic
          <c type="topo" value="C0024109">lung
          </c>
        </c>
      </c>
    </c>
  </code-section>
where <coding tag> tag denotes a section of coding data that is added by the encoder after each parsed text section of the document. The scheme attribute specifies the particular coding scheme to be used in the section, allowing the XML tagging format to be used with multiple coding schemes. A single document could also contain tags representing individual concepts; their type (morphology or topology, here) and value attributes represent the category of the code (or coding axis in SNOMED) and the code value. Concept tags also contain the text word or phrase from which the concept was derived and optionally contain other concept tags. The pattern of containment expresses hierarchical relationships. The coding sections do not alter the free text of the parsed sections and therefore the text and structure of the original document is always available if needed.



12. LOGIC, SET THEORY, FUZZY SETS.



      Once pathology reports are reduced to standardized form with standarized syntax, one is left with the problem of determining the INFERENCES that one may draw from a particular statement. For example, an irregular dark macule in an African American child has a different implication from the same macule on a fair-skinned elderly man with a lifetime of sun exposure.

      The particular problem of applying classical logic to pathology information is the necessity to REASON UNDER UNCERTAINTY. This reasoning process is the realm of FUZZY LOGIC.



13. COMPUTATIONAL COMPLEXITY.



      The practical, and sometimes unspoken, problem of computerization of large bodies of pathology information is that of COMPUTATIONAL COMPLEXITY. Despite all the advances in computer speed over the past several decades, there are still some computational problems beyond the reach of today's or tomorrow's computers. There are, in fact, some problems that can never be solved! (So-called Gödel problems, mostly of interest to theoretical mathematicians.)

      Many computational problems in pathology informatics can be referenced to a LIST OF n OBJECTS. When the list is short (i.e., n is small), the method of computation doesn't matter very much. For example, anyone can sort a list of 100 social security numbers in a few minutes, and it is easy to write a computer program that does the same thing. However, to sort the 300,000,000 social security numbers in the USA, different computational methods can have different computational times in many orders of magnitude, the difference between a few minutes and days, months, or years. Failure to grasp this problem has led to the failure of many laboratory information systems that worked well in demonstration form on small data sets, and the subsequent dismissal of pathologists, laboratory managers, computing center directors, etc.

      Computer scientists classify computational complexity algorithms LOG, LINEAR, LOG-LINEAR, QUADRATIC, POLYNOMIAL, EXPONENTIAL, SUPER-EXPONENTIAL, INSOLUBLE. That is:
(1) LOG, soluble in LOGn steps. For example, if you wish to FETCH a single record from a SORTED LIST of n records, then this can be accomplished in LOGn steps. The easiest way to understand this concept is to consider LOG2n steps, i.e., logarithm-to-the-base-2. Recall that: LOG22 = 1; LOG24 = 2; LOG28 = 3; LOG216 = 4; LOG232 = 5.... This is the same as saying that: 21 = 2; 22 = 4; 23 = 8; 24 = 16; 25 = 32.... Now suppose that you have a sorted list of 16 names, and you are looking for HARRY. In the first step, split the list in half. Harry belongs to the first half-list.
 Abner
 Bill
 Charley
 David
 Edward
 Frank
 George
 Harry
_________________
 Ike
 John
 Karl
 Larry
 Milton
 Norman
 Oliver
 Patrick
In the second step, split the remaining list in half. Harry belongs to the second half-list.
 Abner
 Bill
 Charley
 David
_________________
 Edward
 Frank
 George
 Harry
In the third step, split the remaining list in half. Harry belongs to the second half-list.
 Edward
 Frank
_________________
 George
 Harry
In the fourth step, split the remaining list in half. Harry belongs to the second half-list.
 George
_________________
 Harry
That is, LOG216 = 4 steps, compared to 16 steps for an unsorted list (see next item, (2)).

      If you think that it's silly to go to all this mathematical sophistication to fetch from sorted lists, consider this: the U. S. Social Security Administration has a list of some 300,000,000 social security numbers, that it must, in principle, be able to fetch. Since 228=268,435,456 and 229=536,870,912. we are talking about the difference between 29 steps and 300,000,000 steps. Suppose that the computer software can perform 1000 steps per second. Then we are talking about the difference between 0.03 seconds and 300,000 seconds = 5,000 minutes = 83 hours = 3.5 days.

      Likewise, a moderate-sized clinical pathology laboratory performs about 1,000,000 laboratory tests annually, not to mention a total patient census of, perhaps, 300,000, plus inpatient-admission-dates, encounter-dates, outpatient-visit-dates,..... Since 210=1,048,576, we are talking about the difference between 10 steps and 1,000,000 steps, 0.01 seconds and 1,000 seconds= 16 minutes. Perhaps not much, but these resources add up quickly, and there is no excuse for wasting resources when good solutions are already known.

(2) LINEAR,
soluble in n steps. For example, if you wish to FETCH a single record from an UNSORTED LIST of n records, then this can be accomplished in n steps. In the worst case, the algorithm must examine every single record before it hits the desired record.

(3) LOG-LINEAR,
soluble in nLOGn steps. For example, if you wish to SORT a LIST of n records, all the quality algorithms (QUICKSORT, HEAPSORT, etc.) employ a variant of the list-division method described above.

(3) QUADRATIC,
soluble in n2 steps.

(4) POLYNOMIAL,
soluble in nk steps, where k is a fixed integer, hopefully less than 4 or 5.

(5) NP-COMPLETE,
mathematically uncertain, but at best (4) and at worst (6). There are many interesting problems in pathology informatics, involving decision trees and routing diagrams, that fall into category (5).

(6) EXPONENTIAL,
soluble in 2n steps.

(7) SUPER-EXPONENTIAL,
soluble in greater than 2n steps. Includes Pressburger algebra.

(8) INSOLUBLE,
not soluble in an infinite number of steps. Includes such Gödel-unprovable statements as: the consistency of ordinary arithmetic; the Generalized Continuum Hypothesis; and the Axiom of Choice.
Unfortunately, there are many interesting problems in pathology informatics, involving decision trees and routing diagrams, that fall into category (5). Although the status of (5) is currently uncertain, despite the depradations of numerous brilliant and determined mathematicians over the past several decades, even if it is solved and turns out to be (6), then this solution offers only despair to hopeful pathology informaticians.

      If, as I believe, the status of (5) will remain either uncertain or solved in the pejorative direction in the foreseeable future, the mandate to the pathology informatics community is the same: large computational problems must be reduced in magnitude, based in part upon reasonable assumptions from pathology informatics.



14. GENOMICS, PROTEOMICS.



     



15. IMAGE ARCHIVING.



     



16. IMAGE ANALYSIS.



     



17. TISSUE MICROARRAYS.



      A TISSUE MICROARRAY is a checkerboard, or matrix, of minute cylindrical tissue samples (about 1-2 mm), from hundreds of different tissue samples, assembled on a single paraffin block. Five micron tissue sections from these tissue microarray blocks can be applied in the analysis of multiple DNA-gene and RNA-gene expression

      APPLICATIONS OF TISSUE MICROARRAYS. Bubendorf et al (1999) have constructed a tissue microarray containing tissue samples from different stages of human prostate cancer progression, in order to survey genetic alterations that may contribute to hormone-refractory tumors and to metastatic disease.

      CONSTRUCTION OF A TISSUE MICROARRAY. A tissue microarray instrument is used to create holes in a recipient paraffin block, and to acquire tissue cores from the donor block, by a thin-walled needle, with a specified inner diameter (0.5 to 2 mm), using an XY precision guide. After construction of the array block, multiple 5 micrometer sections are cut with a microtome. H&E-stained sections are used for histologic verification of the desired tissue on the arrayed samples.

      ADVANTAGES OF A TISSUE MICROARRAY. A tumor tissue microarray allows the investigation of molecular profiling and patterns of amplification of multiple genes, in samples representing the entire spectrum of disease progression, including distant cancer metastases. The advantage of a tissue microarray is that it facilitates standardized analysis of multiple genes in the same tumor. Tissue microarray methods produce replicate staining conditions, using the same technology, the same kinds of probes, and similar interpretation criteria. For example, in only five staining experiments, Bubendorf et al (1999) were able to screen tissue material by FLUORESCENCE IN-SITU HYBRIDIZATION (FISH), obtained from over 350 specimens, resulting in over 1400 evaluable FISH results. The ability to achieve reliable detection of gene amplifications from formalin-fixed tissues substantially extends the range of possible applications for tissue microarray technology.

      There are numerous potential tissue archives available in hospitals and academic medical institutions worldwide. In particular, The Johns Hopkins Autopsy Resource (Moore et al, 1996) is a listing of over 50,000 human autopsy summaries, corresponding to an estimated one million tissue blocks. The contents of The Johns Hopkins Autopsy Resource may be examined at URL:
http://www.netautopsy.org

      PITFALLS OF A TISSUE MICROARRAY. A possible limitation of tissue microarray technology is that the minute tissue samples acquired from the original tissues may not always be representative of the entire tumor, considering that most cancers are heterogeneous. This limitation might potentially introduce a sampling bias, although the initial studies of Bubendorf et al (1999) suggest that this bias is small.

      Furthermore, punching tissue from multiple sites from each original tumor can significantly reduce the sampling problem. However, at the present time, one should consider that tissue microarray technology as a rapid, high-throughput survey method to highlight the biologically most prevalent or clinically most promising genes and molecular markers, followed by more detailed studies

      A VIRTUAL TISSUE MICROARRAY is a computerized checkerboard consisting of images of Hematoxylin and Eosin (H&E) stained tissues, arrayed on a checkerboard and displayed on an Internet Browser. Each displayed H&E-image corresponds to an actual tissue sample in a decentralized tissue archive. Furthermore, each H&E-image has a corresponding medical history, that includes clinicopathologic information, such as tissue source, disease process (non-neoplastic, hyperplasia, cancer, etc.), chemotherapy (yes/no), radiotherapy (yes/no), patient survival time, survival with disease, survival disease-free, etc.

      PATHOLOGY INFORMATICS ASPECTS OF TISSUE MICROARRAYS. Points in a TMA can be processed statistically like individual slides. The biggest problem is missingvalues, due to dropoff of individual tissue-pieces, which are not handled very well by existing statistical formulae. The other salient problem is the NP-complete problem of computational algorithms used to draw inferences from TMA data, due to the large amounts of data generated by this methodology.



18. CONTROLLED VOCABULARIES, SYNTAX.



      CONTROLLED VOCABULARIES. Efforts to standardize pathology data have proceeded on three fronts: standardized coding; standard formatting of pathology reports; and standards for publishing collections of reports (Moore and Berman, 2000). The three dominant coding systems in anatomic pathology are: SNOMED, Read, and UMLS. The Systematized Nomenclature of Human and Veterinary Medicine (SNOMED International, or SNOMED III) is a copyrighted product of the College of American Pathologists and one of the recognized standard nomenclatures in medicine. SNOMED is used in most pathology laboratories that employ standardized coding systems. The Read Classification is employed by the National Health Service in Great Britain, and is owned by the the British crown. SNOMED, Read, MeSH, and over fifty other classifications are subsumed within the Unified Medical Language System (UMLS) Metathesaurus of the U. S. National Library of Medicine.

      The UMLS is by far the largest of all medical concept systems. UMLS is the best tool for research studies in controlled medical vocabularies. UMLS serves as an indexing tool for PubMed, a collection of over 11 million medical citations available on the Internet to the general public. It has been estimated that only 1-3% of medical concepts in general medical text are not present in UMLS. UMLS provides a uniform, integrated distribution format from over fifty biomedical vocabularies and classifications. The 1999 UMLS Metathesaurus contains 625,530 biomedical concepts and 1,362,823 different concept-names. Each Concept Unique Identifier (CUI) consists of C followed by seven decimal digits, with an accompanying synonym-name. Different synonym-names have the same CUI. UMLS is updated annually and made available to registered users, with a complete listing of CUIs and synonym-names. Since the UMLS is available cost-free to researchers worldwide for research projects, it makes the most sense to develop pathology informatics applications in UMLS.

      EXTENDED MARKUP LANGUAGE, XML. For example:
   [adenocarcinoma   of   colon   metastatic   to   lung]
   [       N                    P       N          A            P     N   ]
yields a sentence-pattern of [NPNAPN] , where A=adjective, N=noun, P=preposition. The VHP-encoder determines, either stepwise or in a single step, that [NPNAPN] is a grammatical sentence-pattern. For each grammatical sentence, the encoder assigns UMLS codes, for example:
  [ ADENOCARCINOMA     OF       COLON   METASTATIC     TO          LUNG   ]
  [    C0001418     C0332285  C0009368   C0027627    C0332286    C0024109 ]


      Finally, the VHP-encoder creates an XML tag sequence containing the codes:
  <code section scheme="UMLS">
    <c type="morph" value="C0001418>adenocarcinoma
      >c type="topo" value="C0009368">colon
        <c type="morph" value="C0027627">metastatic
          <c type="topo" value="C0024109">lung
          </c>
        </c>
      </c>
    </c>
  </code-section>
where <coding tag> tag denotes a section of coding data that is added by the encoder after each parsed text section of the document. The scheme attribute specifies the particular coding scheme to be used in the section, allowing the XML tagging format to be used with multiple coding schemes. A single document could also contain tags representing individual concepts; their type (morphology or topology, here) and value attributes represent the category of the code (or coding axis in SNOMED) and the code value. Concept tags also contain the text word or phrase from which the concept was derived and optionally contain other concept tags. The pattern of containment expresses hierarchical relationships. The coding sections do not alter the free text of the parsed sections and therefore the text and structure of the original document is always available if needed.



19. CANONICAL FORMS.



      In mathematics, a CANONICAL FORM is a preferred notation that encapsulates all equivalent forms of the same concept. For example, the canonical form for one-half is 1/2, and there is an algorithm for reducing the infinity of equivalent expressions, or "aliases," namely 2/4, 3/6, 4/8, 5/10, ... , down to 1/2. Agreement upon a canonical form is one of the features of any mature intellectual discipline. For example, the importance of a canonical form became apparent to the English-language-dictionary writers of the eighteenth century, who realized that one couldn't write a dictionary without consistent orthography. The variable orthography, say, of fourteenth-century English poet, Geoffrey Chaucer, could not be supported by the need to have each individual English word appear and be defined at only one place in the dictionary. Investigators working with raw medical text have reached the same conclusion.

      Unfortunately, in anatomic pathology, even simple concepts, consisting entirely of correctly spelled words, have no canonical form. For example, even such a simple idea as "adenocarcinoma of colon metastatic to lung" has no consistent form of expression. The distinct medical terms, of course, all have a unique spelling and meaning; but the following expressions (and many others, easy to imagine) are all equivalent in a surgical pathology report:
 Colon adenocarcinoma metastatic to lung 
 Colonic adenocarcinoma metastatic to lung 
 Large bowel adenocarcinoma metastatic to lung  
 Large intestine adenocarcinoma metastatic to lung 
 Large intestinal adenocarcinoma metastatic to lung  
 Colon's adenocarcinoma metastatic to lung
 Adenocarcinoma of colon with metastasis to lung 
 Adenocarcinoma of colon with lung metastasis
 Adenocarcinoma of colon with pulmonary metastasis 
      Of course, some of these forms would likely never be encountered in either a surgical pathology report or a query (e.g., colon's adenocarcinoma metastatic to lung), but both the surgical pathology encoder and the query engine should nonetheless be able to cope with oddball sentences consisting of all correctly-spelled words, at least to the point of suggesting to the user what he/she is really asking for. If there is no canonical form for equivalent ideas in biomedicine, then how are indexes and statistical tables constructed, when these query methods depend upon equivalent concepts being assembled together? (Masarie et al, 1991)



20. MEDICAL ONTOLOGIES.



      1. An ONTOLOGY is a (Platonic) description of essential reality, i.e., what actually is, as opposed to what one can see (observation, accident), or what one can know (epistemiology) (12). The term ontology was coined by two German philosophers, Göckel and Lorhard, in 1613, and first appeared in English in 1721. Quine (13) views ontology as the metaphysical commitments or presuppositions embodied in the different natural sciences. For example, the belief that a cancer can metastasize would be an ONTOLOGICAL COMMITMENT. In the philosophy and practice of science, ontology goes under various names: essence, reality, Mind of God, nature, gold standard, or mathiverse (14). In medical informatics, ontology has come to mean a structured list of concepts, typically prepared by an expert or panel of experts.

      2. With the ease of posting structured lists on the Internet, and with EXTENDED MARKUP LANGUAGE (XML) (38,39) as an emerging standard for such lists, it is likely that the next decade will witness an explosion of public medical ontologies, both amateur and professional.

      3. The importance of ontologies has been recognized by the U. S. Defense Advanced Research Projects Agency (DARPA), the original sponsor of the Internet, which has proposed guidelines for a formal ontology AGENT MARKUP LANGUAGE, that employs the ONTOLOGY INFERENCE LAYER (15,16).

      4. A simple ontology is illustrated by the observation at autopsy that CHRONIC-PASSIVE-CONGESTION-LIVER (CPCL = C0700148-C0721399 in UMLS codes) (40,41). often accompanies HEART HYPERTROPHY (HH = C0795691-C0333959). In an approximate sense, HH causes CPCL (42). Thus, one might expect a hypothetical collection, say, of 10,000 cases to distribute as follows:
              HEART HYPERTROPHY (HH=C0795691-C0333959)
 ____________________________________________
|            |  NOT-HH |     HH   |  Total  |
|___________________________________________|
|  NOT-CPCL  |   7,000 |    1,000 |  8,000  |
|___________________________________________|
|    CPCL    |       0 |    2,000 |  2,000  |
|___________________________________________|
|   Total    |   7,000 |    3,000 | 10,000  |
|___________________________________________|
 CHRONIC-PASSIVE
 -CONGESTION-LIVER
 (CPCL=C0700148-C0721399).


      5. That is, most cases are negative for both features; HH anticipates CPCL in some cases; but there should be only rare cases with CPCL but without HH. Therefore, in the language of first-order propositional logic, CPCL IMPLIES HH, or NOT-CPCL IOR HH.

      6. Such correlations (2x2 CONTINGENCY TABLES), could be edited for redundancy and nonsense correlations (43).

      7. As necessary, a collection of such 2x2 contingency tables could be ORDERED BY IMPORTANCE, based upon the frequencies of cases appearing in the lower right corner of the table.

      8. Medical ontologies can be used to break into de-identified, public databases BY INFERENCE, which are otherwise well-protected. It will become the obligation of the data-holder to anticipate any inferential break-ins, and pre-empt them, by removing vulnerable data (5,44).



21. MEDICAL ETHICS.



      Two principles of medical ethics that have come down to us from Hippocrates are:
(1) First, do no harm.
(2) Confidentiality of patient medical information.




22. DISCUSSION.



      One of the burdens of our profession (compared, say, to presidents and rock stars) is that we must often explain what we do for a living to strangers, acquaintances, and, alas, sometimes to colleagues in distant specialties of medicine (ref, Mod Pathol, 1998). Some people think that we are Quincy (a television medical examiner, two decades ago) or Kay Scarpetta (a medical examiner in novels by Patricia Cornwell).



23. REFERENCES.





24. APPENDIX A. GLOSSARY.





25. APPENDIX B. INTRODUCTION TO NUMBER THEORY.


NEXT PAGE.
PREVIOUS PAGE.
RETURN TO TABLE OF CONTENTS.


      INTRODUCTION. Regrettably, one of the areas of mathematics that I studiously avoided as a graduate student was NUMBER THEORY, figuring that this was an area for only the purest of theoretical mathematicians, and could never have any value in applied areas like biology or medicine. I was absolutely wrong, but at least the depth of my folly unfolded only decades later.

      NUMBER THEORY is the mathematical study of the basic properties of numbers, predominantly whole numbers (1,2,3,....). Included in this field are such basic concepts as: prime numbers, modulo arithmetic, mathematical induction, combinatorics.......

      A PRIME NUMBER is any whole number greater than one that is divisible without remainder only by itself and one. The first few prime numbers are:
 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 57, 59, 61, 67, 71, 73, 79....


      The SIEVE OF ERATOSTHENES is a method for verifying that a whole number is indeed a prime number, attributed to the ancient Greek mathematician, ERATOSTHENES. Consider any whole number, n, and its square root, sqrt(n). Divide n by all the prime numbers less than sqrt(n).

      MATHEMATICAL INDUCTION is a method for demonstrating that a mathematical statement is true for all n. First, demonstrate that the statement is true for n=1; Then, demonstrate that if the statement is true for n=k, the statement must be true for n=k+1.

COMBINATORICS.

GREATEST COMMON DIVISOR.

      FERMAT'S LAST THEOREM. This is a famous theorem about DIOPHANTANE EQUATIONS, which are equations consisting entirely of whole numbers, originally explored by the ancient Greek mathematician, DIOPHANTUS (Singh, 2000). In a theorem proposed in 1610 by the amateur French mathematician, PIERRE DE FERMAT, and not proved until 1998 by SIR ANDREW WILES, it stated that the equation
xk + yk = zk.
has no whole-number solutions for k>2. (Fermat claimed to have a proof, but he died before publishing it.)

      FERMAT'S LITTLE THEOREM. Not nearly as famous as Fermat's Last Theorem, this is the assertion that....

QUADRATIC MODULO RESIDUES.

PRIME NUMBER GENERATORS.

EUCLID'S FACTORIZATION. Proof due to Euclid.

CHINESE REMAINDER THEOREM. Stated and proved by first-century Chinese mathematician, Sun-Tse.

INFINITY OF PRIME NUMBERS. Proof due to Euclid.

DENSITY OF PRIME NUMBERS.

      MODULO ARITHMETIC. The term x modulo n, or x mod n, denotes the (whole number) remainder of the division of x by n. Modulo arithmetic, or so-called 'clock arithmetic', is the mathematical method by which we determine, say, that five hours after ten o'clock, it is three o'clock.

      That is, the ordinary clock is a MODULO-12 DEVICE, and [(5+10) mod 12] equals 3. Similarly, the second-hand and minute-hand on the clock are modulo-60 devides, and the military clock is a modulo-24 device. Modulo arithmetic has the fantastic advantage that integer arithmetic can be performed on huge integers with absolute accuracy, without having intermediate calculations exceed a predetermined number of digits, namely, the digits in the square of the modulus. Modulo arithmetic is one of the pillars of modern cryptography.

INVERSE MODULO ARITHMETIC.

      DIFFICULTY OF FACTORING A PRODUCT OF LARGE PRIME NUMBERS. In asymmetric encryption, the public-key is the product of two, very large prime numbers, whereas the private-key is the factors themselves. This method is immune to larger, more powerful computers, because as soon as an attacker buys a faster computer to factor a prime product, the sender buys a comparably faster computer to create new, larger, prime products. I suppose that the method could lead to a sort of ARMS RACE among computer systems.

      The underlying principle of asymmetric encryption is the mathematical conjecture (i.e., unproved mathematical assertion) that it is an enormous computational task to factor a number which is the product of two large prime numbers. THIS CONJECTURE HAS NEVER BEEN PROVED MATHEMATICALLY, but likewise in the past two decades of serious investigation by the world's best cryptanalysts, nobody has successfully challenged this mathematical assertion. The world banking system, as well as many other high-tech security systems, depend upon this underlying principle of asymmetric encryption. IF IT IS WRONG, then the world's financial system would quiver, at least for a few days, until other security measures were put into place. If it is wrong, you can prove it, and you tell anybody else, you will probably be assassinated by powerful financial interests. This is the plot of a Heinlein science-fiction story from about a half-century ago. (In the story, it was a man who figured out how to beat the life-insurance system.)



26. APPENDIX C. INTRODUCTION TO SET THEORY.


NEXT PAGE.
PREVIOUS PAGE.
RETURN TO TABLE OF CONTENTS.


       1. MATHEMATICAL LOGIC has its roots in Aristotle and in the European Middle Ages, but assumed its modern form in the hands of George Boole (17), who is rightfully memorialized in the term BOOLEAN SEARCHES used for searching the computerized medical literature.

       2. The MATHEMATICAL THEORY OF SETS was invented in the late nineteenth century, based upon concepts from mathematical logic, as a means for addressing some pressing philosophical issues in mathematics, such as the meaning of limits in calculus, and different classes of infinity. It has been suggested that these mathematical disciplines have potential serious applications in medicine, but so far the published examples (including in this report) are fairly trivial.

       3. A MATHEMATICAL SET, S, is any collection of objects for which it can be said that a particular object, s, either BELONGS TO S, (s is a member of S), denoted s C S, or DOES NOT BELONG TO S, (s is not a member of S), denoted s ~C S. One cannot simply collect a set of just anything. In particular, the SET OF ALL SETS is a famous paradox, whose recognition by a young Bertrand Russell nearly demolished the celebrated magnum opus and career of Prof. Gottlob Frege. THERE CANNOT BE A SET OF ALL SETS.

       4. A set may be characterized either as a listing of its actual members within curly brackets, {}, so-called roster notation, or by the method of its creation. For example, {2,3,5,7,11,13,17,19} and the SET OF ALL PRIMES LESS THAN 20 are the same set.

       5. The most important set of all is the NULL SET or EMPTY SET, i.e., the set that contains no members, usually denoted Ø, or {}.

       6. There are two important things to remember about a mathematical set (32,33):
6a. A SET IS CHARACTERIZED entirely in terms of its members, i.e., EXTENSIONALLY. A set may NOT be uniquely characterized INTENSIONALLY, i.e., by the manner of its creation. If two sets are created differently, but end up with the same membership, then they are the same set. For example, the set of humans living on the moon in 1850 and the set of Chevrolets built during 1850 are the same set, namely, the null set, Ø.

6b. A SET IS DIFFERENT FROM WHAT IT CONTAINS. that is, s and {s} are different. This property of sets leads to the Russell-Frege paradox: does the set of all sets belong to itself or not? (Is Epimenides a liar or not? (see Appendix C)). The Russell-Frege paradox can be resolved by defining two types of sets: ordinary sets and classes. This double definition involves a lot of extra mathematical bookkeeping.
Smith (12) believes that this classical formulation for sets is fundamentally flawed for describing ontologies, and has proposed using an alternate formulation, known as MEREOLOGY.

       7. There are six commonly used concepts in set theory:
7a. Set membership, denoted C . We say that s is a member of S, denoted C S; or s is not a member of S, denoted ~C S;

7b. The empty set, Ø, the set containing no members. There exists no s such that s C Ø.

7c. Set Union, U. The set A U B is the set of all members that belong either to set A or to set B or to both. Set union is analogous to inclusive-or (IOR) in first order propositional logic.

7d. Set Intersection, /\. The set A /\ B is the set of all members that belong both to set A and to set B. Set intersection is analogous to logical AND in first order propositional logic.

7e. Set Subtraction, -. The set A - B is the set of all members that belong to set A but NOT to set B. Set intersection is analogous to logical NOT in first order propositional logic.

7f. Subset. A C B. We say that the set A C B, if the set of all members that belong to set A also belong to set B. Set intersection is analogous to IMPLIES in first order propositional logic.


       8. GÖDEL'S PROOF was published in 1931 (Gödel, 1931), but only mathematicians and philosophers had ever heard of him until Hofstadter's popular book covering his work appeared two decades ago (Hofstadter, 1979). Few persons outside of the computer science community realize that the computer's basic design of pointing instructions to numbers was Gödel's.

       9. Gödel was influenced by an amateur interest in linguistics, and by the Jewish Kabbalists, who associated letters and words in the Hebrew Bible with numerical values and spiritual relationships; and by the emerging doctrine of non-Euclidean geometry. Prof. David Hilbert had issued a challenge to his mathematical colleagues to design a consistent set of axioms from which all true statements of mathematics could be proved. Such an achievement would stake out an irreducible core of true principles in mathematics, which would not blow away with each passing fad in relativistic physics.

       10. The fundamental tool of mathematical reasoning is Aristotle's SYLLOGISM. For example:
(1) All men are mortal;
(2) Socrates is a man;
(3) Therefore, Socrates is mortal.
If one knows that assertions (1) and (2) are true, then one is entitled to INFER that assertion (3) is true. This stepwise derivation of additional true statements from known true statements is a MATHEMATICAL PROOF.

       11. Aristotle also proposed the paradox of Epimenides the Cretan, who asserted that all Cretans are liars (Aristotle, Nicomachaean Ethics). This so-called PARADOX OF SELF-REFERENCE has no truth-value, for if the assertion is true, then it is false; if the assertion is false, then it is true. There are many forms of this paradox, including: "this statement is false"; "the barber shaves everyone who doesn't shave himself"; and "the set of all sets" (FREGE-RUSSELL PARADOX, see Appendix B).

       12. Gödel demolished Hilbert's dream by proving that EVERY system of mathematics at least as rich as as ordinary arithmetic, geometry, or set theory, must necessarily contain true but unprovable statements. The method by which Gödel achieved this result was as stunning as the result itself. Gödel assigned a unique whole number to every grammatically well-formed statement in mathematics. He then constructed this true statement in his enumeration model: THIS STATEMENT IS UNPROVABLE.

       13. Gödel's place in the history of mathematics, science, and technology is secure. His ideas have influenced computer science, artificial intelligence, neural nets, and possible limits on human sentience and creativity. Prof. John von Neumann, an early supporter of Gödel's work, clearly had Gödel's enumeration model in mind when von Neumann designed the first modern computer in the 1940s. The initial philosophical pessimism over the impossibility of establishing a complete and consistent mathematical system has matured: the reverse of the argument is that there will always be future work for creative mathematicians.



Last Updated: September 23, 2001, by G. William Moore, MD, PhD.