ANATOMIC PATHOLOGY
NATURAL LANGUAGE PROCESSING.
DRAFT COPY ONLY.
2/9/2006.

G. William Moore, MD, PhD.

Departments of Pathology,
Baltimore Veterans Affairs Medical Center,
University of Maryland Medical System,
The Johns Hopkins Medical Institutions.

http://www.netautopsy.org/natlngpr.htm
http://www.netautopsy.org/natlngpr.ppt


Presented at: Preclinical Teaching Building 206B, December 6, 2005, 9:00-10:30 AM, for the course: Data, Information, and Knowledge (ME 600.701), Division of Health Science Informatics, The Johns Hopkins Medical Institutions, Baltimore, MD 21287.

Send comments and correspondence to: George.Moore4@va.gov
See also:
http://www.netautopsy.org/gwmcv.htm
http://www.netautopsy.org/vhpsapsx.htm
http://www.netautopsy.org/apdmchap.htm
http://www.netautopsy.org/jharzipf.htm

United States Government Work, uncopyrighted, public-domain, DRAFT COPY ONLY. This document does not necessarily represent the views or policies of any United States Government agency. This document is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of, or in connection with the document or the use or other dealings made with the document.

CHAPTER 0. TABLE OF CONTENTS.

Chapter 1. Introduction.
Chapter 2. Linguistic Science.
Chapter 3. Rule-based Systems.
Chapter 4. Generative Linguistics.
Chapter 5. Artificial Intelligence.
Chapter 6. Basic Concepts of Linguistics.
Chapter 7. Competence Grammar.
Chapter 8. Ambiguity of Language.
Chapter 9. Corpus Linguistics. Introduction.
Chapter 10. Zipf's Laws.
Chapter 11. Collocations.
Chapter 12. Concordances.
Chapter 13. Mathematical Foundations.
Chapter 14. Statistics.
Chapter 15. General Linguistics.
Chapter 16. Phrase Structure Grammar.
Chapter 17. Context-Free Grammar.
Chapter 18. Dependency Grammar.
Chapter 19. Corpus Linguistics: Sources.
Chapter 20. Words and Phrases.
Chapter 21. Syntax.
Chapter 22. JHAR/JHSP Corpus.
Chapter 23. Statistical Inventory.
Chapter 24. Future of NLP in medicine.
Chapter 25. NLP Problems in Anatomic Pathology.
Chapter 26. References.
Chapter 27. Mini-histories.
Chapter 28. Glossary.

SLIDES FOR PRESENTATION


Chapter 1. Introduction.

1.1. Why NLP in medicine?
1.1.1. Copious computerized natural language information (terabytes annually).
1.1.2. Storage and organization of information is chaotic.
1.1.3. Standards are too loose to be useful (HL7).
1.1.4. Anatomic pathologists use free-text language precisely, because they are consultants with no direct patient contact.
1.1.5. Synoptic diagnoses: for billing and regulatory purposes.
1.1.6. NLP vs synoptic: the war is on.

Variant forms, same diagnosis.


Colon adenocarcinoma metastatic to lung.
Colonic adenocarcinoma metastatic to lung.
Large bowel adenocarcinoma metastatic to lung.
Large intestine adenocarcinoma metastatic to lung.
Large intestinal adenocarcinoma metastatic to lung.
Colon's adenocarcinoma metastatic to lung.
Adenocarcinoma of colon with metastasis to lung.
Adenocarcinoma of colon with lung metastasis.
Adenocarcinoma of colon with pulmonary metastasis.

Colon adenocarcinoma, metastatic to lung.
Colonic adenocarcinoma, metastatic to lung.
Large bowel adenocarcinoma, metastatic to lung.
Large intestine adenocarcinoma, metastatic to lung.
Large intestinal adenocarcinoma, metastatic to lung.
Colon's adenocarcinoma, metastatic to lung.
Adenocarcinoma of colon, with metastasis to lung.
Adenocarcinoma of colon, with lung metastasis.
Adenocarcinoma of colon, with pulmonary metastasis.
IT IS UNREASONABLE TO DEMAND...


...that a busy physician navigate through a hierarchy of pick-lists in order to write his/her report, as long as the report is:
1. Spelled correctly;
2. Grammatically correct;
3. Complete; and
4. Unambiguous.

HOWEVER, WHAT ABOUT A REPORT LIKE THIS?
UNDERSTANDABLE, YET NUMEROUS MISSPELLINGS.

http://www.mrc-cbu.cam.ac.uk/~mattd/Cmabrigde/
This is really weird. Can you raed tihs? Olny srmat poelpe can. I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. This is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? yaeh and I awlyas tghuhot slpeling was ipmorantt! if you can raed tihs psas it on !!
Physicians can read this, but not computers. Thanks to Drs. James DeLeo and Larry Brown and to Mrs. Liz Dunbar for showing this to me.
Basics of Natural Language Processing.


1.2. Questions addressed by NLP:
1.2.1. What you say: syntax.
1.2.2. What it means: semantics.
1.3. Scope of medical NLP: in principle, all medical texts, especially text involved in billing or quality assurance.
1.4. Zipf's Law: f ∝ 1/r, for f=word frequency and r=word rank.
1.5. Anecdotal NLP: early scientific linguistic literature.
1.6. Statistical NLP: reach for the low-hanging fruit.
1.7. Medical NLP: All significant fruit should hang low.

Reach for the low-hanging fruit.




Tantalus. Hans Holbein the Younger [1497-1543]. U. S. National Gallery of Art, Washington, DC, USA.

Reach for the low-hanging fruit.




Adam und Eva. Vertreibung aus dem Paradies. Der Sündenfall. Lukas Cranach the Elder [1472-1553]. Kunsthistorisches Museum, Wien, Österreich.

Chapter 2. Linguistic Science.

2.1. Characterize and explain linguistic observations.
2.1.1. Conversation.
2.1.2. Writing.
2.1.3. Childhood development.
2.2. NLP in medicine.
2.2.1. Medical dictations (speech recognition).
2.2.2. Medical handwriting (ugh!).
2.2.3. Paper printed texts, scanned into computer.
2.2.4. Electronic medical record: Veterans Affairs Computerized Medical Record System (VA-CPRS).

Veterans Affairs
Computerized Medical Record System
(CPRS).



Veterans Affairs
Enterprise Reference Terminology.
The Veterans Health Administration (VHA) branch of the Department of Veterans Affairs, arguably the largest integrated healthcare provider in the United States, has computerized virtually all clinical transactions, including physician orders and documentation. VHA has undertaken an Enterprise Reference Terminology (ERT), designed to provide a terminology development environment, terminology services, and maintenance services for the clinical and business content in the Health Data Repository (HDR) and other VHA applications. The goal is for the ERT to encompass all HDR domains by 2008.
How did the VA enforce compliance?

1. The VA is a U. S. military organization, with top-down management.
2. There is a U. S. federal mandate for record exchangeability among VA hospitals and clinics nationwide.
3. Implementation by January 1, 2001.
4. No exceptions, no discussion.
5. Employees: timid federal bureaucrats.

Tower of Babel.




Tower of Babel. Pieter Brueghel [1520-1569]. Museum Boymans-van Beuningen, Rotterdam, The Netherlands.

Ancient Alphabets.


Phoenician alphabet (1000 BC):


Hebrew alphabet:
א ב ג ד ה ו ז ח ט י כ ל מ נ ס ע פ צ ק ר ש ת


Greek alphabet:
Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω


Roman alphabet:
A B C D E F G H I K L M N O P Q R S T V X Y Z


Chapter 3. Rule-based systems.
3.1. Grammars in ancient civilizations: translation.
3.1.1. Ancient Phoenician/Hebrew: Tower of Babel, Pentecost.
3.1.2. Ancient Greco-Roman.
3.1.2.1. Everyone WANTED to learn Latin and abandon their local tongue.
3.1.2.2. Exception: Masada.

3.1.3. Ancient China: Qin Shi-Huang (260-210 BC):
3.1.3.1. Everyone REQUIRED to adopt the imperial ideograms, or else.
3.1.3.2. Execution of 460 scholars. (The Ten Crimes of Qin.)

Qin Shi-Huang [260-210 BC].
First Emperor of China.




Aristotle [384-322 BC].




3.2. Aristotelian Logic.

3.2.1. All Greeks are mortal; Socrates is a Greek; ....
3.2.2. Flaws in Greek logic: no inclusive-or, no empty-set ("zero").
3.2.3. Formalized in Boolean logic: inclusive-or; algebraic expressions.
3.2.4. Application: Boolean searches in PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
3.2.5. Disadvantage: all-or-none reasoning: no room for probability.
3.2.6. No tense, no real subjunctive.
3.2.7. No tolerance of inconsistency: ex falso quodlibet.
3.2.8. No logics of belief (doxastic), reality (ontological), moral obligation (deontic), necessity and possibility (alethic), intentionality, or other variants.
3.2.9. Fuzzy logic.
3.2.10. Paraconsistency, modal logic.
George Boole [1815-1864].




Col. John Shaw Billings, MD. [1838-1913].



U. S. Civil War Surgeon.
Creator of Index Medicus.
Father of PubMed.

Approach to Mathematical Grammars.


3.3. "All grammars leak" (Sapir, 1921).
3.3.1. Retreat in despair.
3.3.2. Characterize statistical properties.
3.3.3. Common patterns in language use.

3.4. Early scientific foundations.

3.4.1. Rationalist approach (1960s-1970s).
3.4.2. Chomsky: innate language facility.
3.4.3. Poverty of stimulus (does not explain language competence).
3.4.4. Assume, characterize the rule-base.
3.4.5. Generative grammars.

Prof. Noam Chomsky [1928-].




Father of modern computational linguistics.

3.5. Empirical Approach.

3.5.1. Active in 1980s.
3.5.2. Assume basic cognitive ability; deny tabula rasa.
3.5.3. Baby has general operations:

3.5.3.1. Association.
3.5.3.2. Pattern recognition.
3.5.3.3. Generalization.

3.5.4. Rich sensory input.
3.5.5. Deduce parameters for general model.
3.5.6. Corpus Linguistics.
3.5.7. American Structuralists.

3.5.7.1. Harris (1950s): know-nothing computer program.
3.5.7.2. Harris's student, Sager (1980s): NYU Linguistic String Project.


Chapter 4. Generative Linguistics.

4.1. Chomsky: describe the innate language (I-language).
4.2. Indirect evidence: text: E-language.
4.3. Linguistic competence: "competence grammar". Property of the speaker.
4.4. Linguistic performance: memory lapses, distractions, etc.
4.5. Medicine:
4.5.1. linguistic competence in the medical writer.
4.5.2. models of medical reports: ADASP in pathology.
4.5.3. performance: slow typist, time constraints, idiosyncratic abbrs.
4.5.4. JCAHO: forbidden abbreviations.


Chapter 5. Artificial Intelligence.

5.1. Build small systems that behave intelligently.
5.2. Criticized as "toy problems".
5.3. Language engineering: devoid of general principles.
5.4. Prof. Marvin Minsky [1927-], Father of Artificial Intelligence.

Prof. Marvin Minsky [1927-].




Father of artificial intelligence.

Chapter 6. Basic Concepts.

6.1. Fundamental Questions.
6.1.1. What do people say/write?
6.1.2. What do these utterances/writings say about the world?

6.2. Grammaticality.
6.3. Conventionality.
6.4. Ambiguity.
6.5. Corpus Sources. Brown corpus, Gutenberg, OMIM, JHAR, JHSP.
6.6. Zipf's Laws.
6.7. Collocations.
6.8. Concordances.

Chapter 7. Competence Grammar.

7.1. Property of the rational speaker.
7.2. Grammaticality alone admits weird sentences: "Colorless green ideas sleep furiously."
7.3. Conventionality: the usual expression, even when others are possible or more sensible (e.g.: how do you do?)
7.4. Conventionality in medicine:
7.4.1. Malignant melanoma => melanosarcoma.
7.4.2. Hepatoma => hepatocellular carcinoma.
7.4.3. Hypernephroma => renal cell carcinoma.
7.4.4. Nutmeg liver => chronic passive congestion of liver.
7.4.5. Caseous necrosis => necrotic granuloma.

Statistical conventionality in medicine.
Conventional meanings for conventional phrases.

Chapter 8. Ambiguity of Language.

8.1. Our department|NP      is|H      training pathologists|VP.
8.2. Our department|NP      is|V      training pathologists|VP.
8.3. Our department|NP      is|V      training pathologists|NP.

Chapter 9. Corpus Linguistics.

9.1. Text-corpora: Brown corpus. One million words, tagged, representative of American English.
9.2. Text-corpora: Project Gutenberg. 17,000 uncopyrighted literary texts (Tom Sawyer, etc.)
9.3. Text-corpora: OMIM: Comprehensive list of medical conditions.
9.4. Word frequencies.
9.5. Zipf's First Law.

Chapter 10. Zipf's Laws.

10.1. Zipf's First Law.

10.1.1. f ∝ 1/r:
f = word-frequency,
r = word-frequency rank,
m = number of meanings per word.

10.1.2. There exists a k such that f × r = k.
10.1.3. Alternatively, log f = log k - log r.
10.1.4. English literature, Johns Hopkins Autopsy Resource, German, and Chinese.
10.2. Zipf's Second Law.
10.2.1. m ∝ √f
10.2.2. There exists a k such that k × f = m².
10.2.3. Corollary: m ∝ 1/√r
10.2.4. Highly dependent upon what qualifies a "different meaning" for a word.
10.3. Zipf's Third Law.

10.3.1. f ∝ 1/wordlength:
10.3.2. There exists a k such that f × wordlength = k.

10.3.3. Highly dependent upon word-division conventions for a language. For example, "mitral valve stenosis" (English, 3 words) is Mitralklappenstenose (1 word) in German and 6 words in Japanese.

10.3.4. German, Turkish, and Finnish are highly agglutinative (compounding) languages. Examples: Donaudampfschifffahrtsgesellschaftskapitän (German: Danube steam shipping line company captain); Avrupalilastirilamiyanlardansiniz (Turkish: you are one of those who cannot be Europeanized).

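10.4. Mandelbrot (fractal guy): "...bien que la formule de Zipf donne l'allure générale des courbes, elle en représente très mal les détails...." ("...although Zipf's formula gives the general shape of the curves, it represents their details very poorly....")
10.4.1. f = P(r + ρ)^(-B), where P, B, and ρ are parameters.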
10.5. Similar observations made by Baudot (1870), Pareto (1896), Estoup (1916), and Condon (1928).
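
The rank-frequency tabulation behind Zipf's First Law (10.1) can be sketched in a few lines of Python. This is only an illustration: the toy text and the function name zipf_table are assumptions, not the JHAR data or the author's program. On a real corpus of many thousands of words, the product f × r would be expected to be only roughly constant.

```python
from collections import Counter
import re

def zipf_table(text):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Hypothetical toy text; far too small to show the law convincingly.
sample = ("adenocarcinoma of colon with metastasis to lung . "
          "adenocarcinoma of lung . carcinoma of colon . "
          "adenocarcinoma of colon with lung metastasis .")

for rank, word, freq, product in zipf_table(sample):
    print(f"{rank:4d} {freq:6d} {product:6d}   {word}")
```

The last column is the product f × r from 10.1.2; plotting log f against log r (10.1.3) is the usual visual check.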

Mandelbrot's Formula.


1. f = P(r + ρ)^(-B).
2. With six parameters, you can draw an elephant.
3. With seven parameters, you can wag its tail.

The Johns Hopkins Autopsy Resource.




Zipf's First Law:
50,000 JHH Autopsy Facesheets.




Zipf's First Law:
50,000 JHH Autopsy Facesheets.




Zipf Distribution:
50,000 JHH Autopsy Facesheets.




Zipf's Law:
Chinese.




This Chinese ideogram is variously translated as of, which, or the adjectival endings -ic or -ical. It is pronounced, de. The word is a loan-word from English! Supposedly, the word it corresponds to is the English suffix, -tic, as in sclerotic, nephritic, and fibrotic, etc.

Zipf's Law as a signature.


Ten professors of medicine at Goethe University Medical School, Frankfurt, Germany.
Zipf distribution of their computerized medical workups.
Medical students could identify the professor by his/her Zipf distribution.

Kenner's Corollary:
Zipf's Law of Messy Desks.


See: Kenner (2003).

Chapter 11. Collocations.


11.1. Multiple word sequence, perceived to have an existence beyond sum of parts.
11.2. Example: Johns Hopkins Surgical Pathology (JHSP).
11.3. Reuse of phrases: cliches, not requiring Chomskyan high-level competence.
11.4. Barrier word method vs frequency distribution filter.
11.5. Barrier words from JHSP:
RANK    FREQUENCY   BARRIER WORD
   1      222,175   and
   2      196,153   of
   3      189,799   with
   4      107,039   for
   5      104,067   the
   6       82,104   note
   7       80,740   in
   8       78,549   right
   9       77,885   left
  10       70,923   is
  11       70,261   see
  12       67,917   are
  13       53,071   mild
  14       49,987   identified
  15       47,804   to
  16       41,467   consistent
  17       39,792   this
  18       30,352   present
  19       27,189   seen
  20       25,371   at
  21       25,097   there
  22       24,657   on
  23       24,284   or
  24       23,021   be
  25       21,243   associated
  26       19,515   was
  27       18,376   one
  28       16,122   but
  29       16,057   case
  30       16,057   from

11.6. Example of barrier word filter:
TERMINAL ILEUM , CECUM , APPENDIX and COLON ( RIGHT HEMICOLECTOMY ) ; MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement of PERINEURAL SPACES . TUBULOVILLOUS ADENOMA and associated VASCULAR MALFORMATION in the TRANSVERSE COLON ; TUBULAR ADENOMA in the DESCENDING COLON . recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS and INFLAMED GRANULATION TISSUE in the SEROSA . multiple ADHESIONS and SEROSAL ABSCESSES with GRANULATION TISSUE , FOREIGN BODY GIANT CELLS , SCARRING , focal OSSIFICATION , and FAT NECROSIS . ISCHEMIC BOWEL DISEASE diffusely involving ILEAL MUCOSA , with focal TRANSMURAL NECROSIS and ACUTE INFLAMMATION .
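
A minimal Python sketch of the barrier word method, under stated assumptions: the text is split at barrier words and punctuation, and the word runs that remain are candidate phrases. The barrier set below is a subset of the table in 11.5, plus a few words (into, through, extension, involvement) that are assumed here so that the sample sentence splits as in the example above. The function name candidate_phrases is hypothetical.

```python
import re

# Subset of the barrier words in 11.5, plus a few assumed additions
# (into, through, extension, involvement) needed for the sample sentence.
BARRIER_WORDS = {
    "and", "of", "with", "for", "the", "in", "is", "are", "to", "at", "on",
    "or", "associated", "into", "through", "extension", "involvement",
}

def candidate_phrases(text, barriers=BARRIER_WORDS):
    """Split text at barrier words and punctuation; return the remaining word runs."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text.lower())
    phrases, current = [], []
    for token in tokens:
        if token.isalpha() and token not in barriers:
            current.append(token)
        else:
            # A barrier word or punctuation mark closes the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

report = ("Moderately differentiated colonic adenocarcinoma, with extension "
          "through muscularis propria into pericolic soft tissue, and with "
          "involvement of perineural spaces.")
print(candidate_phrases(report))
```

Counting how often each returned phrase recurs across a corpus gives a collocation inventory of the kind discussed in this chapter.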

BARRIER WORD FILTER, GERMAN:
CHRONISCHE SKLEROSIERENDE PANCREATITIS , vorwiegend im bereich des CAPUT PANCREATIS . PSEUDOCYSTE des PANCREASKOPFES . multiple KALKSPRITZERARTIGE FETTGEWEBSNEKROSEN des CAPUT PANCREATIS und des CORPUS PANCREATIS . CHRONISCHSKLEROSIERENDE EXTRAHEPATISCHE CHOLANGITIS . FETTLEBER . PARIETALTHROMBOSE der PFORTADER . SUBCAPSULAERER ABSZESS des rechten LEBERLAPPENS mit ANAEMISCHER NEKROSE der nachbarschaft . FLAECHENHAFTE PERITONEALVERWACHSUNGEN der LEBEROBERFLAECHE. STAUUNGSMILZ. FLAECHENHAFTE PERITONEALVERWACHSUNGEN der MILZKAPSEL. zustand nach nicht ganz frischer LAPAROTOMIE im bereich des rechten OBERBAUCHES mit QUERVERLAUFENDER ABDOMINALNAHT und anlage einer DRAINAGE der BURSA OMENTALIS . DILATATION des rechten HERZVENTRIKELS . schwere STENOSIERENDE CORONARARTERIENSKLEROSE . multiple PETECHIEN der HERZHINTERWAND , vorwiegend im bereich beider VORHOEFE . beginnende GALLERTATROPHIE des SUBEPICARDIALEN FETTGEWEBES . flaechenhafte PLEURAVERWACHSUNGEN beiderseits . PLEURASPITZENSCHWIELEN beiderseits . schweres LUNGENOEDEM . akute BLUTSTAUUNG der LUNGEN . INTIMALIPOIDOSE der PULMONALARTERIEN . ATELEKTASEN BASALER und PARAVERTEBRALER LUNGENABSCHNITTE . NARBENCARCINOM der spitze des linken LUNGENUNTERLAPPENS . mittelgradige allgemeine ARTERIOSKLEROSE . OEDEM der ARYEPIGLOTTISCHEN FALTEN . mehrere NEKROSEN der SCHLEIMHAUT von EPIGLOTTIS , LARYNX , und TRACHEA . GASTROMALACIA ACIDA . ULCUSNARBE des ANTRUM VENTRICULI. frische HAEMORRHAGISCHE MAGENSCHLEIMHAUTEROSION des ANTRUM VENTRICULI . teils BLUTIGER , teils HAEMATINHALTIGER DUENNDARMINHALT . TRUEBE SCHWELLUNG der NIEREN . RESTE RENCULAERER LAPPUNG . sogenannte KALKINFARKTE der NIERENPAPILLEN . FLECKFOERMIGE HARNBLASENSCHLEIMHAUTBLUTUNGEN . LIPOIDVERARMUNG der NEBENNIERENRINDE . fortgeschrittene AUTOLYSE .


Chapter 12. Concordances.

12.1. Bible concordances.
 is no balm in Gilead; [is there] no   physician there? why then is not the health of th
 them, They that be whole need not a   physician, but they that are sick. 13 But go ye a
  that are whole have no need of the   physician, but they that are sick: I came not to 
 ll surely say unto me this proverb,   Physician, heal thyself: whatsoever we have heard
 hem, They that are whole need not a   physician; but they that are sick. 32 I came not 
 in Hierapolis. 14 Luke, the beloved   physician, and Demas, greet you. 15 Salute the br

12.2. Keyword in Context (KWIC).
 epidermoid carcinoma , uterine cervix extending to   fundus , adnexa , bladder , rectum , and pelvic
 r . diverticula colon . surgical absence , uterine   fundus , and appendix . peritoneal adhesions . 
 iae . external cardiac massage . petechiae gastric   fundus .                                       
  cell nuclei . capillary microaneurysms left optic   fundus . history of traumatic lumbar puncture .
 eral renal pelves and trachea . surgical absence ,   fundus and corpus uteri , and subtotal absence 
  and intact healed end to side anastomosis between   fundus of stomach and proximal jejunum . hyperp
 ial necrosis aorta . surgical absence , body , and   fundus of uterus , appendix , and left sixth ri
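
A keyword-in-context listing like the one above can be sketched as follows. This is an illustrative assumption, not the program used for the JHAR listing: the function name kwic, the context width, and the sample text are hypothetical, and the text is assumed to be already tokenized with spaces around punctuation.

```python
def kwic(text, keyword, width=30):
    """Return one fixed-width context line per occurrence of keyword."""
    tokens = text.lower().split()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Keep at most `width` characters of context on each side.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}   {token}   {right}")
    # Sort by the text that follows the keyword, as in the listing above.
    return sorted(lines, key=lambda line: line[width + 3:])

# Hypothetical sample built from phrases in the listing above.
diagnoses = ("petechiae gastric fundus . surgical absence , fundus and corpus "
             "uteri . anastomosis between fundus of stomach and proximal jejunum .")

for line in kwic(diagnoses, "fundus"):
    print(line)
```

Right-justifying the left context keeps the keyword in a fixed column, which is what makes the different senses (gastric, uterine, ocular) easy to scan.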


Chapter 13. Mathematical Foundations.

13.1. Probability Theory. Sample space, S; event A ⊆ S; field F is the set of all events A ⊆ S. Probability, P(A), defined for every event A, such that 0 ≤ P(A) ≤ 1.
13.2. Axioms of probability.
13.2.1. P(S)=1;
13.2.2. P(Ø)=0; and
13.2.3. P(∪i Ai) = ∑i P(Ai) if (Ai ∩ Aj) = Ø for every i≠j.
13.3. Uniform distribution. All events are equally likely.
13.4. Conditional Probability. For events A, B, the conditional probability, P(A|B) of event A given event B is defined as: P(A|B) = P(A∩B)/P(B).
13.5. Probabilistic Independence. Events A and B are independent if P(A∩B) = P(A) × P(B).
13.6. Bayes' Law. The conditional probability, P(B|A), of event B given event A is: P(B|A) = (P(A|B)×P(B))/P(A). Bayes' Law is used when P(A|B) is relatively easy to calculate, but the more difficult P(B|A) is desired.
13.7. Random variable. Function X : S -> R, that maps a probability event space into the real line, R.
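
A small worked example of Bayes' Law, as a sketch only. Here B is "patient has the disease" and A is "test is positive"; the three input probabilities are hypothetical numbers chosen for illustration, and the function name bayes_posterior is an assumption. P(A) is obtained from the law of total probability, P(A) = P(A|B)×P(B) + P(A|not B)×P(not B).

```python
def bayes_posterior(p_a_given_b, p_b, p_a_given_not_b):
    """Return P(B|A) from P(A|B), P(B), and P(A|not B)."""
    # Law of total probability: P(A) summed over B and not-B.
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return p_a_given_b * p_b / p_a

# Hypothetical values: 1% prevalence, 90% sensitivity, 5% false-positive rate.
posterior = bayes_posterior(p_a_given_b=0.90, p_b=0.01, p_a_given_not_b=0.05)
print(f"P(disease | positive test) = {posterior:.3f}")
```

With these assumed inputs the posterior is far lower than the 90% sensitivity, because the disease is rare: the easy-to-measure P(A|B) and the desired P(B|A) can differ greatly.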

Chapter 14. Statistics.

14.1. Estimation: The average, or expected value, of a probabilistic process.
14.2. Hypothesis Testing: If the NULL HYPOTHESIS is true, then an experimental outcome at least as extreme as the one observed occurs with probability p; the null hypothesis is rejected when p falls below a chosen threshold, e.g., p < 0.05.
14.3. Expected Value: E(X) = ∑ x P(X=x).
14.4. Variance: Var(X) = E[(X - E(X))²].
14.5. Binomial distribution:
14.5.1. The fair coin toss (p=0.5); unfair coin toss (p≠0.5).
14.5.2. r successes, n trials: B(r,n,p) = (n!/((n-r)!r!)) × p^r × (1-p)^(n-r).

14.6. Normal (Gaussian) distribution:
14.6.1. N(x,μ,σ) = (1/(σ√(2π))) × e^(-0.5((x-μ)/σ)²)
14.6.2. Limit of binomial distribution for large n, and probability "close" to 1/2.
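
The relation between the two distributions can be checked numerically. The sketch below uses the standard binomial probability, (n!/((n-r)!r!)) × p^r × (1-p)^(n-r), and the standard normal density with mean μ = np and standard deviation σ = √(np(1-p)). The function names and the choice n = 100 are assumptions for illustration.

```python
import math

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def normal_pdf(x, mu, sigma):
    """Density of the normal (Gaussian) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Fair coin, 100 tosses: compare the exact binomial with its normal approximation.
n, p = 100, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
for r in (40, 45, 50, 55, 60):
    print(f"r={r:3d}  binomial={binomial_pmf(r, n, p):.5f}  "
          f"normal={normal_pdf(r, mu, sigma):.5f}")
```

For p far from 1/2 or for small n, the two columns drift apart, which is the sense of "close" in 14.6.2.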


Chapter 15. General Linguistics.

15.1. Parts of Speech, morphology.
15.2. Nouns, pronouns, cases, declensions.
15.3. Proper nouns: Dr. Smith, Ms. Barrett. Wilms. Graves'.
15.4. Adverbial nouns: home, west, tomorrow.
15.5. Determiners, adjectives.
15.6. Verbs: tenses, person.
15.7. Conjunction, complementizers.
15.8. Phrase Structure Grammar.
15.9. Context Free Grammar.
15.10. Generative Grammar.

Chapter 16. Phrase Structure Grammar.

16.1. All grammar can be reduced to a sequence of phrases.
16.2. Noun phrase.
16.3. Prepositional phrase.
16.4. Verb phrase.
16.5. Adjectival phrase.
16.6. Phrase Structure Grammar.
16.6.1. free word order (Latin, Russian).
16.6.2. dependency grammar (English).
16.7. Rewrite rules.
16.8. Backus-Naur form.

16.8.1. [] ==> [Nφ]
16.8.2. [Nφ] ==> [N]
16.8.3. [Nφ] ==> [AN]
16.8.4. [Nφ] ==> [NPN]
where:
[]=null-sentence.
Nφ=noun-phrase.
N=noun.
P=preposition.
A=adjective
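
The four rewrite rules in 16.8 can be written down as a small table and expanded mechanically. The sketch below is an illustration under assumptions: the dictionary layout, the function names, and the tiny lexicon are hypothetical, and only the four rules listed above are encoded (none of them is recursive, so the expansion always terminates).

```python
# The rewrite rules of 16.8.1-16.8.4, keyed by the left-hand side.
RULES = {
    "[]": ["[Nφ]"],
    "[Nφ]": ["[N]", "[AN]", "[NPN]"],
}

def derive(symbol="[]"):
    """Expand a symbol until only terminal tag patterns remain."""
    if symbol not in RULES:
        return [symbol]
    patterns = []
    for right_hand_side in RULES[symbol]:
        patterns.extend(derive(right_hand_side))
    return patterns

# Hypothetical lexicon: N=noun, P=preposition, A=adjective.
LEXICON = {"adenocarcinoma": "N", "colon": "N", "lung": "N",
           "of": "P", "colonic": "A"}

def tag_phrase(phrase):
    """Map each word to its part-of-speech letter, e.g. 'NPN'."""
    return "".join(LEXICON[word] for word in phrase.lower().split())

def is_accepted(phrase):
    """True if the phrase's tag pattern is derivable from the null-sentence []."""
    return f"[{tag_phrase(phrase)}]" in derive()

print(derive())
print(is_accepted("adenocarcinoma of colon"))
print(is_accepted("colonic adenocarcinoma"))
```

A phrase such as "of colon" (pattern [PN]) is rejected, because no listed rule produces it.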


Chapter 17. Context Free Grammar.

17.1. Rewrite rules depend solely upon internal structure. Examples of non-context-free grammars. Is peels a transitive or intransitive verb?
Grandma peels: potatoes?
scrofulitic?
ecdysiastic?
Medical:
Foot
Foot of hippocampus.
Fundus:
uterine.
ocular.
gastric.
German (verb separable-prefixes):
Hör mal! → Listen up!
Hör mal auf! → Desist!

17.2. Surrounding context is irrelevant.
17.3. Recursive phrase structure expressions.
17.4. Used in high-level computer languages, compilers, interpreters.

Chapter 18. Dependency Grammar.

18.1. Definition: dependency between words (arrows):
 The old man ate the rice slowly.
                       ______________
                       |            | 
                       ↓             | 
 The → old man --→  ate    the → rice       slowly.
  |__________↑         ↑                            |
                       |_________________________|
18.2. Arguments: noun phrases, e.g., the old man; verb phrases, e.g., ate the rice.

18.3. Adjuncts: adverbs, e.g., slowly

18.4. Useful for disambiguating multiple noun phrases:
red carpet movers.
nevus
blue nevus
cellular blue nevus

Chapter 19. Corpus Linguistics.

19.1. Corpus sources.

http://www.netautopsy.org Johns Hopkins Autopsy Resource.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM Online Mendelian Inheritance in Man.
http://www.gutenberg.org Project Gutenberg.
http://www.ldc.upenn.edu Linguistic Data Consortium.
http://www.elra.info European Language Resources Association.
http://nora.hd.uib.no/icame.html International Computer Archive of Modern English.
http://ota.ahds.ac.uk Oxford Text Archive.
http://childes.psy.cmu.edu Child Language Data Exchange System.

19.2. Markup.
19.3. Word Frequencies.
19.4. Programming languages.
19.4.1. C/C++
19.4.2. SNOBOL: historical predecessor of string languages.
19.4.3. MUMPS: good sorting, U. S. taxpayer-sponsored monopoly.
19.4.4. Perl: Cost-free, universal on internet, good string commands.


Chapter 20. Words and Phrases.

20.1. Collocations.
20.2. Word-sense disambiguation.
20.3. Lexical acquisition.

Chapter 21. Syntax.
21.1. Markov Models.
Markov chain: chain of events, A1, A2, A3, ..., with a limited memory, classically, only a single step.

Markov (1913) originally developed Markov chains to examine the sequence of letters in Russian literature.

The probability of letter/word n depends only upon the previous k letters/words.
21.2. Hidden Markov Model (HMM): probabilistic function of a Markov process.
21.3. HMMs are the dominant model in speech recognition research.
21.4. HMMs used in part-of-speech tagging of a document.
21.5. Forward Hidden Markov Model algorithm.
21.6. Backward Hidden Markov Model algorithm.
21.7. Probabilistic Context-Free Grammars.
21.8. Probabilistic Parsing.
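
A first-order Markov chain over words (memory of a single step, k = 1) can be estimated from counts of adjacent word pairs. The sketch below is illustrative only: the toy corpus and the function names are assumptions, and no smoothing is applied, so an unseen word pair gets probability zero.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Estimate P(next word | previous word) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: {word: c / sum(nexts.values()) for word, c in nexts.items()}
            for prev, nexts in counts.items()}

def sequence_probability(model, tokens):
    """Probability of the transitions in tokens, given the first token."""
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        # Unseen transitions contribute probability 0.0 (no smoothing).
        prob *= model.get(prev, {}).get(nxt, 0.0)
    return prob

# Hypothetical toy corpus of diagnosis lines, already tokenized.
corpus = ("chronic inflammation . mild chronic inflammation . "
          "chronic gastritis .").split()
model = train_bigram_model(corpus)

print(model["chronic"])
print(sequence_probability(model, ["mild", "chronic", "inflammation"]))
```

A Hidden Markov Model (21.2) adds a layer on top of this: the chain runs over unobserved states, such as part-of-speech tags, and each state emits the observed word with some probability.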

Chapter 22. Experience with JHSP/JHAR corpus.

22.1. Johns Hopkins Autopsy Resource (JHAR), posted 1995-2003.
22.2. Not publicly available now: HIPAA.
22.3. Requires Institutional Review Board (IRB) approval.
22.3.1. Why the project won't harm the patients.
22.3.2. Why the risk of harm is outweighed by presumed benefits.
22.4. Same for http://www.netautopsy.org/vhpsapsx.htm JHSP corpus.

Chapter 23. Statistical Inventory.

23.1. All Words: Zipf's Law.
23.2. Barrier Words: Zipf's Law.
23.3. Collocations: Zipf's Law.
23.4. Grammaticality: Zipf's Law.
23.5. BNF formulas: Zipf's Law.

23.2. Barrier Words: Zipf's Law.
RANK    FREQUENCY   BARRIER WORD
   1      222,175   and
   2      196,153   of
   3      189,799   with
   4      107,039   for
   5      104,067   the
   6       82,104   note
   7       80,740   in
   8       78,549   right
   9       77,885   left
  10       70,923   is
  11       70,261   see
  12       67,917   are
  13       53,071   mild
  14       49,987   identified
  15       47,804   to
  16       41,467   consistent
  17       39,792   this
  18       30,352   present
  19       27,189   seen
  20       25,371   at
  21       25,097   there
  22       24,657   on
  23       24,284   or
  24       23,021   be
  25       21,243   associated

23.3. Collocations: Zipf's Law.
RANK    FREQUENCY   COLLOCATION
   1       38,401   chronic inflammation
   2       20,328   lymph nodes
   3       18,428   diff quik
   4       16,104   soft tissue
   5       14,456   bone marrow
   6       13,104   non diagnostic
   7       13,021   diagnostic findings
   8       13,004   non diagnostic findings
   9       12,868   helicobacter pylori
  10       12,328   crypt distortion
  11       12,316   lymph node
  12       12,292   quik stain
  13       12,284   diff quik stain
  14       11,080   mild chronic
  15       10,229   epithelial changes
  16       10,004   fibroadipose tissue
  17        9,967   non specific
  18        9,052   left breast
  19        8,893   inflammatory disease
  20        8,741   gastroesophageal reflux

23.4. Grammaticality: Zipf's Law.
  RANK  FREQUENCY       SENTENCE-PATTERN   EXAMPLE 
     1    423,177                    [N]   hemangioma
     2    106,034                 [N[N]]   liver [needle]
     3     98,958                   [AN]   left foot
     4     85,908                  [N|V]   scar
     5     79,741                 [NN|V]   skin scar
     6     62,042                  [AAN]   epidermal inclusion cyst
     7     50,461                [AN[N]]   laryngeal mass [biopsy]
     8     41,958                  [NCN]   decidua and villi
     9     38,689                [A|NPN]   negative for actinomyces
    10     26,745               [N[NPN]]   cervix [biopsy at 9:00] 
    11     22,097                [N[NN]]   cervix [biopsy 9:00]
    12     21,704                 [NPAN]   skin of left ear
    13     21,102                   [NN]   ear lobe
    14     20,638                  [BAN]   non diagnostic findings
    15     16,864               [AAN[N]]   left chest wall [biopsy]
    16     13,674                 [AAAN]   left axillary soft tissue
    17     12,798              [NCAN[N]]   skin , left flank [biopsy]
    18     12,692                [ANCAN]   soft tissue , inguinal region
    19     12,596               [ANPAAN]   fibrous plaque from left carotid artery
    20     12,507   [N[N]ANCA|VANCA|NPN]   leg [ bka ] old thrombus and calcified atherosclerotic plaque , negative for osteomyelitis 

23.5. BNF formulas: Zipf's Law.
  RANK   FREQUENCY      BNF FORMULA   EXAMPLE
     1     689,478       [N] ==> []   [prostate]
     2     313,234      [AN] ==> []   [actinic keratosis]
     3     117,039     [AAN] ==> []   [hypertrophic actinic keratosis]
     4      86,762     [N|V] ==> []   [scar]
     5      80,127    [NN|V] ==> []   [skin scar]
     6      66,816     [NAN] ==> []   [skin soft tissue]
     7      60,129     [NCN] ==> []   [decidua and villi]
     8      55,728       [AN ==> [N   [actinic KERATOSIS
     9      52,777     [A|N] ==> []   [negative]
    10      47,375      [NN] ==> []   [granulation tissue]
    11      47,139       [A] ==> []   [void]
    12      42,661     [NPN] ==> []   [adenocarcinoma of colon]
    13      36,076    [AAAN] ==> []   [focal bowenoid actinic keratosis]
    14      31,946    [NPAN] ==> []   [skin with actinic keratosis]
    15      25,168     [BAN] ==> []   [focally invasive tumor]
    16      22,761    [NCAN] ==> []   [ulcer and acute inflammation]
    17      22,276     [ANN] ==> []   [exuberant granulation tissue]
    18      16,791       [NN ==> [N   [lung CARCINOMA
    19      15,577    [NAPN] ==> []   [carcinoma metastatic to lung]
    20      13,764     [NNN] ==> []   [liver gallbladder pancreas]

PHRASE STRUCTURE GRAMMAR, PARSING.
   [ adenocarcinoma     of   colon   metastatic   to   lung ]
   [        N            P       N       A         P     N  ]

PHRASE STRUCTURE GRAMMAR, UMLS CODES.
  [ ADENOCARCINOMA     OF       COLON   METASTATIC     TO          LUNG   ]
  [    C0001418     C0332285  C0009368   C0027627    C0332286    C0024109 ]

PHRASE STRUCTURE GRAMMAR, XML FORMAT.
  <code-section scheme="UMLS">
    <c type="morph" value="C0001418">adenocarcinoma
      <c type="topo" value="C0009368">colon
        <c type="morph" value="C0027627">metastatic
          <c type="topo" value="C0024109">lung
          </c>
        </c>
      </c>
    </c>
  </code-section>
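
Nested markup of this kind can be generated from a coded phrase with a standard XML library, which guarantees that every tag is closed and every attribute is quoted. The sketch below uses Python's xml.etree.ElementTree; the element name code-section, the function name nest_codes, and the rule that each coded word nests inside the previous one are assumptions read off the example above, not a published schema. The UMLS codes are those shown in the preceding slide.

```python
import xml.etree.ElementTree as ET

def nest_codes(coded_words, scheme="UMLS"):
    """Nest each (type, code, word) triple inside the element for the previous one."""
    root = ET.Element("code-section", {"scheme": scheme})
    parent = root
    for kind, code, word in coded_words:
        child = ET.SubElement(parent, "c", {"type": kind, "value": code})
        child.text = word
        parent = child
    return root

# Codes copied from the UMLS slide above; prepositions are omitted, as in the XML slide.
phrase = [
    ("morph", "C0001418", "adenocarcinoma"),
    ("topo", "C0009368", "colon"),
    ("morph", "C0027627", "metastatic"),
    ("topo", "C0024109", "lung"),
]

root = nest_codes(phrase)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Parsing xml_text back with ET.fromstring is a quick check that the output is well-formed.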

A NOTE OF PESSIMISM.

"Linguistic theories ... do not cover varieties of exceptional expressions which practical machine translation systems have to handle. A machine translation system, which is still imperfect and will never be completed, is exposed to very crude tests when the system construction reaches a certain stage. At that stage of development, the system is given a comparatively simple sentence for translation, with structures that can be analyzed by a grammar given to the system. After completion, people other than those who developed the system are asked to translate a variety of texts such as newspaper articles, science magazines, patent documents, contract documents, and commercial letters. Because the documents have not been adequately tested at the development stage, users are disappointed by the poor translation results produced by the system. Many of the failures of the system come from the fact that the dictionary and the grammar are not sufficient to accept such unexpected input sentences."


Chapter 24. Conclusions: Future of NLP in medicine.

24.1. Terabytes of text information in medicine annually.
24.2. Raw materials for epidemiologic studies.
24.3. Competition: fast turnaround time versus tolerating a grammatical filter (e.g., Microsoft® Word® email filter (ugh!)).
24.4. Acceptable phrase structure grammar rules: professional societies.
24.5. NLP reducible to synoptic reporting.
24.6. Physicians do not easily surrender control of their documents.
24.7. Prof. Siegel's (father of filmless radiology) Test: Who wins the first lawsuit?

Chapter 25. Problems for NLP in anatomic pathology.

25.1. Undetected associations between diseases, e.g., Mesothelioma-asbestos.
25.2. Does one "outgrow" cancer? Age-specific cancer incidences in an aging population.

Chapter 26. References.


Chapter 27. Mini-histories.


Chapter 28. Glossary.


CHAPTER 1.
INTRODUCTION.



1.1. Reasons for NLP in medicine.

A controversy is raging in anatomic pathology practice, and the fallout will eventually reach our colleagues in other medical specialties. Anatomic pathologists have always written their diagnostic reports in free text, in English or some other medically competent language (including Latin!). So far, my colleagues have successfully resisted the onslaught of data-miners and administrators who want us to write our diagnoses in standardized coding systems (CAP, 2005; Ackerman, 2005; Ackerman, 2004).

This controversy was a major topic at the most recent meeting of Advancing Practice, Instruction, and Innovation through Informatics (APIII, 2005). Standardized coding is a requirement for hospitals accredited as certified cancer centers by the College of American Pathologists (CAP, 2005) or by the American College of Surgeons (ACS, 2005). The driving forces are billing (Maung, 2005; Hardhats, 2005) and regulation (JCAHO, 2005). When do two diagnostic reports deserve the same compensation, and what is the mix of cases for a particular medical institution? It is hopeless to tabulate records of this complexity manually. And, in my opinion, it is equally hopeless to expect pathologists and other physicians to compose their reports by making selections from pick-lists.

CHAPTER 2.
LINGUISTIC SCIENCE.



2.1. Characterize and explain linguistic observations.


CHAPTER 3.
RULE-BASED SYSTEMS.



3.1. Grammars in Ancient Civilizations.


CHAPTER 4. GENERATIVE LINGUISTICS.



4.1. Chomsky: describe the innate language (I-language).


CHAPTER 5. ARTIFICIAL INTELLIGENCE.



5.1. Build small systems that behave intelligently.


CHAPTER 6. BASIC CONCEPTS.



6.1. Fundamental questions.


CHAPTER 7. COMPETENCE GRAMMAR.



7.1. Property of the rational speaker.


CHAPTER 8. AMBIGUITY OF LANGUAGE.



8.1. Verbs, gerunds, gerundives.


CHAPTER 9. CORPUS LINGUISTICS: INTRODUCTION.



9.1. Text corpora: Brown corpus.


CHAPTER 10. ZIPF'S LAWS.



10.1. Zipf's First Law.


10.2. Zipf's Second Law.


10.3. Zipf's Third Law.


CHAPTER 11. COLLOCATIONS.



11.1. Definition: Multiple word sequence.


CHAPTER 12. CONCORDANCES.



12.1. Biblical.


CHAPTER 13. MATHEMATICAL FOUNDATIONS.



13.1. Probability Theory.


CHAPTER 14. STATISTICS.



14.1. Estimation.


CHAPTER 15. GENERAL LINGUISTICS.



15.1. Parts-of-speech, morphology.


CHAPTER 16. PHRASE STRUCTURE GRAMMAR.



16.1. Grammar reduced to a sequence of phrases.


CHAPTER 17. CONTEXT-FREE GRAMMAR.



17.1. Surrounding context is irrelevant.


CHAPTER 18. DEPENDENCY GRAMMAR.



18.1. Definition: dependency between words.


CHAPTER 19. CORPUS LINGUISTICS: SOURCES AND METHODS.



19.1. Johns Hopkins Autopsy Resource.


CHAPTER 20. WORDS AND PHRASES.



20.1. Collocations.


CHAPTER 21. SYNTAX.



21.1. Markov models.


CHAPTER 22. JHAR/JHSP CORPORA.



22.1. JHAR.


CHAPTER 23. STATISTICAL INVENTORY.



23.1. All words: Zipf's Law.


23.2. Barrier words: Zipf's Law.


23.3. Collocations: Zipf's Law.


23.4. Grammaticality: Zipf's Law.


23.5. BNF formulas: Zipf's Law.


CHAPTER 24. FUTURE OF NLP IN MEDICINE.



24.1. Terabytes of medical text annually.


CHAPTER 25. PROBLEMS FOR NLP IN PATHOLOGY.



25.1. Undetected associations.


CHAPTER 26.
REFERENCES.



PubMed.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

Ackerman AB.
Protocols for the reporting of cutaneous melanoma.
Am J Clin Pathol. 2004 Nov;122(5):815-7. No abstract available.
PMID: 15540388.

Ackerman AB.
Dermatologist not equal to dermatopathologist: no place in a profession for pretenders.
J Am Acad Dermatol. 2005 Oct;53(4):698-699.
PMID: 16198796.

Ackerman AB.
Garble that derives from lack of definition.
Am J Dermatopathol. 2005 Aug;27(4):369-370.
PMID: 16121068.

Ackerman AB.
The future of pathology as a discipline: none without a dictionary!
Cesk Patol. 2005 Jan;41(1):4-5.
PMID: 15816116.

Ackerman AB.
Reviewer conflicts of interest should be disclosed.
J Am Acad Dermatol. 2005 Mar;52(3 Pt 1):538; author reply 538; discussion 538-539.
PMID: 15761446.

Ackerman AB.
Decline of a discipline: abetment by journals.
J Cutan Pathol. 2005 Mar;32(3):254; author reply 254.
PMID: 15701091.

Kung JX, Ackerman AB.
Staging of melanoma: a critique of the most recent (2002) system proposed by the American Joint Committee on Cancer: part II.
Am J Dermatopathol. 2005 Apr;27(2):165-167.
PMID: 15798445.

Bakotic B, Ackerman AB.
Staging of melanoma: a critique in historical perspective: part I.
Am J Dermatopathol. 2005 Apr;27(2):160-164.
PMID: 15798444.

Dabbs DJ, Geisinger KR, Ruggiero F, Raab SS, Nalesnik M, Silverman JF; Association of Directors of Anatomic and Surgical Pathology.
Recommendations for the reporting of tissues removed as part of the surgical treatment of malignant liver tumors.
Hum Pathol. 2004 Nov;35(11):1315-1323.
PMID: 15668887.
ADASP Reporting protocol.

Wei JT, Miller EA, Woosley JT, Martin CF, Sandler RS.
Quality of colon carcinoma pathology reporting: a process of care study.
Cancer. 2004 Mar 15;100(6):1262-1267.
PMID: 15022295.
ADASP Reporting protocol.

Jaffe ES, Banks PM, Nathwani B, Said J, Swerdlow SH.
Recommendations for the reporting of lymphoid neoplasms: A report from the Association of Directors of Anatomic and Surgical Pathology.
Mod Pathol. 2004 Jan;17(1):131-135.
PMID: 14657953.
ADASP Reporting protocol.

Lawrence WD; Association of Directors of Anatomic and Surgical Pathology.
ADASP recommendations for processing and reporting of lymph node specimens submitted for evaluation of metastatic disease.
Virchows Arch. 2001 Nov;439(5):601-603. Review.
PMID: 11764377.
ADASP Reporting protocol.

Association of Directors of Anatomic and Surgical Pathology.
ADASP recommendations for processing and reporting lymph node specimens submitted for evaluation of metastatic disease.
Am J Surg Pathol. 2001 Jul;25(7):961-963.
PMID: 11420470.

ADASP Committee. The Association of Directors of Anatomic and Surgical Pathology.
ADASP recommendations for processing and reporting of lymph node specimens submitted for evaluation of metastatic disease.
Mod Pathol. 2001 Jun;14(6):629-632.
PMID: 11406667.
ADASP Reporting protocol.

Kishi K.
Comments regarding the American Association of Directors of Anatomic and Surgical Pathology (ADASP) recommendations for the reporting of urinary bladder specimens containing bladder neoplasms: comparison with the Japanese General Rule for Clinical and Pathological Studies on Bladder Cancer.
Pathol Int. 1997 May;47(5):332.
PMID: 9143031.
ADASP Reporting protocol.

Association of Directors of Anatomic and Surgical Pathology.
Recommendations for the reporting of resected large intestinal carcinomas. Association of Directors of Anatomic and Surgical Pathology.
Am J Clin Pathol. 1996 Jul;106(1):12-15.
PMID: 8701921.
ADASP Reporting protocol.

Association of Directors of Anatomic and Surgical Pathology.
Recommendations for the reporting of breast carcinoma. Association of Directors of Anatomic and Surgical Pathology.
Am J Clin Pathol. 1995 Dec;104(6):614-619.
PMID: 8526202.
ADASP Reporting protocol.

Simpson PR, Tschang TP.
ADASP recommendations: consultations in surgical pathology. Association of Directors of Anatomic and Surgical Pathology.
Hum Pathol. 1993 Dec;24(12):1382.
PMID: 8276389.
ADASP Reporting protocol.

Aitchison J.
Teach Yourself Linguistics. Fifth Edition.
Chicago: NTC/Contemporary Publishing Co. 2000.
ISBN: 0844226688.

Bengtsson S, Schneider W, Spencer WA, Pratt AW, Kastner VV, Reichertz P, Lamson BG, Anderson J.
The application of computer techniques in health care.
World Hosp. 1976;12(1):47-51.
PMID: 1024332.

Berman JJ, Moore GW.
Object-oriented controlled-vocabulary translator using TRANSOFT + HyperPAD.
Proc Annu Symp Comput Appl Med Care. 1991;15:973-975.
PMID: 1807773.

Berman JJ.
Tumor classification: molecular analysis meets Aristotle.
BMC Cancer. 2004 Mar 17;4:10.
PMID: 15113444.

Borst F, Lyman M, Nhan NT, Tick LJ, Sager N, Scherrer JR.
TEXTINFO: a tool for automatic determination of patient clinical profiles using text analysis.
Proc Annu Symp Comput Appl Med Care. 1991;:63-67.
PMID: 1807679.

Bundy A, ed.
Artificial Intelligence Techniques: A Comprehensive Catalogue. Fourth, Revised Edition.
Heidelberg: Springer Verlag. 1997.
ISBN: 3540593233.

Chi EC, Sager N, Tick LJ, Lyman MS.
Relational data base modelling of free-text medical narrative.
Med Inform (Lond). 1983 Jul-Sep;8(3):209-223.
PMID: 6600043.

Chomsky N.
Morphophonemics of Modern Hebrew.
Undergraduate Honors Essay. University of Pennsylvania. 1949. Cited in: Newmeyer FJ. Generative Linguistics. A Historical Perspective. London: Routledge. 1996.

Chomsky N.
Syntactic Structures.
The Hague: Mouton. 1957.

Chomsky N.
The development of grammar in child language: Formal discussion.
Monogr Soc Res Child Dev. 1964;29:35-39.
PMID: 14125365.

Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: MIT Press. 1965.

Chomsky N.
Language and Mind.
San Diego: Harcourt Brace Jovanovich. 1968.

Chomsky N.
Rules and Representations.
New York: Columbia University Press. 1980.

Chomsky N.
Knowledge of Language: Its Nature, Origin, and Use.
New York: Praeger. 1986.

Chomsky N.
The Minimalist Program.
Cambridge, MA: MIT Press. 1995.

Chomsky N.
Universals of human nature.
Psychother Psychosom. 2005;74(5):263-268.
PMID: 16088263.

Cios KJ, Moore GW.
Medical Data Mining and Knowledge Discovery: Overview.
Chapter 1. In: Cios KJ. Medical Data Mining and Knowledge Discovery. Berlin: Springer Verlag. 2000;1:1-16.
ISBN: 3-7908-1340-0, 502 pages.
Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company.

Condon EU.
Statistics of vocabulary.
Science 1928;67:300.

Craig J, Bevington W.
Designing with type. A basic course in typography. Fourth edition.
New York: Watson-Guptill Publications. 1999.
ISBN 0-8230-1347-2, 176 pages.
Chapter 1. Origins of the Alphabet. pp. 8-11.

Dunham GS, Pacak MG, Pratt AW.
Automatic indexing of pathology data.
J Am Soc Inf Sci. 1978 Mar;29(2):81-90.
PMID: 10318395.

Estoup JB.
Gammes Sténographiques. Fourth Edition.
Paris. 1916.

Fedorowicz J.
A Zipfian model of an automatic bibliographic system: An application to MEDLINE.
J Am Soc Info Sci 1982;33:223-232.

Fitch WT, Hauser MD, Chomsky N.
The evolution of the language faculty: Clarifications and implications.
Cognition. 2005 Sep;97(2):179-210.
PMID: 16112662.

Giere W.
Foundations of clinical data automation in cooperative programs.
Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

Graepel PH, Henson DE, Pratt AW.
Comments on the use of the Systematized Nomenclature of Pathology.
Methods Inf Med. 1975 Apr;14(2):72-75.
PMID: 1207468.

Description of VistA® Filemanager.
http://www.hardhats.org
Includes instructions for obtaining at-cost copies of the complete, public-domain system, through the Freedom of Information Act.

Harris Z.
Methods in Structural Linguistics.
Chicago: University of Chicago Press. 1951.

Hauser MD, Chomsky N, Fitch WT.
The faculty of language: what is it, who has it, and how did it evolve?
Science. 2002 Nov 22;298(5598):1569-1579. Review.
PMID: 12446899.

Hirschman L, Story G, Marsh E, Lyman M, Sager N.
An experiment in automated health care evaluation from narrative medical records.
Comput Biomed Res. 1981 Oct;14(5):447-463.
PMID: 7273723.

Huff D.
How to lie with statistics.
New York: W. W. Norton & Company. 1954.
ISBN 0-393-31072-8, 142 pages.

Hutchins WJ.
Machine Translation: Past, Present, Future.
Chichester/New York: Ellis Horwood/Wiley. 1986. Ellis Horwood Series in Computers and Their Applications. ASIN: 0135435218.

Hutchins GM, Berman JJ, Moore GW, Hanzlick R, the Autopsy Committee of the College of American Pathologists.
Practice Guidelines for Autopsy Pathology.
Arch Pathol Lab Med. 1999; 123:1085-1092.

Joseph DM, Wong RL.
Correction of misspellings and typographical errors in a free-text medical English information storage and retrieval system.
Methods Inf Med. 1979 Oct;18(4):228-234.

Justeson JS, Katz SM.
Technical terminology: some linguistic properties and an algorithm for identification in text.
Natural Language Engineering. 1995;1:9-27.

December 7, 2003: The master critic - The late Hugh Kenner's theory of everything. By John Wilson. The Boston Globe / available from Boston.com.
"When Hugh Kenner died on Nov. 24, a few weeks shy of his 81st birthday, the first problem for writers of obituaries and tributes was how to categorize him. ... He was himself a 'pattern recognizer,' as he described inventor Raymond Kurzweil in the December 1990 issue of the pioneering personal computer magazine Byte. ... This openness to experience, this confidence that the patterns he saw derived from some ultimate coherence, must have been owing in part to Kenner's faith, a subject about which he was reticent in his writing. ... [W]hile some of his coreligionists were wringing their hands about the implications of artificial intelligence -- and while MIT's Marvin Minsky was proclaiming that human beings are machines made out of meat -- Kenner was busy devising, with Joseph O'Rourke, a computer program called TRAVESTY, which manipulates a text to create odd effects of language. Later, with Charles Hartman, Kenner published a volume of computer-generated poetry, 'Sentences.'"

Kenner's Corollary: Article in Discover Magazine, circa 1985: The idea that a desk with an "archeologic ordering" of papers, i.e., chronological with most recently used papers at the top of the pile, is a demonstration of Zipf's Law. That is, the 90% of papers used most often typically appear in the top 10% of the pile.

Kucera H, Francis WN.
Computational Analysis of Present-Day American English.
Providence, RI: Brown University Press. 1967.

Laird CG.
The miracle of language.
Fawcett Publications. 1965.
ASIN: B0007I1X2Y, 255 pages.

Lewis CI, Langford CH.
Symbolic Logic. Second Edition.
New York: Dover Publications, Inc. 1932.

Li W.
Zipf's Law Bibliography.
http://linkage.rockefeller.edu/wli/zipf/index_ru.html

Lyman M, Sager N, Tick L, Nhan N, Borst F, Scherrer JR.
The application of natural-language processing to healthcare quality assessment.
Med Decis Making. 1991 Oct-Dec;11(4 Suppl):S65-S68.
PMID: 1770852.

Mandelbrot B.
Structure formelle des textes et communication.
Word 1954;10:1-27.

Manning CD, Schütze H.
Foundations of Statistical Natural Language Processing.
Cambridge, MA: The MIT Press. 2000.
ISBN: 0262133601, 680 pages.
http://www-nlp.stanford.edu/fsnlp/intro/

Markov AA.
An example of statistical investigation in the text of Eugene Onyegin, illustrating coupling of tests in chains.
Proc Acad Sci St Petersburg 1913;7:153-162.
Markov was a student of Tschebyscheff.

Masarie FE Jr, Miller RA, Bouhaddou O, Giuse NB, Warner HR.
An Interlingua for Electronic Interchange of Medical Information: Using Frames to Map Between Clinical Vocabularies.
Comp Biomed Res 1991; 24(4):379-400.

Maung RTA.
What is the best indicator to determine anatomic pathology workload? Canadian experience.
Am J Clin Pathol. 2005;123:45-55.

Upstate Medicare Division.
Sample CPT® Fee Schedule: Upstate Medicare Division, 2004 Fee Schedule.
http://www.umd.nycpic.com/2004_80000-89999.html
Accessed January 18, 2005.
From:
http://www.umd.nycpic.com/
Note: CPT® NUMBER and CPT® DESCRIPTOR are copyrighted products of the American Medical Association.

Minsky M, Hillis D, Rudisch G.
Artificial intelligence.
N Engl J Med. 1980 Jun 26;302(26):1482.
PMID: 7374720.

Moore GW, Miller RE, Hutchins GM, Riede UN, Polacsek RA.
Multilingual translation techniques in the analysis of narrative medical text.
Proc Annu Symp Comput Appl Med Care. 1985;9:. November 10-13, 1985, Baltimore, MD.

Moore GW, Miller RE, Hutchins GM.
Microcomputer translator for medical text: Theorem verification for Chapter Two of Zeman's Modal Logic.
Adv Math Comput Med. 1986;7:1621-1633.

Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Automated translation of German to English medical text.
Am J Med. 1986 Jul;81(1):103-111.
PMID: 3755289.

Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Group theory approach to computer translation of medical German.
Methods Inf Med. 1986 Jul;25(3):176-182.
PMID: 3755498.

Moore GW, Polacsek RA, Erozan YS, de la Monte SM, Miller RE, Hutchins GM, Riede UN.
Multilingual translation techniques in the analysis of narrative medical text.
Comput Methods Programs Biomed. 1986 Mar;22(1):35-42.
PMID: 3634670.

Moore GW, Hutchins GM, Boitnott JK, Miller RE, Polacsek RA.
Word root translation of 45,564 autopsy reports into MeSH titles.
Proc Annu Symp Comput Appl Med Care. 1987;11:. Washington DC, November 1-4, 1987.

Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
Integrated anatomic pathology reporting system using natural language diagnoses.
Modern Pathol 1988;1:44-50.

Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method.
In: Scherrer JR, Cote RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. North-Holland. 1989;:29-39.

Moore GW, Wakai I, Satomura Y, Giere W.
TRANSOFT: Medical translation expert system.
Artif Intell Med. 1989;1:149-157.

Moore GW.
TRANSOFT: Public-domain English-to-SNOMED computer translation shell, using the DVA File Manager. Abstract.
Mod Pathol. 1991;4:123A.

Moore GW.
Medical Expert System User Interface. Editorial.
Artif Intell Med. 1991:15;.

Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years.
Arch Pathol Lab Med. 1996;120:782-785.
http://www.medparse.com/protoiad.htm

Moore GW, Berman JJ.
Anatomic Pathology Data Mining.
Chapter 4. In: Cios KJ. Medical Data Mining and Knowledge Discovery. Berlin: Springer Verlag. 2000;4:61-107.
ISBN: 3-7908-1340-0, 502 pages.
Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company.
http://www.medparse.com/apdmchap.htm

Nagao M.
Machine Translation.
In: Shapiro SC, ed. Encyclopedia of Artificial Intelligence. Volume 2. M-Z. New York: Wiley-Interscience. 1992;2:898-902.
A nice quote from one of the leaders in the field, that captures the fruitlessness of open-ended programs for computer translation:
"Linguistic theories ... do not cover varieties of exceptional expressions which practical machine translation systems have to handle. A machine translation system, which is still imperfect and will never be completed, is exposed to very crude tests when the system construction reaches a certain stage. At that stage of development, the system is given a comparatively simple sentence for translation, with structures that can be analyzed by a grammar given to the system. After completion, people other than those who developed the system are asked to translate a variety of texts such as newspaper articles, science magazines, patent documents, contract documents, and commercial letters. Because the documents have not been adequately tested at the development stage, users are disappointed by the poor translation results produced by the system. Many of the failures of the system come from the fact that the dictionary and the grammar are not sufficient to accept such unexpected input sentences."


Naur P.
Revised Report on the Algorithmic Language ALGOL 60.
Comm ACM. 1960 May;3(5):299-314.

Nelson SJ, Olson NE, Fuller L, Tuttle MS, Cole WG, Sherertz DD.
Identifying concepts in medical knowledge.
Medinfo. 1995;8:33-36.

Newmeyer FJ.
Generative Linguistics. A Historical Perspective.
London: Routledge. 1996.

Pacak MG, Pratt AW.
Identification and transformation of terminal morphemes in medical English part II.
Methods Inf Med. 1978 Apr;17(2):95-100.
PMID: 661609.

Pareto V.
Cours d'économie politique.
Geneva: Droz. 1896. Lausanne and Paris: Rouge. 1897.
Pareto's Principle, a predecessor of Zipf's Law.

Pratt AW, Pacak M.
Identification and transformation of terminal morphemes in medical English.
Methods Inf Med. 1969 Apr;8(2):84-90.
PMID: 5819388.

Pratt AW.
Interactive data processing in the medical research institution.
Methods Inf Med Suppl. 1976;10:65-76.
PMID: 1078477.

Sager N, Bross ID, Story G, Bastedo P, Marsh E, Shedd D.
Automatic encoding of clinical narrative.
Comput Biol Med. 1982;12(1):43-56.
PMID: 7075165.

Sager N, Wong R.
Developing a database from free-text clinical data.
J Clin Comput. 1983;11(5-6):184-194.
PMID: 10278191.

Sager N, Lyman M, Tick LJ, Nhan NT, Bucknall CE.
Natural language processing of asthma discharge summaries for the monitoring of patient care.
Proc Annu Symp Comput Appl Med Care. 1993;:265-268.
PMID: 8130474.

Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ.
Natural language processing and the representation of clinical data.
J Am Med Inform Assoc. 1994 Mar-Apr;1(2):142-160. Review.
PMID: 7719796.

Sager N, Lyman M, Nhan NT, Tick LJ.
Automatic encoding into SNOMED III: a preliminary investigation.
Proc Annu Symp Comput Appl Med Care. 1994;:230-234.
PMID: 7949925.

Sager N, Lyman M, Nhan NT, Tick LJ.
Medical language processing: applications to patient data representation and automatic encoding.
Methods Inf Med. 1995 Mar;34(1-2):140-146.
PMID: 9082123.

Salton G.
Automatic text analysis.
Science. 1970 Apr 17;168(929):335-343.
PMID: 5435890.

Salton G.
Experiments in automatic thesaurus construction for information retrieval.
In: Proceedings IFIP Congress, 1971;:43-49.

Salton G, ed.
The Smart Retrieval System - Experiments in Automatic Document Processing.
Englewood Cliffs, NJ: Prentice-Hall. 1971.

Salton G, McGill MJ.
Introduction to modern information retrieval.
New York: McGraw-Hill. 1983.

Salton G, Fox EA, Wu H.
Extended boolean information retrieval.
Communications of the ACM 1983;26:1022-1036.

Salton G, Buckley C, Fox EA.
Automatic query formulations in information retrieval.
J Am Soc Inf Sci. 1983 Jul;34(4):262-280.
PMID: 10299297.

Salton G.
Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer.
Reading, MA: Addison Wesley. 1989.

Salton G, Buckley C.
Global text matching for information retrieval.
Science. 1991;253:1012-1015.

Salton G, Allan J.
Selective text utilization and text traversal.
In: Proceedings of ACM Hypertext 93, New York.
New York: Association for Computing Machinery. 1993.

Salton G, Allan J, Buckley C, Singhal A.
Automatic analysis, theme generation and summarization of machine-readable texts.
Science 1994;264:1421-1426.

Sawyer R, Berman JJ, Borkowski A, Moore GW.
Elevated prostate-specific antigen levels in black men and white men.
Mod Pathol. 1996 Nov;9(11):1029-1032.
http://www.medparse.com/elevpsal.htm

Sorace JM, Berman JJ, Carnahan GE, Moore GW.
PRELOG: precedence logic inference software for blood donor deferral.
Proc Annu Symp Comput Appl Med Care. 1991;:976-977.
PMID: 1807774.

Suppes P.
Introduction to Logic.
New York: Van Nostrand. 1957.

Suppes P.
Probabilistic grammars for natural languages.
Synthese 1970;22:95-116.

Suppes P.
Axiomatic Set Theory.
New York: Dover Publications. 1972.
ISBN: 0486616304.

Suppes P.
Probabilistic Metaphysics.
Oxford: Blackwell. 1984.

Suppes P, Bottner M, Liang L.
Machine learning comprehension grammars for ten languages.
Computational Linguistics. 1996;22:329-350.

Taylor M, Saltz J, Nichols JH.
Design of an Integrated Clinical Data Warehouse.
J Assn Lab Automation. 2000. in press.

Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc Annu Symp Comput Appl Med Care. 1988;12:207-211. Washington DC, November 6-9, 1988.

Twain M.
Life on the Mississippi.
New York: Signet Classics, Reissue edition (November 7, 2001). Twain M, Kaplan J.
ISBN: 0451528174, 359 pages. See:
http://en.wikipedia.org/wiki/Mark_Twain

Tymoczko T, ed.
New Directions in the Philosophy of Mathematics.
Princeton, NJ: Princeton University Press. 1998.

U. S. National Library of Medicine.
Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/

U. S. National Library of Medicine.
UMLS Knowledge Sources. Eleventh Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2000.

U. S. National Library of Medicine.
UMLS Knowledge Sources. Tenth Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 1999.

U. S. National Library of Medicine.
UMLS Knowledge Sources. Ninth Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 1998.

Wilbur WJ.
Overview of Books at NCBI.
http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

Wingert F.
[PAULA: program for evaluation of logical expressions. Plausibility-control and evaluation of optical mark reader forms]
Methods Inf Med. 1972 Apr;11(2):96-103.
PMID: 5026579.

Wingert F, Ries P.
[Pathology findings system]
Methods Inf Med. 1973 Jul;12(3):150-155. German.
PMID: 4729117.

Wingert F.
[Morphosyntactical analysis of compound word forms in medical language]
Methods Inf Med. 1977 Oct;16(4):248-255. German.
PMID: 337050.

Wingert F.
Morphologic analysis of compound words.
Methods Inf Med. 1985 Jul;24(3):155-162.
PMID: 4033445.

Wingert F.
Automated indexing based on SNOMED.
Methods Inf Med. 1985 Jan;24(1):27-34.
PMID: 3982279.

Wingert F.
An indexing system for SNOMED.
Methods Inf Med. 1986 Jan;25(1):22-30.
PMID: 3753739.

Wingert F.
Automated indexing of SNOMED statements into ICD.
Methods Inf Med. 1987 Jul;26(3):93-98.
PMID: 3670105.

Wingert F.
Medical linguistics: automated indexing into SNOMED.
Crit Rev Med Inform. 1988;1(4):333-403.
PMID: 3288353.

Wittgenstein L.
Philosophical Investigations [Philosophische Untersuchungen]. Third edition.
Oxford: Basil Blackwell. 1968.

Wong RL, Gaynon P.
An automated parsing routine for diagnostic statements of surgical pathology reports.
Methods Inf Med. 1971 Jul;10(3):168-175.

Wong RL, Reno JD, Hain TC, Platt RC, Gaynon PS, Joseph DM.
Profile of a dictionary compiled from scanning over one million words of surgical pathology narrative text.
Comput Biomed Res. 1980 Aug;13(4):382-398.

Yu CC-Y, Moore GW, Unschuld PU.
Romanized Chinese respelling rules for an English medical word list.
Proc Annu Symp Comput Appl Med Care. 1987;11:. Washington DC, November 1-4, 1987.

Zhang Q.
Easy entry of Chinese character set symbols.
Proc 5th Ann Symp Comp Appl Med 1981;5:143-149.

Zipf GK.
Relative frequency as a determinant of phonetic change.
Harvard Studies in Classical Philology 1929;40:1-95.

Zipf GK.
Selected Studies of the Principle of Relative Frequency in Language.
1932.

Zipf GK.
The Psycho-Biology of Language.
Boston, MA: Houghton Mifflin. 1935.
Cambridge, MA: MIT Press. 1965.

Zipf GK.
National Unity and Disunity: The Nation As a Bio-Social Organism.
Bloomington, IN: Principia Press. 1941.

Zipf GK.
Human Behavior and The Principle of Least Effort. An Introduction to Human Ecology.
Reading, MA: Addison-Wesley Press. 1949;:19-55.

Campbell JR, Carpenter P, Sneiderman C, Cohn S, Chute CG, Warren J.
Phase II evaluation of clinical coding schemes: completeness, taxonomy, mapping, definitions, and clarity. CPRI Work Group on Codes and Structures.
J Am Med Inform Assoc. 1997 May-Jun;4(3):238-51.
http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=9147343

Chute CG, Cohn SP, Campbell KE, Oliver DE, Campbell JR.
The content coverage of clinical classifications. For The Computer-Based Patient Record Institute's Work Group on Codes & Structures.
J Am Med Inform Assoc. 1996 May-Jun;3(3):224-33.
PMID: 8723613.
http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=8723613

Campbell JR, Payne TH.
A Comparison of Four Schemes for Codification of Problem Lists.
Proc SCAMC 1994, Washington, DC, p. 201-205.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=7949920&query_hl=4

Humphreys BL, McCray AT, Cheh ML.
Evaluating the coverage of controlled health data terminologies: report on the results of the NLM/AHCPR large scale vocabulary test.
J Am Med Inform Assoc. 1997 Nov-Dec;4(6):484-500.
http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=9391936

U. S. National Library of Medicine.
Papers covering UMLS/SNOMED/Read Codes in different domains:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Display&dopt=pubmed_pubmed&from_uid=9147343

Apelon Resources:
http://www.apelon.com/literature/conferencepapers.htm

Evaluation of SNOMED coverage of VHA Terms.
http://www.apelon.com/literature/papers/FinalVA_SNOMEDPaper.pdf

Lingologix is a commercial tool that uses NLP to map text to SNOMED CT; it is in clinical use at Mayo and Hopkins:
http://www.lingologix.com/

CHAPTER 27.
MINI-HISTORIES.



Genesis 11:1-9 [circa 4000 BC]. Tower of Babel. According to this story, all persons on earth once spoke a single language. The people attempted to build a tower reaching to heaven. Because of their arrogance, God punished them by confounding their languages, and their building project failed.
There are over 2000 distinct written languages on earth today.
In this story, different languages are viewed as a curse, a barrier to understanding.

Aristotle (Αριστοτελης) [384 BC - 322 BC]. Greek philosopher, who compiled an encyclopedia of all scientific and other human knowledge available at that time. Aristotle's Rule: for every positive y such that x > y, there exists an integer n > 0 such that n·y > x. Note that if y = 0, the rule fails, because n·0 = 0 for every n. This and other pernicious properties of zero caused Aristotle to avoid the concept. Zero was rediscovered and developed almost a millennium later by Indian and Arabic mathematicians.
See: http://en.wikipedia.org/wiki/Aristotle

Rosetta Stone [196 BC]. The Rosetta Stone is a dark granite stone with writing in two languages, Egyptian and Greek, using three scripts: Hieroglyphic Egyptian, Demotic Egyptian, and Greek. Because Greek was well known, the stone was important to scholars for deciphering the hieroglyphs. Ptolemy V assumed the crown at age five, and was faced with the task of reclaiming lands lost to various invaders. In an attempt to reestablish legitimacy for Ptolemy, his priests issued a series of decrees, inscribed on stones and distributed throughout Egypt. The Rosetta Stone is the decree issued in the city of Memphis. The stone describes various taxes repealed by Ptolemy V, and instructs that his statues be erected in temples in three languages.

"Rosetta" is iconic for "translation", and some computerized translation systems have "Rosetta" as part of their name.
See: http://en.wikipedia.org/wiki/Rosetta_Stone

Qin Shi-Huang (秦始皇) [260 BC - 210 BC]. First emperor of China (Qin = Ch'in) and founder of the Qin Dynasty, who unified the country administratively and linguistically, in part by burning all books which disagreed with his regime. The advantage of this linguistic unification is that a document written in one part of China can be read anywhere else in China (assuming that the readers are literate), even though the spoken languages (so-called dialects) are mutually unintelligible. Everyone was REQUIRED to adopt the imperial ideograms, or else: some 460 scholars are said to have been executed. (The Ten Crimes of Qin.) See:
http://en.wikipedia.org/wiki/Qin_Shi_Huang

The subject of the rise of Emperor Qin, and the conflict of scholarship versus political unification, is treated in the movie Ying Xiong (2002) (Hero, starring Jet Li, Mandarin with English subtitles).
See: http://www.imdb.com/title/tt0299977/

Acts 2:1-15. [circa 35 AD] The Christian Pentecost miracle, where the Holy Spirit descends upon a group of disciples, and allows them to preach in many different languages. In contrast to the Tower of Babel, this Biblical reference is a positive reference to the multiple languages of the earth.

Masada. [72 AD] Site of an apparent mass suicide among first-century Jews, who chose death rather than be conquered and subjugated to the spiritual and linguistic demands of the Roman Empire. Chronicled by Flavius Josephus, a first-century Jewish historian, based upon eye-witness accounts. Masada (Hebrew: מצדה = fortress) was built by Herod the Great between 37 and 31 BC as a refuge for himself, in case his subjects should rise up against him. In 66 AD, a group of Jewish rebels took Masada from the Roman garrison, and used Masada as their base for raiding and harassing local settlements. In 72 AD, the Roman governor of Judaea, Lucius Flavius Silva, marched against Masada and eventually built a rampart against the western plateau, using thousands of tons of stones and beaten earth. Silva finally breached the wall of the fortress with a battering ram. When the Romans entered the fortress, they discovered that its defenders had set all the buildings ablaze and committed mass suicide, rather than face certain capture or defeat. See:
http://en.wikipedia.org/wiki/Masada

Gaius Suetonius Tranquillus: Lives of the Grammarians and Rhetoricians. "The science of grammar was in ancient times far from being in vogue at Rome; indeed, it was of little use in a rude state of society, when the people were engaged in constant wars, and had not much time to bestow on the cultivation of the liberal arts. At the outset, its pretensions were very slender, for the earliest men of learning, who were both poets and orators, may be considered as half-Greek: I speak of Livius and Ennius, who are acknowledged to have taught both languages as well at Rome as in foreign parts. But they only translated from the Greek, and if they composed anything of their own in Latin, it was only from what they had before read. For although there are those who say that this Ennius published two books, one on "Letters and Syllables," and the other on "Metres," Lucius Cotta has satisfactorily proved that they are not the works of the poet Ennius, but of another writer of the same name...."
Translation by Alexander Thompson, MD.
See: http://en.wikipedia.org/wiki/Suetonius
http://classicpersuasion.org/pw/cicero/suetoniusrhetor.htm

Rev. Thomas Bayes. British Anglican priest who developed the theory of conditional probability.
See: http://en.wikipedia.org/wiki/Bayes

Benjamin Disraeli, Earl of Beaconsfield (1804-1881). Conservative British Prime Minister during the Victorian Era. "There are lies, damn lies, and statistics." It is not an accident that statistics developed in Great Britain, and that the world's best statisticians still live and work there. Great Britain is an island nation, and has always made its national livelihood from maritime trade. Ships at sea, like dice at a gaming table, are subject to chance occurrences. In his career, Disraeli must have seen more than his share of deceptive statistics. See:
http://en.wikipedia.org/wiki/Benjamin_Disraeli

John Maynard Keynes (1883-1946). "In the long run, we're all dead." British economist, who developed concepts of national fiscal and monetary policy. Many economic theories distinguish between short-run and long-run processes, without really specifying how long is long-run. This quote is Keynes's ridicule of this particular paradox of academic economics. See:
http://en.wikipedia.org/wiki/John_Maynard_Keynes

Karl Pearson. Early twentieth century British statistician, who introduced the correlation coefficient, or Pearson's r. Father of E. S. Pearson, another twentieth century statistical giant. See:
http://en.wikipedia.org/wiki/Karl_Pearson

Andrey N. Kolmogorov. Great twentieth-century Russian statistician and mathematician, who introduced many non-parametric methods in statistics, including the Kolmogorov-Smirnov test. See:
http://en.wikipedia.org/wiki/Kolmogorov

George Boole [1815-1864] British mathematician and philosopher. As the inventor of Boolean algebra, the basis of all modern computer arithmetic, Boole is regarded as one of the founders of the field of computer science, although computers did not exist in his day.
See: http://en.wikipedia.org/wiki/George_Boole

Col. John Shaw Billings [1838 - 1913]. U. S. surgeon and librarian, born in Indiana. In the Civil War, Billings was medical inspector of the Army of the Potomac. After the war, he directed the Surgeon General's Library in Washington, DC. The catalog entries greatly increased under his supervision by 1873, and soon thereafter, Billings began work on the Index Catalogue. Sixteen volumes appeared before his military retirement. In 1879, he initiated the Index Medicus, a monthly guide to current medical literature, which eventually became PubMed, curated by the U. S. National Library of Medicine. Dr. Billings designed plans for the construction of Johns Hopkins Hospital. His works include classic essays on hospital administration and training. Under his leadership (1864 - 1895), the National Library of Medicine became one of the greatest medical library systems in the world.

Émile Baudot. French inventor of the Baudot code (1870), a five-bit code that was used extensively in telegraph systems.

Ludwig Josef Johann Wittgenstein [1889 - 1951] was an Austrian philosopher, who contributed several ground-breaking works to modern philosophy, primarily on the foundations of logic and the philosophy of language. He is widely regarded as one of the most influential philosophers of the 20th century. See:
http://en.wikipedia.org/wiki/Ludwig_Wittgenstein

George Kingsley Zipf [1902-1950] was an American linguist and philologist, who studied the statistical properties of different languages. He is the eponym of Zipf's Law (actually, Zipf's First Law), which states that only a few words are used very often, whereas many or most words are used rarely, according to the formula:
f = k/r
where f is word-frequency, r is word-rank, and k is a constant. Zipf's work was treated harshly when it first appeared, perhaps somewhat justifiably because Zipf's claims were so grandiose: namely, an explanation for all linguistic usage in all major human languages. Also, Zipf's "principle of least effort" (i.e., speakers use a few words repeatedly, because they are linguistically lazy) has never been verified experimentally.
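
As a minimal sketch of the formula above, the following Python fragment ranks the words of a text by frequency and reports the product f × r, which Zipf's First Law predicts to be roughly constant (k) in a large corpus. The function name zipf_table and the sample sentence are illustrative assumptions, not part of the original lecture material, and a toy sample of this size is far too small to confirm or refute the law.

```python
from collections import Counter
import re

def zipf_table(text):
    """Return (rank, word, frequency, frequency * rank) rows, most frequent word first."""
    # Hypothetical tokenizer: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        # Under f = k/r, the product f * r should stay near the constant k.
        rows.append((rank, word, freq, freq * rank))
    return rows

sample = "the heart and the lung and the liver"
for row in zipf_table(sample):
    print(row)
```

On a real corpus, such as a collection of autopsy reports, one would inspect whether the last column stays within a narrow band over the first few hundred ranks.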

As recently as a few years ago, a humanities professor from a prestigious east-coast university made disparaging remarks to me about Zipf's work. (This wasn't a friendly conversation: I impugned the professor's abilities and discernment as a scientist.)

However, Zipf was right. His basic claim (i.e., Zipf's First Law) has been verified for many major languages, including English, German, and Chinese; as well as for specialized bodies of medical text, including The Frankfurt University Medical Consultation Database and The Johns Hopkins University Autopsy Facesheets. Major internet indexing systems (google.com, yahoo.com) apparently exploit Zipf's First Law, although their exact search algorithms are closely-guarded trade secrets.

Furthermore, as anyone knows who has studied a second language, all beginning (i.e., first-year) textbooks introduce fewer than a thousand words. Even though this is the vocabulary of a preschooler, it is the thousand most-used words in the language, and gets you a pretty good start on ordering dinner or checking into a hotel.

Zipf died at age 48, and did not live to see the incredible growth of interest in his work. See:
http://en.wikipedia.org/wiki/George_Kingsley_Zipf

Marvin Lee Minsky [1927-]. U. S. scientist in the field of artificial intelligence (AI), co-founder of the Laboratory of Artificial Intelligence at the Massachusetts Institute of Technology, and author of several texts on AI and philosophy. He served in the U.S. Navy in 1944-1945. He holds a BA in Mathematics from Harvard (1950) and a PhD in Mathematics from Princeton (1954). He has been on the faculty of the Massachusetts Institute of Technology since 1958. He is currently Toshiba Professor of Media Arts and Sciences and Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology.

Prof. Avram Noam Chomsky [1928-] is Institute Professor Emeritus of linguistics at the Massachusetts Institute of Technology. Chomsky developed the theory of generative grammar, regarded as the most significant contribution to the field of theoretical linguistics of the 20th century. Chomsky established the so-called Chomsky hierarchy, a classification of formal languages in terms of their generative power. Chomsky is also widely known for his political activism, and for his criticism of the foreign policy of the United States and other governments, particularly in the Vietnam War era. See:
http://en.wikipedia.org/wiki/Noam_Chomsky

Lotfi Asker Zadeh [1921-]. The so-called "Pope of Fuzzy Logic", whose 1965 paper introducing fuzzy set theory has been cited over 11,000 times in peer-reviewed journals of mathematics, computer science, or engineering.
See: http://en.wikipedia.org/wiki/Lotfi_Zadeh

William S. Gosset (Student). An employee of the Guinness Brewery in Dublin, Ireland, who wrote the ground-breaking papers in the British journal, Biometrika, about the Student t test. Gosset was a student of Karl Pearson, but because he did his work as a brewery employee with commercial ties, he concealed his identity. His papers were signed, simply, Student. The Guinness Book of World Records was written by the Guinness Brewery as an aid to settle arguments in British bars where Guinness products were served. See:
http://en.wikipedia.org/wiki/William_Sealey_Gossett

Sir Ronald A. Fisher. Greatest British statistician of the twentieth century. Sir Ronald corrected a small error in a formula for variance that had originally been promulgated by Karl Pearson. Fisher proved that the correct formula for the sample variance is: s^2 = (∑_{i=1}^{n} (x_i - x̄)^2)/(n-1), not s^2 = (∑_{i=1}^{n} (x_i - x̄)^2)/n, as Pearson had thought, where x̄ is the sample mean.
Sir Ronald was the scientist who demonstrated statistically that Mendel had probably fudged his data.
The F-test for the analysis of variance is named in honor of Fisher.
However, Sir Ronald sold out to the tobacco industry. When the news first emerged that tobacco use was bad for your health, Fisher defended the tobacco industry by asserting that the cause-effect relationship was not conclusively demonstrated. Fisher developed the concept of CONFOUNDING, in which he argued that tobacco users might have some other mysterious quality that caused them to develop tobacco-related illnesses, apart from the tobacco use. Fisher's prominence in the field of statistics helped the tobacco industry hide from its responsibilities for a number of years. Fisher's assertion was eventually rebutted by the fact that tobacco users who quit experienced a subsequent decrease in tobacco-related illnesses. See:
http://en.wikipedia.org/wiki/Ronald_Fisher
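
The two sample-variance formulas quoted in the Fisher entry above differ only in the divisor. As a small illustration, the following Python sketch computes both for the same data and checks them against the standard library's statistics module. The function names and the data values are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
import statistics

def variance_divide_by_n(xs):
    """Sum of squared deviations from the sample mean, divided by n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def variance_divide_by_n_minus_1(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

# Illustrative data: mean is 5, sum of squared deviations is 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(variance_divide_by_n(data))          # 32 / 8
print(variance_divide_by_n_minus_1(data))  # 32 / 7

# The standard library offers both: pvariance divides by n, variance by n - 1.
print(statistics.pvariance(data))
print(statistics.variance(data))
```

The n - 1 version is always the larger of the two, and the gap shrinks as the sample size grows.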

Huff D.
How to lie with statistics.
New York: W. W. Norton & Company; 1954.
ISBN 0-393-31072-8, 142 pages.
"In the space of one hundred seventy-six years, the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the Old Oölitic Silurian Period, just a million years ago next November, the Lower Mississippi River was upward of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token, any person can see that seven hundred and forty-two years from now, the Lower Mississippi will be only a mile and three-quarters long, and Cairo [Illinois] and New Orleans [Louisiana] will have joined their streets together, and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact."
Cited in: Huff D. How to lie with statistics. New York: W. W. Norton & Company; 1954. ISBN 0-393-31072-8, 142 pages. Page 142.
COMMENT. Mark Twain's classic book, Life on the Mississippi, is the first book in the world ever submitted by an author to a publisher as a typewritten manuscript, in 1883. The inventor of the typewriter was ... Howe, who was born on June 23, 18.. The (mechanical) typewriter was invented in 1868. Source: Garrison Keillor, Author's Corner, Maryland Public Radio, June 23, 2004.

CHAPTER 28.
GLOSSARY.



Estimation. The statistical procedure of determining the best value for a statistical parameter, given sample data.

Random variable. Function X : S -> R, that maps a probability event space into the real line, R.

Expected Value: The average value, E(X), of a random variable over the probability space: E(X) = ∑ x P(X=x), summed over all values x of X.

Variance: The average squared deviation, Var(X), of a random variable from its expected value, over the probability space: Var(X) = E[(X - E(X))^2].
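
The definitions of expected value and variance can be worked through for a small discrete example. The sketch below assumes a fair six-sided die, represented as a dictionary that maps each value x to P(X=x). The die, the dictionary representation, and the function names are illustrative assumptions. Exact fractions are used so that no rounding obscures the result.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X) = sum of x * P(X=x) over all values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E(X))^2], the average squared deviation from E(X)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical example: a fair die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))  # 7/2
print(variance(die))        # 35/12
```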

Hypothesis testing.

Null hypothesis.

Alternative hypothesis.

Set theory: Zermelo-Fraenkel Set Theory (ZFST) is ordinary set theory.

Set: Undefined concepts of ZFST: is-a-member-of or belongs-to ; null-set or empty-set, Ø or {}.

Set: defined exactly by its members, arbitrary order.

Set-of-x is not equal to x: the set {x} is a different object from its member x.

There are no repeat elements in a set.

Set-Roster (extensional, list) notation: set X = {heart, lung, liver, pancreas, ...}.

Set-builder (intensional) notation: O = {x|x is a major-body-organ}.

Set-subset: X ⊆ Y if and only if for every x ∈ X, x ∈ Y.

Set-equality: X = Y if and only if X ⊆ Y and Y ⊆ X.

Set-union: X ∪ Y is the set of all x such that x ∈ X or x ∈ Y or both.

Set-intersection: X ∩ Y is the set of all x such that x ∈ X and x ∈ Y.

Set-subtraction: X - Y is the set of all x such that x ∈ X and x ∉ Y.
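
The set-theory entries above map directly onto Python's built-in set type. The following sketch is illustrative only: the organ names in Y and the helper functions is_subset and sets_equal are assumptions introduced here, written out to mirror the glossary definitions rather than to replace the built-in operators.

```python
X = {"heart", "lung", "liver", "pancreas"}
Y = {"liver", "pancreas", "kidney"}

def is_subset(a, b):
    """Set-subset: every member of a is also a member of b."""
    return all(x in b for x in a)

def sets_equal(a, b):
    """Set-equality: a is a subset of b and b is a subset of a."""
    return is_subset(a, b) and is_subset(b, a)

union = X | Y         # members of X or Y or both
intersection = X & Y  # members of both X and Y
difference = X - Y    # members of X that are not members of Y

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))

# No repeat elements, arbitrary order: these two sets are equal.
print(sets_equal({"heart", "heart", "lung"}, {"lung", "heart"}))  # True

# Set-of-x is not equal to x.
print(frozenset({"heart"}) == "heart")  # False
```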

CHAPTER 29.
ADDITIONAL READINGS.



Campbell JR, Carpenter P, Sneiderman C, Cohn S, Chute CG, Warren J.
Phase II evaluation of clinical coding schemes: completeness, taxonomy, mapping, definitions, and clarity. CPRI Work Group on Codes and Structures.
J Am Med Inform Assoc. 1997 May-Jun;4(3):238-251.
PMID: 9147343.

Humphreys BL, McCray AT, Cheh ML.
Evaluating the coverage of controlled health data terminologies: report on the results of the NLM/AHCPR large scale vocabulary test.
J Am Med Inform Assoc. 1997 Nov-Dec;4(6):484-500.
PMID: 9391936.

Chute CG, Cohn SP, Campbell KE, Oliver DE, Campbell JR.
The content coverage of clinical classifications. For The Computer-Based Patient Record Institute's Work Group on Codes & Structures.
J Am Med Inform Assoc. 1996 May-Jun;3(3):224-233.
PMID: 8723613.

Cimino JJ.
Review paper: coding systems in health care.
Methods Inf Med. 1996 Dec;35(4-5):273-284.
PMID: 9019091.

Campbell JR, Payne TH.
A comparison of four schemes for codification of problem lists.
Proc Annu Symp Comput Appl Med Care. 1994;:201-205.
PMID: 7949920.

Langlotz CP, Caldwell SA.
The completeness of existing lexicons for representing radiology report information.
J Digit Imaging. 2002;15 Suppl 1:201-5. Epub 2002 Mar 21.
PMID: 12105728.

Hales JW, Schoeffler KM, Kessler DP.
Extracting medical knowledge for a coded problem list vocabulary from the UMLS Knowledge Sources.
Proc AMIA Symp. 1998;:275-279.
PMID: 9929225.

Brown PJ, Warmington V, Laurence M, Prevost AT.
Randomised crossover trial comparing the performance of Clinical Terms Version 3 and Read Codes 5 byte set coding schemes in general practice.
BMJ. 2003 May 24;326(7399):1127.
PMID: 12763986.

Elkin PL, Ruggieri AP, Brown SH, Buntrock J, Bauer BA, Wahner-Roedler D, Litin SC, Beinborn J, Bailey KR, Bergstrom L.
A randomized controlled trial of the accuracy of clinical record retrieval using SNOMED-RT as compared with ICD9-CM.
Proc AMIA Symp. 2001;:159-163.
PMID: 11825173.

Mullins HC, Scanland PM, Collins D, Treece L, Petruzzi P Jr, Goodson A, Dickinson M.
The efficacy of SNOMED, Read Codes, and UMLS in coding ambulatory family practice clinical records.
Proc AMIA Annu Fall Symp. 1996;:135-139.
PMID: 8947643.

Bodenreider O, Burgun A, Botti G, Fieschi M, Le Beux P, Kohler F.
Evaluation of the Unified Medical Language System as a medical knowledge source.
J Am Med Inform Assoc. 1998 Jan-Feb;5(1):76-87.
PMID: 9452987.

Campbell KE, Musen MA.
Representation of clinical data using SNOMED III and conceptual graphs.
Proc Annu Symp Comput Appl Med Care. 1992;:354-358.
PMID: 1482897.

Campbell JR.
Semantic features of an enterprise interface terminology for SNOMED RT.
Medinfo. 2001;10(Pt 1):82-85.
PMID: 11604710.

Han SB, Kwak M, Kim S, Yoo S, Park H, Kijoo J, Kim J, Choi M, Choi J.
A comparative study on concept representation between the UMLS and the clinical terms in Korean medical records.
Medinfo. 2004;11(Pt 1):616-620.
PMID: 15360886.

O'Keefe KM, Sievert M, Mitchell JA.
Mendelian inheritance in man: diagnoses in the UMLS.
Proc Annu Symp Comput Appl Med Care. 1993;:735-739.
PMID: 8130573.

Humphreys BL, Hole WT, McCray AT, Fitzmaurice JM.
Planned NLM/AHCPR large-scale vocabulary test: using UMLS technology to determine the extent to which controlled vocabularies cover terminology needed for health care and public health.
J Am Med Inform Assoc. 1996 Jul-Aug;3(4):281-287.
PMID: 8816351.

Han SB, Choi J.
The comparative study on concept representation between the UMLS and the clinical terms in Korean medical records.
Int J Med Inform. 2005 Jan;74(1):67-76.
PMID: 15626637.

Wasserman H, Wang J.
An applied evaluation of SNOMED CT as a clinical vocabulary for the computerized diagnosis and problem list.
AMIA Annu Symp Proc. 2003;:699-703.
PMID: 14728263.