JUNE 1, 2000: DEMOGRAPHIC AND LINGUISTIC CONTENT
OF JOHNS HOPKINS SURGICAL PATHOLOGY (JH-SP) DATABASE.
AGE              FEMALE    MALE  UNKNOWN    TOTAL  % PATIENTS
0-9 years         3,763   5,906       13    9,682        6.1%
10-19 years       6,409   2,841        7    9,257        5.8%
20-29 years      17,318   3,341       16   20,675       13.0%
30-39 years      18,743   5,618       13   24,374       15.3%
40-49 years      15,149   7,405       26   22,580       14.2%
50-59 years      12,269  11,057       18   23,344       14.7%
60-69 years      10,873  14,501       26   25,400       16.0%
70-79 years       8,198   9,377       14   17,589       11.1%
80-89 years       2,650   2,239        9    4,898        3.1%
90-99 years         225     121        0      346        0.2%
100-109 years         5       2        0        7        0.0%
Unknown age                          919      919        0.6%
Total            95,602  62,408    1,061  159,071      100.1%
% of total        60.1%   39.2%     0.7%
3. DISTRIBUTION OF ORGAN SYSTEMS.
ORGAN-SYSTEMS REPRESENTED AMONG 361,957 CASES:
ORGAN SYSTEM CASES PERCENT
Gastrointestinal 103,819 28.7%
Lymphoreticular 54,597 15.1%
Gynecologic 50,579 14.0%
Bone 25,578 7.1%
Breast 20,939 5.8%
Dermatologic 20,747 5.7%
Obstetric 19,167 5.3%
Genitourinary 18,916 5.2%
Blood 17,787 4.9%
Marrow 16,576 4.6%
Heart 14,490 4.0%
Lung 8,015 2.2%
CNS 6,320 1.7%
Neuromuscular 4,789 1.3%
Endocrine 3,288 0.9%
4. RAW WORD COUNTS.
9,004,337 WORDS.
27,139 DISTINCT WORDS.
11,550 SINGLY OCCURRING WORDS (HAPAX LEGOMENA).
15,589 MULTIPLY OCCURRING WORDS.
222,175 OCCURRENCES OF WORD 'AND'.
MISSPELLINGS ASSUMED TO BE CONFINED TO THE 11,550 SINGLY OCCURRING WORDS.
MISSPELLING RATE OF 0.1% = 11,550/9,004,337.
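The counts on this slide can be reproduced from a token list with a frequency table. The following is a minimal sketch, not the original software: the function name `word_statistics` and the toy sentence are assumptions for illustration, and the final line only re-does the slide's arithmetic.

```python
from collections import Counter

def word_statistics(words):
    """Summarise a token list: total, distinct, hapax legomena, and hapax rate."""
    counts = Counter(words)
    total = sum(counts.values())
    distinct = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "total": total,
        "distinct": distinct,
        "hapax": hapax,
        "multiple": distinct - hapax,
        "hapax_rate": hapax / total if total else 0.0,
    }

# Toy input (hypothetical): "tumour" plays the role of a singly occurring variant.
stats = word_statistics("no tumor seen and no tumour identified and".split())

# The slide's own figures: 11,550 hapax legomena among 9,004,337 words.
slide_rate_percent = 11_550 / 9_004_337 * 100   # about 0.128, reported as 0.1%
```

Treating every singly occurring word as a misspelling gives a rough ceiling rather than a measured error rate, since rare but correct terms are hapax legomena too.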
5. UMLS PART-OF-SPEECH LIST.
EACH WORD ASSIGNED TO SYNTACTIC CATEGORY OR PART-OF-SPEECH:
A=Adjective;
B=adverB;
C=Conjunction (and, or, ...);
D=Determiner (the, this, ...);
H=Helpingverb;
I=Interrogative (who, which, why, how,..., including complementizers);
N=Noun;
P=Preposition (at, by, to, for, from,...);
R=pRonoun (he, she, it, we, they,...);
V=Verb.
POS NAME        POS LETTER  POS DECIMAL  NO. OCCURRENCES
Adjective       A                     1        2,187,808
adverB          B                     2          204,278
Conjunction     C                    16          262,589
Determiner      D                    32          164,435
Helpingverb     H                     4          174,048
Interrogative   I                     8           18,338
Noun            N                   128        4,458,102
Preposition     P                   256          709,617
pRonoun         R                   512           11,824
mainVerb        V                  1024           21,205
Adj or Adv      A|B                   3            4,586
Adj or Noun     A|N                 129          115,075
Adj, Noun, Vb   A|N|V              1153          114,981
Adj or Verb     A|V                1025          259,917
Dtrmr or Int    D|I                  40           10,263
Noun or Verb    N|V                1152          275,683
Unassigned                                        11,588
TOTAL                                          9,004,337
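The POS DECIMAL column behaves like a set of bit flags: each single category is a power of two, and an ambiguous word carries the sum of its categories (A|N|V = 1 + 128 + 1024 = 1153). A minimal sketch of that encoding follows; the helper names `encode` and `decode` are assumptions, and only the letter-to-number values come from the table.

```python
# Single-category values copied from the table; 64 does not appear there.
POS_BITS = {
    "A": 1, "B": 2, "H": 4, "I": 8, "C": 16,
    "D": 32, "N": 128, "P": 256, "R": 512, "V": 1024,
}

def encode(tags):
    """Turn a tag string such as 'A|N|V' into its decimal code."""
    value = 0
    for letter in tags.split("|"):
        value |= POS_BITS[letter]
    return value

def decode(value):
    """Turn a decimal code back into a tag string, letters in alphabetical order."""
    return "|".join(letter for letter in sorted(POS_BITS) if value & POS_BITS[letter])

codes = {tags: encode(tags) for tags in ("A|B", "A|N", "A|N|V", "A|V", "D|I", "N|V")}
```

With this layout, testing whether a word may act as a noun is a single bitwise AND against 128.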
6. HISTORY OF COMPUTATIONAL LINGUISTICS.
NOAM CHOMSKY'S THESIS ON THE LINGUISTICS OF HEBREW.
ALL HUMAN LANGUAGES MAY BE REPRESENTED AS A SET OF PRODUCTION RULES.
BACKUS NAUR FORM (BNF) ORIGINALLY USED TO DESCRIBE COMPUTER LANGUAGES.
SCANT ATTENTION TO QUANTITATIVE AND STATISTICAL BEHAVIOR OF TECHNICAL PROSE.
SURGICAL PATHOLOGY FREE-TEXT: RESTRICTED VOCABULARY;
INTENTIONALLY UNAMBIGUOUS.
7. DISCOVERY METHODS.
ZIPF'S LAW: A FEW HIGH-FREQUENCY WORDS
ACCOUNT FOR MOST WORD-OCCURRENCES IN A LARGE FREE-TEXT CORPUS.
WORD-RANK (r) INVERSELY PROPORTIONAL TO WORD-FREQUENCY (f):
FOR CONSTANT, k, r = k/f.
APPROXIMATELY ONE HUNDRED BARRIER WORDS:
OVER HALF OF ALL WORD-OCCURRENCES.
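Both statements can be checked against any token list: rank the words by frequency, form the product r × f (the constant k if Zipf's law held exactly), and measure how much of the corpus the top-ranked words cover. The sketch below is illustrative only; the function names and the toy input are assumptions.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

def coverage(words, top_n):
    """Fraction of all word-occurrences accounted for by the top_n most frequent words."""
    counts = Counter(words)
    total = sum(counts.values())
    top = sum(freq for _, freq in counts.most_common(top_n))
    return top / total if total else 0.0

toy = "a a a a b b c".split()      # hypothetical corpus
rows = rank_frequency(toy)         # the last field of each row estimates k
share = coverage(toy, 1)           # share of occurrences taken by the top word
```

On real text the product r × f is only roughly constant, so it is best read as a diagnostic rather than a fitted parameter.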
BARRIER WORDS OR STOP WORDS:
COMMONLY-OCCURRING, INCLUDING
ARTICLES, CONJUNCTIONS, INTERROGATIVES, COMPLEMENTIZERS,
PREPOSITIONS, PRONOUNS, AUXILIARY VERBS.
BARRIER WORD METHOD: BARRIER WORDS
ARE SEPARATORS, OR BARRIERS,
BETWEEN MULTIPLE-WORD MEDICAL TERMS.
MULTIPLE-WORD TERMS,
OR COLLOCATIONS,
BOUNDED ON EITHER SIDE BY BARRIER WORDS.
TERMINAL ILEUM , CECUM , APPENDIX and COLON ( RIGHT HEMICOLECTOMY ) ;
MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through
MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement
of PERINEURAL SPACES . TUBULOVILLOUS ADENOMA and associated
VASCULAR MALFORMATION in the TRANSVERSE COLON ; TUBULAR ADENOMA
in the DESCENDING COLON . recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS
and INFLAMED GRANULATION TISSUE in the SEROSA . multiple ADHESIONS
and SEROSAL ABSCESSES with GRANULATION TISSUE , FOREIGN BODY GIANT CELLS ,
SCARRING , focal OSSIFICATION , and FAT NECROSIS . ISCHEMIC BOWEL DISEASE
diffusely involving ILEAL MUCOSA , with focal TRANSMURAL NECROSIS
and ACUTE INFLAMMATION .
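The example above, with barrier words in lower case and candidate terms in upper case, can be approximated by splitting the text at every barrier word or punctuation mark. This is a minimal sketch under stated assumptions: `BARRIER_WORDS` is a small hand-picked subset taken from the lower-case words in the example, not the actual barrier list, and `barrier_terms` is a hypothetical helper name.

```python
import re

# Illustrative subset only; the real list has roughly one hundred entries.
BARRIER_WORDS = {
    "and", "with", "through", "into", "of", "in", "the", "associated",
    "recent", "multiple", "focal", "diffusely", "involving",
    "extension", "involvement",
}

def barrier_terms(text, barriers=BARRIER_WORDS):
    """Return the word runs that lie between barrier words or punctuation."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    terms, current = [], []
    for token in tokens:
        if token in barriers or not token.isalnum():
            if current:                      # a barrier closes the current term
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:                              # flush a term that ends the text
        terms.append(" ".join(current))
    return terms

terms = barrier_terms("TUBULAR ADENOMA in the DESCENDING COLON .")
# terms == ["tubular adenoma", "descending colon"]
```

The method needs no medical dictionary: whatever survives between barriers is treated as a candidate term, so its quality depends entirely on the barrier list.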
ZIPF'S LAW FOR GRAMMAR:
BACKUS NAUR FORMS FOLLOW A RANK-FREQUENCY DISTRIBUTION AKIN TO WORDS IN TEXT.
CANONICAL FORM: PREFERRED NOTATION, ENCAPSULATES
ALL EQUIVALENT FORMS OF SAME CONCEPT.
ARBITRARY LEVELS OF RECURSION OF XML TAGS:
<code-section>
<c> ... <c> ... <c> ... </c></c></c>
</code-section>
INTEGRATED CLINICAL DATA WAREHOUSE:
<patient>
  <case>
    <specimen>
      <report-section>
        <code-section>
          <c> ... <c> ... <c> ... </c></c></c>
        </code-section>
      </report-section>
    </specimen>
  </case>
</patient>
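A nested record of this shape can be generated with Python's standard `xml.etree.ElementTree` module. The sketch below is an illustration, not the warehouse's actual schema: the tag names come from the slide, while the helper names and the sample terms placed inside the `<c>` elements are assumptions.

```python
import xml.etree.ElementTree as ET

def nest_codes(parent, terms):
    """Add one <c> element per term, each nested inside the previous one."""
    node = parent
    for term in terms:
        node = ET.SubElement(node, "c")
        node.text = term
    return parent

def build_record(terms):
    """Build patient > case > specimen > report-section > code-section > nested <c>."""
    patient = ET.Element("patient")
    node = patient
    for tag in ("case", "specimen", "report-section", "code-section"):
        node = ET.SubElement(node, tag)
    nest_codes(node, terms)
    return patient

# Hypothetical coded terms; the slide leaves the <c> contents as "...".
record = build_record(["colon", "adenoma", "tubular"])
xml_text = ET.tostring(record, encoding="unicode")
```

Because `nest_codes` loops over a list of any length, the recursion depth of the `<c>` tags is not fixed in advance.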
8. BACKUS NAUR PARSING MODEL.
1. [] ==> [Nx]
2. [Nx] ==> [N] Example: HEMANGIOMA.
3. [Nx] ==> [AN] Example: ACTINIC KERATOSIS.
4. [Nx] ==> [NPN] Example: ADENOCARCINOMA OF COLON.
REVERSE BACKUS-NAUR-FORM PARSING:
PARSER BEGINS WITH THE MORE COMPLEX EXPRESSION
AND WORKS BACKWARD TO SIMPLER EXPRESSION.
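One possible reading of reverse parsing is sketched below: start from the full part-of-speech pattern of a phrase and rewrite it backward through rules 1-4 until the start symbol [] is reached. This is an assumption-laden illustration, not the prototype parser: it matches whole patterns only, handles single-letter tags without ambiguity (no A|N), and the names `RULES` and `reverse_parse` are hypothetical.

```python
# Rules 1-4 from the slide, read right to left (right-hand side -> left-hand side).
# The empty tuple stands for the start symbol [].
RULES = {
    ("Nx",): (),
    ("N",): ("Nx",),
    ("A", "N"): ("Nx",),
    ("N", "P", "N"): ("Nx",),
}

def show(symbols):
    """Render a tuple of symbols in the slide's bracket notation."""
    return "[" + "".join(symbols) + "]"

def reverse_parse(pattern):
    """Reduce a pattern such as 'AN' toward []; return (steps, success)."""
    current = tuple(pattern)
    steps = [show(current)]
    while current:
        if current not in RULES:
            return steps, False        # no rule applies: unsuccessful parse
        current = RULES[current]
        steps.append(show(current))
    return steps, True

steps, ok = reverse_parse("NPN")       # e.g. ADENOCARCINOMA OF COLON
# steps == ["[NPN]", "[Nx]", "[]"], ok is True
```

A pattern with no matching rule, such as BAN under these four rules, is returned with `success` set to False, which is how a parse-success rate could be tallied.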
9. FREQUENCY DISTRIBUTION OF BARRIER WORDS.
RANK FREQUENCY BARRIER WORD
1 222,175 and
2 196,153 of
3 189,799 with
4 107,039 for
5 104,067 the
6 82,104 note
7 80,740 in
8 78,549 right
9 77,885 left
10 70,923 is
11 70,261 see
12 67,917 are
13 53,071 mild
14 49,987 identified
15 47,804 to
16 41,467 consistent
17 39,792 this
18 30,352 present
19 27,189 seen
20 25,371 at
21 25,097 there
22 24,657 on
23 24,284 or
24 23,021 be
25 21,243 associated
26 19,515 was
27 18,376 one
28 16,122 but
29 16,057 case
30 16,057 from
31 16,036 these
32 15,672 show
33 15,396 separate
34 15,135 by
35 13,776 as
36 13,730 an
37 13,542 has
38 13,074 only
39 12,615 shows
40 11,735 portion
41 11,487 involving
42 10,803 two
43 10,718 which
44 10,448 features
45 10,263 that
46 10,119 low
47 10,097 three
10. FREQUENCY DISTRIBUTION
OF MULTIPLE-WORD TERMS (COLLOCATIONS).
RANK FREQUENCY COLLOCATION
1 38,401 chronic inflammation
2 20,328 lymph nodes
3 18,428 diff quik
4 16,104 soft tissue
5 14,456 bone marrow
6 13,104 non diagnostic
7 13,021 diagnostic findings
8 13,004 non diagnostic findings
9 12,868 helicobacter pylori
10 12,328 crypt distortion
11 12,316 lymph node
12 12,292 quik stain
13 12,284 diff quik stain
14 11,080 mild chronic
15 10,229 epithelial changes
16 10,004 fibroadipose tissue
17 9,967 non specific
18 9,052 left breast
19 8,893 inflammatory disease
20 8,741 gastroesophageal reflux
21 8,234 gleason grade
22 7,994 squamous metaplasia
23 7,797 tubular adenoma
24 7,237 reactive epithelial
25 6,944 reactive epithelial changes
26 6,793 active chronic
27 6,714 granulation tissue
28 6,634 seminal vesicles
29 6,312 surgical margins
30 6,199 lamina propria
31 6,086 acute rejection
32 6,038 fallopian tube
33 6,019 cell metaplasia
34 5,990 pelvic lymph
35 5,932 chronic gastritis
36 5,928 secretory endometrium
37 5,790 hematopoietic elements
38 5,648 bile reflux
39 5,633 chronic cervicitis
40 5,595 pelvic lymph nodes
41 5,538 no helicobacter
42 5,521 no helicobacter pylori
43 5,422 anti inflammatory
44 5,418 small bowel
45 5,379 type indeterminate
46 5,247 inflammatory drugs
47 5,243 anti inflammatory drugs
48 5,183 chronic inflammatory
49 5,173 hernia sac
50 5,003 antral mucosa
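The table lists overlapping entries such as "diff quik", "quik stain" and "diff quik stain", which suggests that every contiguous run of two or more words inside a barrier-bounded term is counted. The sketch below follows that reading; it is an assumption about the counting scheme, and the function names and sample terms are illustrative.

```python
from collections import Counter

def collocations(term, min_len=2):
    """Yield every contiguous run of at least min_len words within a term."""
    words = term.split()
    for n in range(min_len, len(words) + 1):
        for start in range(len(words) - n + 1):
            yield " ".join(words[start:start + n])

def count_collocations(terms):
    """Tally collocations over a list of multiple-word terms."""
    counts = Counter()
    for term in terms:
        counts.update(collocations(term))
    return counts

# Hypothetical terms, as they might come out of the barrier word method.
counts = count_collocations(["diff quik stain", "diff quik", "lymph nodes"])
ranking = counts.most_common()
```

Under this scheme a sub-phrase can never be rarer than the longer phrase containing it, which matches the ordering of "non diagnostic" and "non diagnostic findings" in the table.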
11. ZIPF GRAMMAR.
RANK FREQUENCY SENTENCE-PATTERN EXAMPLE
1 423,177 [N] hemangioma
2 106,034 [N[N]] liver [needle]
3 98,958 [AN] left foot
4 85,908 [N|V] scar
5 79,741 [NN|V] skin scar
6 62,042 [AAN] epidermal inclusion cyst
7 50,461 [AN[N]] laryngeal mass [biopsy]
8 41,958 [NCN] decidua and villi
9 38,689 [A|NPN] negative for actinomyces
10 26,745 [N[NPN]] cervix [biopsy at 9:00]
11 22,097 [N[NN]] cervix [biopsy 9:00]
12 21,704 [NPAN] skin of left ear
13 21,102 [NN] ear lobe
14 20,638 [BAN] non diagnostic findings
15 16,864 [AAN[N]] left chest wall [biopsy]
16 13,674 [AAAN] left axillary soft tissue
17 12,798 [NCAN[N]] skin , left flank [biopsy]
18 12,692 [ANCAN] soft tissue , inguinal region
19 12,596 [ANPAAN] fibrous plaque from left carotid artery
20 12,507 [N[N]ANCA|VANCA|NPN] leg [ bka ] old thrombus and calcified atherosclerotic plaque , negative for osteomyelitis
21 12,136 [BAANHA|VPAAN] no helicobacter pylori organisms are identified on diff quik stain
22 12,097 [AAAN[N]] left true vocal cord [biopsy]
23 10,555 [ANPAN] soft tissue of right wrist
24 10,257 [A|NPNCN] negative for fungi or afb
25 9,952 [ANN] left ear lobe
26 9,650 [N[AN]] colon [biopsy left]
27 9,533 [N[NCN]] cervix [biopsy , 9:00]
28 8,732 [ANN[N]] left ear lobe [biopsy]
29 8,239 [NAN] skin right ear
30 7,937 [A] void
31 7,550 [NPN] biopsy at 9:00
32 6,862 [NCN[N]] skin, face [biopsy]
33 6,675 [ANPN] left head of femur
34 6,121 [NCNCN] placenta, membranes and cord
35 6,111 [AN[NN]] left breast [core biopsy]
36 6,096 [N[NPA]] colon [biopsy of left]
37 5,728 [N[NA]] colon [biopsy right]
38 5,685 [N[NPAN]] colon [biopsy of right colon]
39 5,650 [ANCAN[N]] soft tissue , left chest [excision]
40 5,422 [DNHA|VPDAAN] this case was shown at the quality assurance conference
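A sentence pattern of this kind can be produced by looking up each word's part-of-speech letters and concatenating them, then ranking the patterns by frequency. The sketch below is a simplified illustration: `LEXICON` is a tiny hypothetical stand-in for the UMLS-derived word list, nested brackets for specimen qualifiers such as [biopsy] are not handled, and unknown words are marked "?".

```python
from collections import Counter

# Hypothetical lexicon; tags follow the examples in the table above.
LEXICON = {
    "left": "A", "foot": "N", "skin": "N", "scar": "N|V",
    "decidua": "N", "and": "C", "villi": "N",
    "negative": "A|N", "for": "P", "actinomyces": "N",
}

def sentence_pattern(text, lexicon=LEXICON, unknown="?"):
    """Concatenate the part-of-speech letters of each word into a bracketed pattern."""
    letters = (lexicon.get(word, unknown) for word in text.lower().split())
    return "[" + "".join(letters) + "]"

def rank_patterns(sentences, lexicon=LEXICON):
    """Return (pattern, frequency) pairs, most frequent first."""
    return Counter(sentence_pattern(s, lexicon) for s in sentences).most_common()

patterns = [sentence_pattern(s) for s in
            ("left foot", "skin scar", "decidua and villi", "negative for actinomyces")]
# patterns == ["[AN]", "[NN|V]", "[NCN]", "[A|NPN]"]
```

Ambiguous words keep their combined tag (for example N|V), so ambiguity is carried into the pattern rather than resolved at lookup time.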
12. ZIPF DISTRIBUTION OF BACKUS NAUR FORMS.
RANK FREQUENCY BNF FORMULA EXAMPLE
1 689,478 [N] ==> [] [prostate]
2 313,234 [AN] ==> [] [actinic keratosis]
3 117,039 [AAN] ==> [] [hypertrophic actinic keratosis]
4 86,762 [N|V] ==> [] [scar]
5 80,127 [NN|V] ==> [] [skin scar]
6 66,816 [NAN] ==> [] [skin soft tissue]
7 60,129 [NCN] ==> [] [decidua and villi]
8 55,728 [AN ==> [N [actinic KERATOSIS
9 52,777 [A|N] ==> [] [negative]
10 47,375 [NN] ==> [] [granulation tissue]
11 47,139 [A] ==> [] [void]
12 42,661 [NPN] ==> [] [adenocarcinoma of colon]
13 36,076 [AAAN] ==> [] [focal bowenoid actinic keratosis]
14 31,946 [NPAN] ==> [] [skin with actinic keratosis]
15 25,168 [BAN] ==> [] [focally invasive tumor]
16 22,761 [NCAN] ==> [] [ulcer and acute inflammation]
17 22,276 [ANN] ==> [] [exuberant granulation tissue]
18 16,791 [NN ==> [N [lung CARCINOMA
19 15,577 [NAPN] ==> [] [carcinoma metastatic to lung]
20 13,764 [NNN] ==> [] [liver gallbladder pancreas]
21 13,212 [AAN ==> [N [hypertrophic actinic KERATOSIS
22 12,417 [BAN ==> [BN [FOCALLY active GASTRITIS
23 12,053 [NCN ==> [N [decidua and VILLI
13. CONCLUSION.
ENCODER PARSES COHERENT SURGICAL-PATHOLOGY FREE-TEXT.
TARGET: STANDARDIZED CODING LANGUAGE (UMLS), FORMATTED AS XML.
PROTOTYPE PARSER: 82.8% SUCCESSFUL PARSES.
14. REFERENCES.
1. Aitchison J.
Teach Yourself Linguistics.
Fifth Edition.
Chicago: NTC/Contemporary Publishing Co. 2000.
ISBN: 0844226688.
2. Bundy A (Ed).
Artificial Intelligence Techniques:
A Comprehensive Catalogue.
Fourth, Revised Edition.
Heidelberg: Springer Verlag. 1997.
ISBN: 3540593233.
3. Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: The MIT Press. 1965.
4. Giere W.
Foundations of clinical data automation in cooperative programs.
Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.
5. Manning CD, Schuetze H.
Foundations of Statistical Natural Language Processing.
Cambridge, MA: The MIT Press. 2000. ISBN: 0262133601.
6. Masarie FE Jr, Miller RA, Bouhaddou O, Giuse NB, Warner HR.
An Interlingua for Electronic Interchange of Medical Information:
Using Frames to Map Between Clinical Vocabularies.
Comp Biomed Res 1991; 24(4):379-400.
7. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
Integrated anatomic pathology reporting system
using natural language diagnoses.
Modern Pathol 1988;1:44-50.
8. Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases
identified on first encounter using the Barrier Word Method.
In: Scherrer JR, Cote RA, Mandil SH, eds.
Computerized Natural Medical Language Processing
for Knowledge Representation. North-Holland. 1989;29-39.
9. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype internet autopsy database:
1625 consecutive fetal and neonatal autopsy facesheets
spanning twenty years.
Arch Pathol Lab Med. 1996;120:782-785.
10. Moore GW, Berman JJ.
Anatomic Pathology Data Mining.
In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery.
Heidelberg: Springer Verlag. 2000 (in press).
11. Nagao M.
Machine Translation.
In: Shapiro SC, ed. Encyclopedia of Artificial Intelligence.
Volume 2. M-Z. New York: Wiley-Interscience. 1992;:898-902.
12. Nelson SJ, Olson NE, Fuller L, Tuttle MS, Cole WG, Sherertz DD.
Identifying concepts in medical knowledge.
Medinfo. 1995;8:33-36.
13. Newmeyer FJ.
Generative Linguistics: A Historical Perspective.
London: Routledge. 1996.
14. Salton G, Buckley C.
Global text matching for information retrieval.
Science. 1991;253:1012-1015.
15. Taylor M, Saltz J, Nichols JH.
Design of an Integrated Clinical Data Warehouse.
J Assn Lab Automation. 2000. in press.
16. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.
17. Tymoczko T (ed.).
New Directions in the Philosophy of Mathematics.
Princeton, NJ: Princeton University Press. 1998.
18. U.S. National Library of Medicine.
Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/
19. U.S. National Library of Medicine.
UMLS Knowledge Sources.
Eleventh Edition.
Unified Medical Language System.
U.S. Department of Health and Human Services.
National Institutes of Health.
National Library of Medicine. 2000.
20. Wilbur WJ.
Overview of Books at NCBI.
http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link
21. Zipf GK.
Human Behavior and The Principle of Least Effort.
An Introduction to Human Ecology.
Reading, MA:
Addison-Wesley Press. 1949;:19-55.
Last Revised: 10/22/2000 by G. William Moore.