WHITE: 80,524 (65.5%).
3. ANNUAL DISTRIBUTION OF CASES.
YEAR CASES SPECIMENS
1984 14,942 22,112
1985 18,969 29,846
1986 19,046 31,835
1987 19,051 33,133
1988 19,705 35,437
1989 20,253 37,166
1990 21,052 39,274
1991 21,645 40,766
1992 22,003 43,660
1993 21,006 43,180
1994 21,351 43,967
1995 22,139 44,974
1996 23,174 47,576
1997 26,502 54,824
1998 27,528 57,197
1999 29,596 61,531
2000 13,995 27,965
Total 361,957 694,443
4. DISTRIBUTION OF ORGAN SYSTEMS.
ORGAN-SYSTEMS REPRESENTED AMONG THE 361,957 CASES.
ORGAN SYSTEM CASES PERCENT
Gastrointestinal 103,819 28.7%
Lymphoreticular 54,597 15.1%
Gynecologic 50,579 14.0%
Bone 25,578 7.1%
Breast 20,939 5.8%
Dermatologic 20,747 5.7%
Obstetric 19,167 5.3%
Genitourinary 18,916 5.2%
Blood 17,787 4.9%
Marrow 16,576 4.6%
Heart 14,490 4.0%
Lung 8,015 2.2%
CNS 6,320 1.7%
Neuromuscular 4,789 1.3%
Endocrine 3,288 0.9%
5. DE-IDENTIFIERS.
23,911 (6.6%) CASES CONTAINING PROPER-NAMES.
TOKENIZED BY THE DE-IDENTIFICATION SOFTWARE.
ALL NAMES AND ALL MISSPELLINGS
MUST BE CAPTURED AND INCLUDED IN THE DICTIONARY:
DICTIONARY POLICEPERSON.
EPONYMOUS DISEASES ARE MANAGED AS MULTIPLE-WORD TERMS:
DR. BARRETT; BARRETT ESOPHAGUS.
DR. CLARK; CLARK LEVEL.
NO PROTECTION AGAINST A VERY UNUSUAL COMBINATION
OF DISEASES THAT MIGHT BE KNOWN PUBLICLY.
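A MINIMAL SKETCH OF THE EPONYM RULE ABOVE, NOT THE ACTUAL DE-IDENTIFICATION SOFTWARE. THE NAME DICTIONARY 'NAMES', THE TERM LIST 'EPONYMS', THE FUNCTION 'scrub' AND THE REPLACEMENT TOKEN ARE ASSUMPTIONS FOR ILLUSTRATION. MULTIPLE-WORD TERMS ARE CHECKED FIRST, SO BARRETT ESOPHAGUS SURVIVES WHILE DR. BARRETT IS TOKENIZED.

```python
# Hypothetical dictionaries; a real name dictionary would be far larger.
NAMES = {"barrett", "clark"}
EPONYMS = {"barrett esophagus", "clark level"}

def scrub(text, token="[NAME]"):
    """Replace dictionary names with a token, but keep eponymous terms."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        pair = " ".join(w.lower() for w in words[i:i + 2])
        if pair in EPONYMS:
            # Two-word medical term: copy both words through untouched.
            out.extend(words[i:i + 2])
            i += 2
        elif words[i].lower().strip(".,;") in NAMES:
            out.append(token)
            i += 1
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

scrubbed = scrub("Dr. Barrett reviewed the Barrett esophagus biopsy")
```

THE SKETCH DOES NOT ADDRESS MISSPELLED NAMES OR UNUSUAL DISEASE COMBINATIONS, WHICH THE SLIDE LISTS AS OPEN ISSUES.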
6. RAW WORD COUNTS.
9,004,337 WORDS.
27,139 DISTINCT WORDS.
11,550 SINGLY OCCURRING WORDS (HAPAX LEGOMENA).
15,589 MULTIPLY OCCURRING WORDS.
222,175 OCCURRENCES OF WORD 'AND'.
IF MISSPELLINGS ARE CONFINED TO THE 11,550 SINGLY OCCURRING WORDS:
MISSPELLING RATE OF ABOUT 0.1% = 11,550/9,004,337.
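THE COUNTS ABOVE CAN BE DERIVED FROM A SINGLE FREQUENCY TABLE. A MINIMAL SKETCH, ASSUMING WORDS ARE RUNS OF LETTERS AND DIGITS; THE FUNCTION 'word_stats' AND THE SAMPLE TEXT ARE HYPOTHETICAL AND THE ORIGINAL TOKENIZATION RULES ARE NOT KNOWN.

```python
import re
from collections import Counter

def word_stats(text):
    """Return total, distinct, singly and multiply occurring word counts."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    singles = sum(1 for n in counts.values() if n == 1)
    return {
        "total": len(words),
        "distinct": len(counts),
        "hapax": singles,
        "multiple": len(counts) - singles,
    }

# Toy sample with one deliberate misspelling ("inflamation").
sample = "chronic inflammation and chronic gastritis and focal inflamation"
stats = word_stats(sample)

# Hapax words divided by all word occurrences, as in the rate quoted above.
rate = stats["hapax"] / stats["total"]

# The same ratio for the reported corpus figures, as a percentage.
reported_percent = round(11550 / 9004337 * 100, 2)
```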
7. UMLS PART-OF-SPEECH LIST.
EACH WORD ASSIGNED TO SYNTACTIC CATEGORY OR PART-OF-SPEECH:
A=Adjective;
B=adverB;
C=Conjunction (and, or, ...);
D=Determiner (the, this, ...);
H=Helpingverb;
I=Interrogative (who, which, why, how,..., including complementizers);
N=Noun;
P=Preposition (at, by, to, for, from,...);
R=pRonoun (he, she, it, we, they,...);
V=Verb.
POS NAME        POS LETTER   POS DECIMAL   NO. OCCURRENCES
Adjective       A            1             2,187,808
adverB          B            2             204,278
Conjunction     C            16            262,589
Determiner      D            32            164,435
Helpingverb     H            4             174,048
Interrogative   I            8             18,338
Noun            N            128           4,458,102
Preposition     P            256           709,617
pRonoun         R            512           11,824
mainVerb        V            1024          21,205
Adj or Adv      A|B          3             4,586
Adj or Noun     A|N          129           115,075
Adj, Noun, Vb   A|N|V        1153          114,981
Adj or Verb     A|V          1025          259,917
Dtrmr or Int    D|I          40            10,263
Noun or Verb    N|V          1152          275,683
Unassigned                                 11,588
TOTAL                                      9,004,337
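THE DECIMAL CODES IN THE TABLE BEHAVE LIKE BIT FLAGS: EACH COMBINED CATEGORY IS THE SUM OF ITS PARTS (A|N|V = 1 + 128 + 1024 = 1153). A MINIMAL SKETCH UNDER THAT READING; 'POS_BITS', 'encode' AND 'decode' ARE HYPOTHETICAL NAMES, NOT PART OF ANY UMLS TOOL.

```python
# Decimal values copied from the table; each letter occupies one bit.
POS_BITS = {"A": 1, "B": 2, "H": 4, "I": 8, "C": 16, "D": 32,
            "N": 128, "P": 256, "R": 512, "V": 1024}

def encode(letters):
    """Combine letters such as 'A|N|V' into one decimal code."""
    code = 0
    for letter in letters.split("|"):
        code |= POS_BITS[letter]
    return code

def decode(code):
    """List, in alphabetical order, the letters whose bit is set."""
    return "|".join(letter for letter, bit in sorted(POS_BITS.items())
                    if code & bit)

ambiguous = encode("A|N|V")   # adjective, noun or verb
letters = decode(40)          # determiner or interrogative
```

STORING THE CATEGORY AS ONE INTEGER LETS AN AMBIGUOUS WORD BE TESTED FOR ANY SINGLE PART-OF-SPEECH WITH ONE BITWISE AND.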
8. DISCOVERY METHODS.
ZIPF'S LAW: A SMALL NUMBER OF HIGH-FREQUENCY WORDS
ACCOUNTS FOR A LARGE SHARE OF WORD-OCCURRENCES IN A LARGE FREE-TEXT CORPUS.
WORD-RANK (r) INVERSELY PROPORTIONAL TO WORD-FREQUENCY (f):
FOR CONSTANT, k, r = k/f.
APPROXIMATELY ONE HUNDRED BARRIER WORDS:
OVER HALF OF ALL WORD-OCCURRENCES.
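A SMALL SKETCH OF HOW A RANK-FREQUENCY TABLE AND THE SHARE OF OCCURRENCES HELD BY THE TOP-RANKED WORDS COULD BE COMPUTED. THE FUNCTIONS 'rank_frequency' AND 'coverage' AND THE TOY WORD LIST ARE ASSUMPTIONS; THE CORPUS ITSELF IS NOT REPRODUCED HERE.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(words)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def coverage(rows, top_n):
    """Fraction of all word-occurrences covered by the top_n ranked words."""
    total = sum(freq for _, _, freq in rows)
    return sum(freq for _, _, freq in rows[:top_n]) / total

# Toy word list, not taken from the corpus.
toy_words = "and and and and of of tissue colon".split()
rows = rank_frequency(toy_words)

# Share of occurrences held by the two top-ranked words.
share = coverage(rows, 2)
```

APPLIED TO THE FULL CORPUS, 'coverage(rows, 100)' WOULD BE THE QUANTITY THE SLIDE DESCRIBES AS OVER HALF.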
BARRIER WORDS OR STOP WORDS:
COMMONLY-OCCURRING, INCLUDING
ARTICLES, CONJUNCTIONS, INTERROGATIVES, COMPLEMENTIZERS,
PREPOSITIONS, PRONOUNS, AUXILIARY VERBS.
BARRIER WORD METHOD: BARRIER WORDS
ARE SEPARATORS, OR BARRIERS,
BETWEEN MULTIPLE-WORD MEDICAL TERMS.
MULTIPLE-WORD TERMS,
OR COLLOCATIONS,
BOUNDED ON EITHER SIDE BY BARRIER WORDS.
TERMINAL ILEUM , CECUM , APPENDIX and COLON ( RIGHT HEMICOLECTOMY ) ;
MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through
MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement
of PERINEURAL SPACES . TUBULOVILLOUS ADENOMA and associated
VASCULAR MALFORMATION in the TRANSVERSE COLON ; TUBULAR ADENOMA
in the DESCENDING COLON . recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS
and INFLAMED GRANULATION TISSUE in the SEROSA . multiple ADHESIONS
and SEROSAL ABSCESSES with GRANULATION TISSUE , FOREIGN BODY GIANT CELLS ,
SCARRING , focal OSSIFICATION , and FAT NECROSIS . ISCHEMIC BOWEL DISEASE
diffusely involving ILEAL MUCOSA , with focal TRANSMURAL NECROSIS
and ACUTE INFLAMMATION .
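THE METHOD CAN BE SKETCHED AS A SINGLE PASS OVER THE TOKENS. A MINIMAL SKETCH, ASSUMING PUNCTUATION ALSO ACTS AS A BARRIER AND USING ONLY A SMALL SUBSET OF BARRIER WORDS TAKEN FROM THE WORDS SHOWN IN LOWERCASE IN THE EXAMPLE; 'BARRIERS' AND 'collocations' ARE HYPOTHETICAL NAMES.

```python
import re

# Illustrative subset only; the full barrier list is much longer.
BARRIERS = {"and", "of", "with", "in", "the", "into", "through",
            "recent", "multiple", "focal", "diffusely", "involving",
            "associated", "extension", "involvement"}

def collocations(text, barriers=BARRIERS):
    """Split text at barrier words and punctuation; return the runs between."""
    terms, run = [], []
    for token in re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower()):
        if token in barriers or not token.isalnum():
            # A barrier closes the current run of medical-term words.
            if run:
                terms.append(" ".join(run))
            run = []
        else:
            run.append(token)
    if run:
        terms.append(" ".join(run))
    return terms

terms = collocations(
    "TUBULAR ADENOMA in the DESCENDING COLON . "
    "recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS"
)
```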
ZIPF'S LAW FOR GRAMMAR:
BACKUS-NAUR FORMS BEHAVE LIKE WORDS IN TEXT;
A FEW FORMS ACCOUNT FOR MOST OCCURRENCES.
CANONICAL FORM: PREFERRED NOTATION, ENCAPSULATES
ALL EQUIVALENT FORMS OF SAME CONCEPT.
ARBITRARY LEVELS OF RECURSION OF XML TAGS:
<code-section>
<c> ... <c> ... <c> ... </c></c></c>
</code-section>
INTEGRATED CLINICAL DATA WAREHOUSE:
<patient>
<case>
<specimen>
<report-section>
<code-section>
<c> ... <c> ... <c> ... </c></c></c>
</code-section>
</report-section>
</specimen>
</case>
</patient>
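THE NESTING OF CODE TAGS TO ANY DEPTH CAN BE ILLUSTRATED WITH THE PYTHON STANDARD LIBRARY. A MINIMAL SKETCH; THE FUNCTIONS 'nest_codes' AND 'depth' AND THE PLACEHOLDER LABELS ARE ASSUMPTIONS, SINCE THE SLIDE DOES NOT SHOW WHAT THE TAGS CONTAIN.

```python
import xml.etree.ElementTree as ET

def nest_codes(labels):
    """Wrap each label in a <c> tag nested inside the previous one."""
    section = ET.Element("code-section")
    parent = section
    for label in labels:
        parent = ET.SubElement(parent, "c")
        parent.text = label
    return section

def depth(element):
    """Number of nested <c> levels below an element."""
    children = [depth(child) for child in element if child.tag == "c"]
    return 1 + max(children) if children else 0

# Placeholder labels; real content is elided in the slide.
section = nest_codes(["level 1", "level 2", "level 3"])
xml_text = ET.tostring(section, encoding="unicode")
levels = depth(section)
```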
9. BACKUS NAUR PARSING MODEL.
1. [] ==> [Nx]
2. [Nx] ==> [N] Example: HEMANGIOMA.
3. [Nx] ==> [AN] Example: ACTINIC KERATOSIS.
4. [Nx] ==> [NPN] Example: ADENOCARCINOMA OF COLON.
REVERSE BACKUS-NAUR-FORM PARSING:
PARSER BEGINS WITH THE MORE COMPLEX EXPRESSION
AND WORKS BACKWARD TO SIMPLER EXPRESSION.
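READ FROM RIGHT TO LEFT, THE RULES ABOVE BECOME REWRITES THAT COLLAPSE A COMPLEX FORM INTO A SIMPLER ONE. A MINIMAL SKETCH OF THAT IDEA, NOT THE ACTUAL PARSER: 'RULES' AND 'reduce_pattern' ARE HYPOTHETICAL, AND THE SYMBOL Nx IS TREATED AS A PLAIN N FOR SIMPLICITY.

```python
# (complex form, simpler form) pairs taken from rules 3 and 4 of the model.
RULES = [("NPN", "N"), ("AN", "N")]

def reduce_pattern(pattern, rules=RULES):
    """Return the chain of forms from the given pattern to its simplest form."""
    steps = [pattern]
    changed = True
    while changed:
        changed = False
        for complex_form, simple_form in rules:
            if complex_form in pattern:
                # Rewrite the first match only, then start over.
                pattern = pattern.replace(complex_form, simple_form, 1)
                steps.append(pattern)
                changed = True
                break
    return steps

# "skin of left ear" has the part-of-speech pattern NPAN.
chain = reduce_pattern("NPAN")
```

EVERY REWRITE SHORTENS THE PATTERN, SO THE LOOP ALWAYS TERMINATES.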
10. FREQUENCY DISTRIBUTION OF BARRIER WORDS.
RANK FREQUENCY BARRIER WORD
1 222,175 and
2 196,153 of
3 189,799 with
4 107,039 for
5 104,067 the
6 82,104 note
7 80,740 in
8 78,549 right
9 77,885 left
10 70,923 is
11 70,261 see
12 67,917 are
13 53,071 mild
14 49,987 identified
15 47,804 to
16 41,467 consistent
17 39,792 this
18 30,352 present
19 27,189 seen
20 25,371 at
21 25,097 there
22 24,657 on
23 24,284 or
24 23,021 be
25 21,243 associated
26 19,515 was
27 18,376 one
28 16,122 but
29 16,057 case
30 16,057 from
31 16,036 these
32 15,672 show
33 15,396 separate
34 15,135 by
35 13,776 as
36 13,730 an
37 13,542 has
38 13,074 only
39 12,615 shows
40 11,735 portion
41 11,487 involving
42 10,803 two
43 10,718 which
44 10,448 features
45 10,263 that
46 10,119 low
47 10,097 three
11. FREQUENCY DISTRIBUTION OF MULTIPLE-WORD TERMS (COLLOCATIONS).
RANK FREQUENCY COLLOCATION
1 38,401 chronic inflammation
2 20,328 lymph nodes
3 18,428 diff quik
4 16,104 soft tissue
5 14,456 bone marrow
6 13,104 non diagnostic
7 13,021 diagnostic findings
8 13,004 non diagnostic findings
9 12,868 helicobacter pylori
10 12,328 crypt distortion
11 12,316 lymph node
12 12,292 quik stain
13 12,284 diff quik stain
14 11,080 mild chronic
15 10,229 epithelial changes
16 10,004 fibroadipose tissue
17 9,967 non specific
18 9,052 left breast
19 8,893 inflammatory disease
20 8,741 gastroesophageal reflux
21 8,234 gleason grade
22 7,994 squamous metaplasia
23 7,797 tubular adenoma
24 7,237 reactive epithelial
25 6,944 reactive epithelial changes
26 6,793 active chronic
27 6,714 granulation tissue
28 6,634 seminal vesicles
29 6,312 surgical margins
30 6,199 lamina propria
31 6,086 acute rejection
32 6,038 fallopian tube
33 6,019 cell metaplasia
34 5,990 pelvic lymph
35 5,932 chronic gastritis
36 5,928 secretory endometrium
37 5,790 hematopoietic elements
38 5,648 bile reflux
39 5,633 chronic cervicitis
40 5,595 pelvic lymph nodes
41 5,538 no helicobacter
42 5,521 no helicobacter pylori
43 5,422 anti inflammatory
44 5,418 small bowel
45 5,379 type indeterminate
46 5,247 inflammatory drugs
47 5,243 anti inflammatory drugs
48 5,183 chronic inflammatory
49 5,173 hernia sac
50 5,003 antral mucosa
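THE TABLE LISTS A TERM AND ITS SUB-PHRASES SEPARATELY (DIFF QUIK, QUIK STAIN, DIFF QUIK STAIN). ONE WAY SUCH COUNTS COULD ARISE IS BY COUNTING EVERY RUN OF TWO OR MORE ADJACENT WORDS INSIDE EACH BARRIER-BOUNDED TERM. A SKETCH UNDER THAT ASSUMPTION; 'count_subphrases' AND THE TOY TERM LIST ARE HYPOTHETICAL AND THE ORIGINAL COUNTING RULE IS NOT STATED.

```python
from collections import Counter

def count_subphrases(terms, max_words=3):
    """Count every run of 2..max_words adjacent words inside each term."""
    counts = Counter()
    for term in terms:
        words = term.split()
        for size in range(2, max_words + 1):
            for start in range(len(words) - size + 1):
                counts[" ".join(words[start:start + size])] += 1
    return counts

# Toy input: terms as they might come out of the barrier word method.
toy_terms = ["diff quik stain", "diff quik", "lymph nodes", "pelvic lymph nodes"]
counts = count_subphrases(toy_terms)
```

UNDER THIS RULE A TWO-WORD PHRASE IS ALWAYS AT LEAST AS FREQUENT AS ANY THREE-WORD TERM THAT CONTAINS IT, WHICH AGREES WITH THE ORDERING SEEN IN THE TABLE.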
12. ZIPF GRAMMAR.
RANK FREQUENCY SENTENCE-PATTERN EXAMPLE
1 423,177 [N] hemangioma
2 106,034 [N[N]] liver [needle]
3 98,958 [AN] left foot
4 85,908 [N|V] scar
5 79,741 [NN|V] skin scar
6 62,042 [AAN] epidermal inclusion cyst
7 50,461 [AN[N]] laryngeal mass [biopsy]
8 41,958 [NCN] decidua and villi
9 38,689 [A|NPN] negative for actinomyces
10 26,745 [N[NPN]] cervix [biopsy at 9:00]
11 22,097 [N[NN]] cervix [biopsy 9:00]
12 21,704 [NPAN] skin of left ear
13 21,102 [NN] ear lobe
14 20,638 [BAN] non diagnostic findings
15 16,864 [AAN[N]] left chest wall [biopsy]
16 13,674 [AAAN] left axillary soft tissue
17 12,798 [NCAN[N]] skin , left flank [biopsy]
18 12,692 [ANCAN] soft tissue , inguinal region
19 12,596 [ANPAAN] fibrous plaque from left carotid artery
20 12,507 [N[N]ANCA|VANCA|NPN] leg [ bka ] old thrombus and calcified atherosclerotic plaque , negative for osteomyelitis
21 12,136 [BAANHA|VPAAN] no helicobacter pylori organisms are identified on diff quik stain
22 12,097 [AAAN[N]] left true vocal cord [biopsy]
23 10,555 [ANPAN] soft tissue of right wrist
24 10,257 [A|NPNCN] negative for fungi or afb
25 9,952 [ANN] left ear lobe
26 9,650 [N[AN]] colon [biopsy left]
27 9,533 [N[NCN]] cervix [biopsy , 9:00]
28 8,732 [ANN[N]] left ear lobe [biopsy]
29 8,239 [NAN] skin right ear
30 7,937 [A] void
31 7,550 [NPN] biopsy at 9:00
32 6,862 [NCN[N]] skin, face [biopsy]
33 6,675 [ANPN] left head of femur
34 6,121 [NCNCN] placenta, membranes and cord
35 6,111 [AN[NN]] left breast [core biopsy]
36 6,096 [N[NPA]] colon [biopsy of left]
37 5,728 [N[NA]] colon [biopsy right]
38 5,685 [N[NPAN]] colon [biopsy of right colon]
39 5,650 [ANCAN[N]] soft tissue , left chest [excision]
40 5,422 [DNHA|VPDAAN] this case was shown at the quality assurance conference
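A SENTENCE-PATTERN IS THE SEQUENCE OF PART-OF-SPEECH LETTERS OF ITS WORDS. A MINIMAL SKETCH OF THAT MAPPING; 'LEXICON' IS A TOY SUBSET WITH LETTERS INFERRED FROM THE EXAMPLES IN THE TABLE, 'sentence_pattern' IS A HYPOTHETICAL NAME, AND BRACKETED QUALIFIERS SUCH AS [biopsy] ARE NOT HANDLED.

```python
# Toy lexicon; letters follow the part-of-speech legend of this document.
LEXICON = {
    "left": "A", "ear": "N", "lobe": "N", "skin": "N",
    "scar": "N|V", "of": "P", "and": "C",
    "decidua": "N", "villi": "N",
}

def sentence_pattern(phrase, lexicon=LEXICON):
    """Concatenate the part-of-speech letters of each word; '?' if unknown."""
    letters = [lexicon.get(word, "?") for word in phrase.lower().split()]
    return "[" + "".join(letters) + "]"

pattern = sentence_pattern("skin of left ear")
```

COUNTING THE PATTERNS RETURNED FOR EVERY SENTENCE WOULD GIVE A RANKED LIST OF THE KIND SHOWN IN THE TABLE.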
13. ZIPF DISTRIBUTION OF BACKUS NAUR FORMS.
RANK FREQUENCY BNF FORMULA EXAMPLE
1 689,478 [N] ==> [] [prostate]
2 313,234 [AN] ==> [] [actinic keratosis]
3 117,039 [AAN] ==> [] [hypertrophic actinic keratosis]
4 86,762 [N|V] ==> [] [scar]
5 80,127 [NN|V] ==> [] [skin scar]
6 66,816 [NAN] ==> [] [skin soft tissue]
7 60,129 [NCN] ==> [] [decidua and villi]
8 55,728 [AN] ==> [N] [actinic KERATOSIS]
9 52,777 [A|N] ==> [] [negative]
10 47,375 [NN] ==> [] [granulation tissue]
11 47,139 [A] ==> [] [void]
12 42,661 [NPN] ==> [] [adenocarcinoma of colon]
13 36,076 [AAAN] ==> [] [focal bowenoid actinic keratosis]
14 31,946 [NPAN] ==> [] [skin with actinic keratosis]
15 25,168 [BAN] ==> [] [focally invasive tumor]
16 22,761 [NCAN] ==> [] [ulcer and acute inflammation]
17 22,276 [ANN] ==> [] [exuberant granulation tissue]
18 16,791 [NN] ==> [N] [lung CARCINOMA]
19 15,577 [NAPN] ==> [] [carcinoma metastatic to lung]
20 13,764 [NNN] ==> [] [liver gallbladder pancreas]
21 13,212 [AAN] ==> [N] [hypertrophic actinic KERATOSIS]
22 12,417 [BAN] ==> [BN] [FOCALLY active GASTRITIS]
23 12,053 [NCN] ==> [N] [decidua and VILLI]
14. DATABASE DEFINITION.