GOETHE UNIVERSITY AUTOPSY REGISTER:
ANONYMIZED BILINGUAL DATABASE.
W. Giere, MD. [1],
G. William Moore, MD, PhD [2,3,4].
Grover M. Hutchins, MD [3].
From: Center for Medical Informatics,
J. W. von Goethe University, Frankfurt, Germany [1].
Pathology and Laboratory Medicine Service (113),
Baltimore VA Maryland Health Care System, Baltimore, MD [2].
Department of Pathology, The Johns Hopkins Medical Institutions,
Baltimore, MD [3].
Department of Pathology,
University of Maryland School of Medicine, Baltimore, MD [4].
TABLE OF CONTENTS.
1. ABSTRACT.
2. INTRODUCTION.
3. MATERIALS AND METHODS.
4. BUILDING THE UMLS TERM LEXICON.
5. BUILDING THE ZIPF DISTRIBUTION.
6. BARRIER WORD METHOD.
7. BUILDING THE PARSING TABLE.
8. SYNTACTIC PROCESSING.
9. WORD ORDER REARRANGEMENT.
10. USER INTERFACE.
11. BUILDING MEDICAL ONTOLOGIES.
12. RESULTS: DISTRIBUTION OF WORDS AND UMLS CODES.
13. DISCUSSION.
14. REFERENCES.
1. ABSTRACT.
Background:
There is a growing need for public inventories
of anatomic pathology data, coded in standard nomenclatures,
that could be used for epidemiologic investigations, outcome studies,
and quality improvement projects; and combined with data from other
institutions for comparative studies.
Design:
The Goethe University Autopsy Register (GUAR)
consists of 12,447 autopsy summaries, spanning 15 years, serving
the Goethe University Medical Clinics. Reports are free-text
with controlled orthography and grammar.
Results:
The GUAR text file was 34.2 MB,
with a total of 2,996,154 words. There were 35,145 distinct words,
ranging in frequency from 144,291 occurrences of `der' (the)
to one occurrence apiece of 14,512 words. GUAR autopsy summaries
are posted on the Internet, with a bilingual query engine and linkage
to The Johns Hopkins Autopsy Resource (JHAR) in English.
Conclusions:
GUAR is an example of autopsy summaries originally in German,
with an interface across language domains, including English
and the Unified Medical Language System (UMLS). This register could serve
as a model for decentralized anatomic databases worldwide,
with a potential for building medical ontologies,
improving the dissemination of structured medical knowledge,
and conducting cost-effective international collaborative projects.
2. INTRODUCTION.
1.
The idea of shared, multi-institutional anatomic pathology databases
has been the dream of pathologists for a quarter-century (1,2,3).
Such databases, coded in standard nomenclatures with structured grammars,
could be used to inform interested persons
regarding the range and frequency of diseases
encountered by a particular institution,
and could potentially be combined with data from
other institutions for conducting comparative epidemiologic studies,
outcome studies, and quality assurance studies (4,5).
2.
The Johns Hopkins Autopsy Resource (JHAR) is the first large
anatomic pathology database posted on the Internet.
Founded as an institutional database in 1980
and available publicly since 1995,
the JHAR lists over 50,000 autopsy summaries,
on patients born over a span of two centuries,
with an estimated one million tissue blocks (3,6,7,8).
3.
At the same time, there is a need to protect the identities of patients and
medical providers (7,9,10,11,12,13,14).
The GUAR is anonymized, i.e., all links to exact patient identifiers
have been eliminated. The JHAR is de-identified,
i.e., all links to exact patient identifiers have been encrypted.
With one exception, the JHAR satisfies
the computational-disclosure definition
of k-anonymous de-identification, for k=4 (6,12).
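The k-anonymity condition mentioned above can be illustrated with a minimal sketch. It assumes the usual reading of the definition, namely that every combination of quasi-identifier values must be shared by at least k records. The field names and records below are hypothetical and are not taken from the JHAR or the GUAR.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical records: field names and values are invented for illustration.
records = (
    [{"birth_decade": 1920, "sex": "F"} for _ in range(4)]
    + [{"birth_decade": 1930, "sex": "M"} for _ in range(5)]
)

print(is_k_anonymous(records, ["birth_decade", "sex"], k=4))
```

Adding a single record with a unique combination of values would make the same check fail, which is the kind of exception a curator would then have to inspect by hand.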
4.
The Goethe University Autopsy Register (GUAR)
consists of 12,447 autopsy summaries, spanning 15 years,
serving the Goethe University Medical Clinics (15,16,17).
5.
This report describes the translation of these reports
into the Unified Medical Language System (UMLS)
of the U. S. National Library of Medicine (USNLM)
(18,19,20,21,22).
The primary language of the UMLS is English,
but a subset of UMLS has been translated into German by our laboratory
(23,24,25,26,27).
3. MATERIALS AND METHODS.
1.
Format of GUAR reports.
The initial file was a 34.2 MB text file,
and consisted of free-text autopsy summaries spanning 15 years,
with user-determined, German-language vocabulary, with spell-check control.
Each autopsy summary consisted of four divisions:
principal findings, individual diagnoses, microscopic notes, and comment.
2.
Processing GUAR reports.
The autopsy summary report was initially dropped to all lower case,
and each punctuation mark was buffered on either side
with a blankspace (ASCII 32).
For patient-confidentiality reasons, all numerals were tokenized.
A period forming part of an abbreviation (e.g., li. niere = left kidney)
was handled as an exceptional case in the term lexicon.
After preprocessing, each complete sentence ended with a period (ASCII 46),
followed by a blankspace.
Unpaired punctuation marks ( , ; : ) were handled grammatically as commas.
Paired punctuation marks ( [] () [] <> `' )
were handled grammatically as parentheses.
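The preprocessing steps above can be sketched as follows. This is an illustrative sketch only, not the authors' program: the placeholder token for numerals and the abbreviation list are assumptions (the source gives only `li.' as an example of an abbreviation).

```python
import re

# Assumptions for illustration: the placeholder token is hypothetical, and the
# abbreviation list would in practice come from the term lexicon.
NUMERAL_TOKEN = "x"
ABBREVIATIONS = {"li."}


def preprocess(text):
    """Lower-case the text, mask numerals, and put a blank on either side
    of every punctuation mark, leaving known abbreviations intact."""
    text = text.lower()
    # Replace each numeral (optionally with a decimal comma or point) by one token.
    text = re.sub(r"[0-9]+(?:[.,][0-9]+)*", NUMERAL_TOKEN, text)
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the abbreviation period attached
            continue
        # Any character that is neither a word character nor whitespace
        # is treated as punctuation and buffered with blanks.
        tokens.extend(re.sub(r"([^\w\s])", r" \1 ", chunk).split())
    return " ".join(tokens)


print(preprocess("Colon sigmoideum: Divertikel mit chronischer Diverticulitis."))
```

Handling abbreviations by exact lookup keeps the sentence-final period unambiguous, at the cost of maintaining the abbreviation list by hand.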
3.
Rules for consistent medical German:
Free-text autopsy summaries employed
user-determined, German-language vocabulary, with spell-check control.
Umlauts and sz-ligatures were expressed in expanded form only
(i.e., ä always expressed as ae, etc.).
German words were listed case-insensitively in the lexicon.
The system includes a pretty-print dictionary for displaying
case-sensitive words (German nouns beginning with upper-case),
as well as the usual appearance of umlauts, sz-ligatures, etc.
The text consisted of short sentences, grammatically correct,
ending with a period (16).
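A minimal sketch of the orthography rules above, assuming that the sz-ligature expands to `ss' (as in `grosse' in the frequency table of this report). The pretty-print dictionary shown here is a hypothetical three-entry stand-in for the real one.

```python
# Expansion table: umlauts and the sz-ligature in their expanded forms.
EXPAND = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

# Hypothetical pretty-print dictionary: lexicon form -> display form.
PRETTY_PRINT = {
    "truebe": "trübe",
    "grosse": "große",
    "herzventrikels": "Herzventrikels",
}


def to_lexicon_form(word):
    """Expand umlauts and ligatures, then drop to lower case."""
    return word.translate(EXPAND).lower()


def to_display_form(word):
    """Look up the usual printed appearance; fall back to the word itself."""
    return PRETTY_PRINT.get(word, word)


print(to_lexicon_form("Trübe"), to_display_form("herzventrikels"))
```

A lookup table is used for display because the expansion is not reversible by rule: `ue' in a word does not always stand for an umlaut.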
4. BUILDING THE UMLS TERM LEXICON.
1.
Preparation of the UMLS Vocabulary.
The Unified Medical Language System (UMLS)
of the U. S. National Library of Medicine (USNLM)
is by far the world's largest medical concept system (20),
and is the best tool for research studies
in controlled medical vocabularies. UMLS serves as an indexing tool
for PubMed, a collection of over eleven million
medical citations available on the Internet to the general public.
UMLS provides a uniform, integrated distribution format from
over one hundred biomedical vocabularies and classifications.
The 2001 UMLS Metathesaurus contains over one million
biomedical concepts and over two million
different concept-names. Each Concept Unique Identifier (CUI)
consists of C followed by seven decimal digits, with an accompanying
synonym-name. Different synonym-names for the same concept share the same CUI.
UMLS is updated annually and made available cost-free to registered users.
English is the primary language of UMLS.
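The CUI format described above is easy to check mechanically. The sketch below also accepts the hyphen-joined compound codes used elsewhere in this report (for example, C0007097-C0027627); the function names are hypothetical.

```python
import re

# A CUI is the letter C followed by exactly seven decimal digits.
CUI_PATTERN = re.compile(r"C[0-9]{7}")


def is_cui(code):
    """Return True if the string is a single well-formed CUI."""
    return CUI_PATTERN.fullmatch(code) is not None


def split_compound(code):
    """Split a hyphen-joined compound code into its CUIs.
    Raises ValueError if any part is malformed."""
    parts = code.split("-")
    for part in parts:
        if not is_cui(part):
            raise ValueError(f"malformed CUI: {part!r}")
    return parts


print(split_compound("C0007097-C0027627"))
```

This checks the form of a code only; whether the code exists in a given UMLS release has to be verified against the Metathesaurus itself.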
2.
A German-to-UMLS Lexicon was prepared as follows.
An uncopyrighted list of English-language autopsy-related words
was harvested from The Johns Hopkins Autopsy Resource (JHAR) (6),
which is keyed to the UMLS.
3.
This list was computer-transliterated into German
by substitution of common suffixes and prefixes, according to Wingert,
e.g., hematology ==> Haematologie (26,27,28).
4.
Each word is assigned a part-of-speech, a UMLS code,
and an English translation.
5.
Newly-encountered words are default-assigned to NOUN,
except for words having
common ADJECTIVAL endings, such as:
-ische -ischem -ischen -ischer -isches,
-ale -alem -alen -aler -ales,
-aere -aerem -aeren -aerer -aeres,
-oese -oesem -oesen -oeser -oeses, etc.
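The default part-of-speech rule above can be sketched in a few lines. Only the four ending families listed in the text are included; the source indicates that the real list is longer, so this is a heuristic that a lexicon builder would still review by hand.

```python
# Ending families from the text, each combined with the five inflections.
ADJECTIVE_STEMS = ("isch", "al", "aer", "oes")
INFLECTIONS = ("e", "em", "en", "er", "es")
ADJECTIVE_ENDINGS = tuple(
    stem + inflection
    for stem in ADJECTIVE_STEMS
    for inflection in INFLECTIONS
)


def default_part_of_speech(word):
    """Return 'A' for a word with a common adjectival ending, else 'N'."""
    if word.lower().endswith(ADJECTIVE_ENDINGS):
        return "A"
    return "N"


for word in ("katarrhalische", "cerebrale", "Niere"):
    print(word, default_part_of_speech(word))
```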
5. BUILDING THE ZIPF DISTRIBUTION.
1.
It is extremely helpful to prepare a descending-order
frequency distribution, or Zipf distribution,
of single words occurring in the source text document.
The vertical axis (Y-axis) of a Zipf distribution is the
frequency, f, of each word,
and the horizontal axis (X-axis) is the RANK, r,
for that word.
2.
Rank-one is assigned to the most-frequent word
(English: of, and, the; German: der, des, im, und);
rank-two is assigned to the second-most-frequent word, etc.
3.
Zipf (29) observed in English-language literature in the humanities,
and others subsequently observed in large healthcare text databases
in English (8) and German (24),
that there is an approximate inverse relationship
between frequency and rank, namely, f = k / r,
for some constant, k.
4.
The immediate consequence of Zipf's Law
is that fewer than one hundred distinct words,
ranked one through 100 (left X-axis),
typically account for over half the words
in a given, large free-text document;
and conversely there are thousands of words (right X-axis)
that occur rarely or only once.
5.
Left-Zipf-words, also known as barrier words
or stop words, have almost no recall value
in a computerized indexing system.
On the other hand, right-Zipf-words
are highly specific for indexing purposes.
6.
This understanding allows the lexicon builder
to prioritize certain groups of words for attachment to UMLS codes.
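A descending-order frequency distribution of this kind can be built with a short sketch such as the following. The function names are hypothetical, and the token list is a toy example rather than GUAR data.

```python
from collections import Counter


def zipf_distribution(tokens):
    """Return (rank, word, frequency) triples in descending order of
    frequency; ties are broken alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def coverage(distribution, top_n):
    """Fraction of all word occurrences accounted for by the top_n ranks."""
    total = sum(freq for _, _, freq in distribution)
    if total == 0:
        return 0.0
    top = sum(freq for rank, _, freq in distribution if rank <= top_n)
    return top / total


# Toy token list for illustration only.
tokens = "der der der des des und".split()
distribution = zipf_distribution(tokens)
print(distribution)
print(coverage(distribution, top_n=1))
```

Calling `coverage` with `top_n=100` on a full corpus gives the share of text covered by the left-Zipf words, which is the quantity used above to decide which words to code first.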
6. BARRIER WORD METHOD.
1. In the Barrier Word Method,
short, low-information words serve as BARRIERS
between indexable, multiple-word pathology concepts in free-text
(22,30,31,32).
barrier words are in lower case. KEYWORDS ARE IN UPPER CASE.
CHRONISCHE PANCREATITIS , vorwiegend im bereich des CAPUT PANCREATIS .
multiple FETTGEWEBSNEKROSEN des CAPUT PANCREATIS und CORPUS PANCREATIS .
SCHLEIMBILDENDES ADENOCARCINOM , stellenweise mit SIEGELRINGZELLEN .
CARCINOMA SOLIDUM SCIRRHOSUM der MAMMA .
COLON SIGMOIDEUM : DIVERTIKEL mit CHRONISCHER DIVERTICULITIS .
2. BARRIER WORDS:
RANK UMLS-CUI FREQUENCY GERMAN ENGLISH
1 ... D:C0205447 ........... 144,291 ... der .............. the
2 ... PD:C0456627-C0205447 . 87,031 ... des .............. of the
3 ... Z:C0475207 ........... 72,404 ... cm ............... cm
4 ... PD:C0332285-C0205447 . 65,237 ... im ............... in the
5 ... C:C0332287 ........... 63,377 ... und .............. and
6 ... P:C0332287 ........... 37,173 ... mit .............. with
7 ... N:C0441949 ........... 34,900 ... X ................ X
8 ... P:C0205105 ........... 28,725 ... bis .............. until
9 ... P:C0231290 ........... 28,493 ... nach ............. after
10 ... A:C0205172 ........... 18,485 ... mehrere .......... more
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.
3. MULTIPLE-WORD TERMS DISCOVERED BY THE BARRIER WORD METHOD:
RANK FREQUENCY GERMAN ENGLISH
1 ... 12,391 ... AN:mikroskopische Untersuchungen ... microscopic studies
2 ... 10,598 ... AN:akute Blutstauung ............... acute blood congestion
3 ... 9,167 ... AN:linken Herzventrikels ........... left cardiac ventricle
4 ... 7,301 ... AN:rechten Herzventrikels .......... right cardiac ventricle
5 ... 5,881 ... AN:allgemeine Atherosklerose ....... general atherosclerosis
6 ... 5,491 ... AN:disseminierte Herzmuskelschwielen . disseminated cardiac muscle scars
7 ... 5,379 ... AN:klinischen Angaben .............. clinical data
8 ... 5,318 ... AN:pathologischen Befund ........... pathologic finding
9 ... 5,222 ... AN:truebe Schwellung ............... cloudy swelling
10 ... 3,403 ... AN:katarrhalische Tracheobronchitis . catarrhal tracheobronchitis
11 ... 3,175 ... AN:Gastromalacia acida ............. acidic gastromalacia
12 ... 3,072 ... AN:linker Ventrikel ................ left ventricle
13 ... 2,852 ... AN:coronare Atherosklerose ......... coronary atherosclerosis
14 ... 2,731 ... AN:aryepiglottischen Falten ........ aryepiglottic folds
15 ... 2,291 ... AN:flache Nierenrindennarben ....... flat renal cortical scars
16 ... 2,203 ... AN:katarrhalische Bronchitis ....... catarrhal bronchitis
17 ... 2,051 ... AN:mittelgradige allgemeine Atherosklerose . moderate-grade general atherosclerosis
18 ... 2,043 ... AN:eitrige Bronchiolitis ........... purulent bronchiolitis
19 ... 1,995 ... AN:strangfoermige Pleuraverwachsungen . necklace-like pleural growths
20 ... 1,920 ... AN:cerebrale atherosklerose ........ cerebral atherosclerosis
Barrier words include: all punctuation, all numerals,
and nearly all one-letter words and two-letter words,
articles, prepositions, and common verbs and modifiers.
A medical concept, or KEYWORD TERM, is a
SEQUENCE OF KEYWORDS UNINTERRUPTED BY BARRIER WORDS.
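The definition above translates directly into a splitting routine. In this sketch the barrier word list is a small hypothetical subset, and every one-letter and two-letter word is treated as a barrier, which is a simplifying assumption (the text says "nearly all").

```python
# Hypothetical subset of the barrier word list, for illustration only.
BARRIER_WORDS = {
    "der", "des", "und", "mit", "vorwiegend", "bereich",
    "multiple", "stellenweise",
}


def is_barrier(token):
    """Punctuation, numerals, very short words, and listed words are barriers."""
    if not any(ch.isalpha() for ch in token):
        return True  # punctuation mark or numeral
    if len(token) <= 2:
        return True  # assumption: all one- and two-letter words
    return token in BARRIER_WORDS


def keyword_terms(sentence):
    """Return the maximal runs of keywords uninterrupted by barrier words."""
    terms, current = [], []
    for token in sentence.split():
        if is_barrier(token):
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms


print(keyword_terms(
    "chronische pancreatitis , vorwiegend im bereich des caput pancreatis ."
))
```

The input is expected in preprocessed form, that is, lower case with each punctuation mark separated by blanks.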
7. BUILDING THE PARSING TABLE.
1.
Basic German sentence patterns are common and obvious,
and may be obtained from elementary grammar texts.
2.
A MUMPS program attempts to parse each initial sentence.
3.
Each failed parse is examined manually,
in order to enrich the parsing table,
or to verify that the sentence was incorrectly formed.
4.
Sample Zipf distribution of parsing formulas
for English surgical pathology reports (33).
5.
Sample parser for selected sentences (34).
8. SYNTACTIC PROCESSING.
1.
The parser follows Chomsky's generative grammar (33,35).
Each word/phrase is assigned a part-of-speech, or choice of parts-of-speech.
Therefore, each sentence points to a sequence of parts-of-speech.
Each grammatically correct sentence pattern
corresponds to a part-of-speech sequence,
listed in the parsing table.
A long sentence is parsed iteratively.
2.
For example, this German sentence:
` Hypertrophie und Dilatation des linken Herzventrikels '
( English: Hypertrophy and dilatation of-the left cardiac-ventricle )
finds the following parts-of-speech and UMLS codes
in the German Lexicon:
hypertrophie::N:C0020564
und::C:C0332287
dilatation::N:C0012356
des::PD:C0456627-C0205447
linken::A:C0208973
herzventrikels::N:C0018827
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.
3.
The part-of-speech-sequence, or parsandum,
for the example is: `NCNDAN'.
This parsandum is known to the parser,
so the sentence parses successfully (23,24).
The final UMLS-coded sentence is:
N:C0020564 C:C0332287 N:C0012356 P-D:C0456627-C0205447 A:C0208973 N:C0018827
4.
Final sentence translated into English:
Hypertrophy and dilatation of-one left cardiac-ventricle.
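The lookup-and-parse steps of this example can be sketched as follows. This is not the authors' MUMPS program. The lexicon holds only the six entries listed above, and the parsing table holds only the one known parsandum. One assumption is made to reproduce the parsandum `NCNDAN' given in the text: a contraction tagged PD contributes only its final letter, D, to the parsandum.

```python
# Lexicon entries from the example: word -> (part-of-speech, UMLS code).
LEXICON = {
    "hypertrophie": ("N", "C0020564"),
    "und": ("C", "C0332287"),
    "dilatation": ("N", "C0012356"),
    "des": ("PD", "C0456627-C0205447"),
    "linken": ("A", "C0208973"),
    "herzventrikels": ("N", "C0018827"),
}

# Parsing table reduced to the single parsandum from the example.
PARSING_TABLE = {"NCNDAN"}


def parse(sentence):
    """Return the UMLS-coded sentence, or None if the parse fails."""
    entries = [LEXICON.get(token) for token in sentence.lower().split()]
    if None in entries:
        return None  # unknown word: left for manual review
    # Assumption: a compound tag such as PD counts as its last letter.
    parsandum = "".join(pos[-1] for pos, _ in entries)
    if parsandum not in PARSING_TABLE:
        return None  # unknown sentence pattern: left for manual review
    # Compound tags are written with a hyphen, as in the coded sentence above.
    return " ".join(f"{'-'.join(pos)}:{code}" for pos, code in entries)


print(parse("Hypertrophie und Dilatation des linken Herzventrikels"))
```

Returning None for a failed parse mirrors the workflow described earlier, in which each failure is examined manually and either the parsing table is enriched or the sentence is judged to be malformed.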
9. WORD ORDER REARRANGEMENT.
1.
If necessary, stereotypic German word order
is converted into stereotypic English word order (23,25).
A common non-English grammatical construction
found in German autopsy reports is illustrated
in the following sentence:
Sigmadivertikulose mit mehreren bis 1-cm tiefen Divertikeln
(English: Sigmoid-colon-diverticulosis with multiple up-to 1-cm deep diverticula )
2.
The parsandum for the German sentence is: `NPAPZAN'
3.
The conventional English word-order for this sentence would be:
Sigmoid-colon-diverticulosis with multiple diverticula up-to 1-cm deep.
Thus, the parsandum for the English sentence,
in stereotypic English word order, would be: `NPANPZA'.
4.
The German parsandum is transformed into its corresponding English parsandum
with the following formula:
^1[ ^2N ^3P ^4A ^6P ^7Z ^8A ^5N ^9]
The number preceding each symbol (a superscript in the original notation)
gives the new position to which that part-of-speech is moved.
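Reading each number in the formula as the new position of the symbol that follows it, the rearrangement is a simple permutation. The sketch below applies it both to the parsandum and to the translated words; the function name is hypothetical, and the brackets are carried along as sentence-boundary markers at positions 1 and 9.

```python
# New positions from the formula, in German word order:
# [ -> 1, N -> 2, P -> 3, A -> 4, P -> 6, Z -> 7, A -> 8, N -> 5, ] -> 9
NEW_POSITIONS = [1, 2, 3, 4, 6, 7, 8, 5, 9]


def rearrange(items, new_positions):
    """Move items[i] to 1-based position new_positions[i]."""
    if sorted(new_positions) != list(range(1, len(items) + 1)):
        raise ValueError("new_positions must be a permutation of 1..n")
    result = [None] * len(items)
    for item, position in zip(items, new_positions):
        result[position - 1] = item
    return result


german_parsandum = list("[NPAPZAN]")
print("".join(rearrange(german_parsandum, NEW_POSITIONS)))

words = ["[", "Sigmoid-colon-diverticulosis", "with", "multiple",
         "up-to", "1-cm", "deep", "diverticula", "]"]
print(" ".join(rearrange(words, NEW_POSITIONS)[1:-1]))
```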
5.
This parser has been in long-term use at the
Goethe University Center for Medical Informatics, since 1990 (16).
10. USER INTERFACE.
1.
The user interface assumes a slight knowledge of German medical terminology,
or a willingness to learn.
Most German medical words are Greco-Latin derivatives,
and differ from their English counterparts
by only a few spelling conventions (26,27,28).
2.
Choice of English or German input word.
3.
For German input word, browser goes directly to autopsy text
containing those German words.
4.
For English input word, browser shows corresponding German words,
and prompts the user for a German search word.
5.
Bilingual display, UMLS codes if desired.
6.
Bilateral cross linkage to The Johns Hopkins Autopsy Resource (6).
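The two query paths can be illustrated with a minimal sketch. It is not the actual query engine: the word pairs are taken from the frequency table in this report, the second summary line is a hypothetical example, and all function names are invented.

```python
from collections import defaultdict

# (German word, English gloss) pairs from the frequency table in this report.
LEXICON = [
    ("linken", "left"), ("links", "left"), ("linker", "left"),
    ("rechten", "right"), ("leber", "liver"),
]

# The first line is the example sentence of this report; the second is hypothetical.
SUMMARIES = [
    "hypertrophie und dilatation des linken herzventrikels .",
    "akute blutstauung der leber .",
]


def build_english_index(lexicon):
    """Map each English gloss to the German words that carry it."""
    index = defaultdict(list)
    for german, english in lexicon:
        index[english].append(german)
    return index


def german_candidates(english_word, index):
    """English input: list the German words offered to the user."""
    return sorted(index.get(english_word.lower(), []))


def search(german_word, summaries):
    """German input: return the summaries containing that word."""
    word = german_word.lower()
    return [text for text in summaries if word in text.lower().split()]


INDEX = build_english_index(LEXICON)
print(german_candidates("left", INDEX))
print(search("linken", SUMMARIES))
```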
11. BUILDING MEDICAL ONTOLOGIES.
1.
An ONTOLOGY is a (Platonic) description of essential reality,
i.e., what actually is, as opposed to what one can see
(observation, accident), or what one can know
(epistemology) (36).
The term ontology was coined by two German philosophers,
Göckel and Lorhard, in 1613, and first appeared in English in 1721.
Quine (37) views ontology as the metaphysical commitments
or presuppositions embodied in the different natural sciences.
For example, the belief that a cancer can metastasize
would be an ONTOLOGICAL COMMITMENT.
In the philosophy and practice of science, ontology goes under various names:
essence, reality, Mind of God, nature,
gold standard, or mathiverse (38).
In medical informatics, ontology has come to mean
a structured list of concepts, typically prepared by an expert
or panel of experts.
2.
With the ease of posting structured lists on the Internet,
and with Extensible Markup Language (XML)
as an emerging standard for such lists,
it is likely that the next decade will witness an explosion
of public medical ontologies, both amateur and professional.
3.
The importance of ontologies has been recognized by the
U. S. Defense Advanced Research Projects Agency (DARPA),
the original sponsor of the Internet, which has proposed guidelines
for a formal ontology AGENT MARKUP LANGUAGE,
that employs the ONTOLOGY INFERENCE LAYER (39,40).
4.
A simple ontology is illustrated by the observation at autopsy that
CHRONIC-PASSIVE-CONGESTION-LIVER (CPCL = C0700148-C0721399)
often accompanies HEART HYPERTROPHY (HH = C0795691-C0333959).
In an approximate sense, HH causes CPCL (41).
Thus, one might expect a hypothetical collection, say,
of 10,000 autopsy cases to distribute as follows:
Rows: CHRONIC-PASSIVE-CONGESTION-LIVER (CPCL = C0700148-C0721399).
Columns: HEART HYPERTROPHY (HH = C0795691-C0333959).
 ___________________________________________
|          |  HH (-)  |  HH (+)  |  Total   |
|__________|__________|__________|__________|
| CPCL (-) |    7,000 |    1,000 |    8,000 |
|__________|__________|__________|__________|
| CPCL (+) |        0 |    2,000 |    2,000 |
|__________|__________|__________|__________|
| Total    |    7,000 |    3,000 |   10,000 |
|__________|__________|__________|__________|
5.
That is, most autopsies are negative for both features;
HH anticipates CPCL in some autopsies;
but there should be only rare autopsies with CPCL but without HH.
Therefore, in the language of propositional logic,
CPCL+ IMPLIES HH+.
6.
Such correlations (2x2 CONTINGENCY TABLES)
could be edited for redundancy and nonsense correlations (42).
7.
As necessary, a collection of such 2x2 contingency tables could be
ORDERED BY IMPORTANCE, based upon the frequencies
of autopsy cases appearing in the lower right corner of the table.
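The implication test and the ordering by importance can be sketched as follows. The first table holds the hypothetical counts given above; the second table and all names are invented for illustration. The sketch assumes that the cell used for ordering is the both-positive cell, that is, the lower right cell when the totals are set aside.

```python
# Each table maps (antecedent present, consequent present) to a case count.
# For the first table the antecedent is CPCL and the consequent is HH.
CPCL_HH = {
    (False, False): 7000, (False, True): 1000,
    (True, False): 0, (True, True): 2000,
}

# Hypothetical second table, used only to show the ordering step.
OTHER_PAIR = {
    (False, False): 9000, (False, True): 500,
    (True, False): 0, (True, True): 500,
}


def supports_implication(table, max_exceptions=0):
    """True if at most max_exceptions cases show the antecedent
    without the consequent."""
    return table[(True, False)] <= max_exceptions


def importance(table):
    """Number of cases positive for both features."""
    return table[(True, True)]


ordered = sorted([OTHER_PAIR, CPCL_HH], key=importance, reverse=True)
print(supports_implication(CPCL_HH), [importance(table) for table in ordered])
```

The `max_exceptions` parameter allows for the rare exceptional autopsies mentioned above, since real data will seldom show an exactly empty cell.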
12. RESULTS: DISTRIBUTION OF WORDS AND UMLS CODES.
1.
ZIPF DISTRIBUTION OF WORDS IN THE GUAR.
The GUAR contained a total of 2,996,154 words. There were 35,145 distinct words,
ranging in frequency from 144,291 occurrences of `der'
to one occurrence apiece of 14,512 single words.
In some cases, the German word corresponds to a compound concept
in UMLS, as for example, CARCINOMMETASTASEN
(carcinomatous-metastases, C0007097-C0027627).
The hundred most frequent words are as follows:
RANK UMLS-CUI FREQUENCY GERMAN ENGLISH
1 ... C0205447 ........... 144,291 ... D:der .............. the
2 ... C0456627-C0205447 .. 87,031 ... PD:des ............. of the
3 ... C0475207 ........... 72,404 ... Z:cm ............... cm
4 ... C0332285-C0205447 .. 65,237 ... PD:im .............. in the
5 ... C0332287 ........... 63,377 ... C:und .............. and
6 ... C0332517 ........... 38,481 ... N:Durchmesser ...... diameter
7 ... C0549177 ........... 38,103 ... A:grosse ........... large
8 ... C0332287 ........... 37,173 ... P:mit .............. with
9 ... C0441949 ........... 34,900 ... N:X ................ X
10 ... C0019080-C0700148 .. 33,576 ... N:Blutstauung ...... blood congestion
11 ... C0439801 ........... 32,314 ... A:geringe .......... limited
12 ... C0205105 ........... 28,725 ... P:bis .............. until
13 ... C0231290 ........... 28,493 ... P:nach ............. after
14 ... C0205090 ........... 27,177 ... A:rechten .......... right
15 ... C0205091 ........... 26,941 ... A:linken ........... left
16 ... C0721399 ........... 24,547 ... N:Leber ............ liver
17 ... C0004153 ........... 23,484 ... N:Atherosklerose ... atherosclerosis
18 ... C0205090 ........... 20,306 ... B:rechts ........... right
19 ... C0205091 ........... 19,368 ... B:links ............ left
20 ... C0012359 ........... 18,854 ... N:Dilatation ....... dilatation
21 ... C0205172 ........... 18,485 ... A:mehrere .......... more
22 ... C0348080 ........... 18,045 ... N:Zustand .......... condition
23 ... C0439203 ........... 16,860 ... P:in ............... in
24 ... C0018827 ........... 16,678 ... N:Herzventrikels ... cardiac ventricle
25 ... C0443224 ........... 15,692 ... A:frische .......... fresh
26 ... C0522501 ........... 15,552 ... A:ausgepraegte ..... striking
27 ... C0024109 ........... 15,100 ... N:Lunge ............ lung
28 ... C0439539 ........... 14,581 ... A:schwere .......... heavy
29 ... C0430007 ........... 13,018 ... N:Untersuchungen ... investigations
30 ... C0034063 ........... 12,987 ... N:Lungenoedem ...... pulmonary edema
31 ... C0401925-C0243095 .. 12,729 ... N:Hauptbefund ...... principal finding
32 ... C0238767 ........... 12,654 ... B:beiderseits ...... bilaterally
33 ... C0237401-C0348026 .. 12,427 ... N:Einzeldiagnosen .. individual diagnoses
34 ... C0205288 ........... 12,420 ... A:mikroskopische ... microscopic
35 ... C0205178 ........... 12,233 ... A:akute ............ acute
36 ... C0496927 ........... 12,162 ... N:Niere ............ kidney
37 ... C0022646 ........... 12,142 ... N:Nieren ........... kidneys
38 ... C0205081 ........... 12,099 ... A:mittelgradige .... moderate grade
39 ... C0205246 ........... 12,046 ... A:allgemeine ....... general
40 ... C0238767 ........... 11,765 ... A:beider ........... both
41 ... C0812414 ........... 11,569 ... N:Milz ............. spleen
42 ... C0205147 ........... 11,145 ... N:Bereich .......... region
43 ... C0795691 ........... 10,993 ... N:Herz ............. heart
44 ... C0549177 ........... 10,819 ... A:grosses .......... large
45 ... C0333959 ........... 10,611 ... N:Hypertrophie ..... hypertrophy
46 ... C0205191 ........... 10,409 ... A:chronische ....... chronic
47 ... C0027361 ........... 10,238 ... N:Seite ............ page
48 ... C0205160 ........... 10,151 ... B:nicht ............ not
49 ... C0238767 ........... 9,980 ... B:beidseits ........ bilaterally
50 ... C0439239 ........... 9,689 ... Z:ml ............... ml
51 ... C0027061-C0008767 .. 9,295 ... N:Herzmuskelschwielen . cardiac muscle scars
52 ... C0549177 ........... 9,094 ... A:grosser .......... large
53 ... C0154054 ........... 9,084 ... N:Lymphknoten ...... lymph nodes
54 ... C0018964-C0014448 .. 9,072 ... N:HE ............... hematoxylin-eosin
55 ... C0024109 ........... 8,981 ... N:Lungen ........... lungs
56 ... C0439751 ........... 8,804 ... B:ganz ............. entirely
57 ... C0205210 ........... 8,357 ... A:klinischen ....... clinical
58 ... C0001779 ........... 8,339 ... N:Alter ............ age
59 ... C0242356 ........... 8,285 ... N:Angaben .......... data
60 ... C0700321 ........... 8,006 ... A:kleine ........... small
61 ... C0019080 ........... 7,980 ... N:Blutungen ........ hemorrhages
62 ... C0004144 ........... 7,865 ... N:Atelektasen ...... atelectases
63 ... C0236177 ........... 7,818 ... A:katarrhalische ... catarrhal
64 ... C0205221 ........... 7,728 ... A:disseminierte .... disseminated
65 ... C0018827 ........... 7,617 ... N:Herzventrikel .... cardiac ventricle
66 ... C0549177 ........... 7,514 ... A:grossen .......... large
67 ... C0016059 ........... 7,477 ... N:Fibrose .......... fibrosis
68 ... C0003483 ........... 7,387 ... N:Aorta ............ aorta
69 ... C0443224 ........... 7,210 ... A:frischer ......... fresher
70 ... C0205094-C0441995 .. 7,165 ... N:Vorderwand ....... anterior wall
71 ... C0006285 ........... 7,096 ... N:Bronchopneumonie . bronchopneumonia
72 ... C0442504 ........... 7,066 ... V:stellen .......... place
73 ... C0029456 ........... 7,011 ... N:Osteoporose ...... osteoporosis
74 ... C0038999 ........... 6,814 ... N:Schwellung ....... swelling
75 ... C0243095 ........... 6,765 ... N:Befund ........... finding
76 ... C0205294 ........... 6,703 ... A:multiple ......... multiple
77 ... C0205082 ........... 6,646 ... A:hochgradige ...... high grade
78 ... C0149514 ........... 6,277 ... N:Bronchitis ....... bronchitis
79 ... C0439126 ........... 6,268 ... D:eines ............ a
80 ... C0205234 ........... 6,239 ... A:herdfoermige ..... focal
81 ... C0205041 ........... 6,188 ... A:coronare ......... coronary
82 ... C0007097-C0027627 .. 6,156 ... N:Carcinommetastasen . carcinomatous metastases
83 ... C0443176 ........... 6,124 ... A:umschriebene ..... circumscribed
84 ... C0012359 ........... 6,077 ... N:Ektasie .......... ectasia
85 ... C0439665 ........... 6,059 ... A:eitrige .......... purulent
86 ... C0332288 ........... 5,968 ... P:ohne ............. without
87 ... C0024109-C0003850 .. 5,968 ... N:Pulmonalarteriensklerose . pulmonary arteriosclerosis
88 ... C0750873 ........... 5,804 ... N:Colon ............ colon
89 ... C0205091 ........... 5,695 ... A:linker ........... left
90 ... C0302132 ........... 5,693 ... A:ausgedehnte ...... bulging
91 ... C0205469 ........... 5,664 ... A:pathologischen ... pathologic
92 ... C0813176 ........... 5,646 ... N:Pankreas ......... pancreas
93 ... C0005682 ........... 5,481 ... N:Harnblase ........ urinary bladder
94 ... C0205406 ........... 5,381 ... A:truebe ........... cloudy
95 ... C0004372 ........... 5,199 ... N:Autolyse ......... autolysis
96 ... C0205447 ........... 5,176 ... D:den .............. the
97 ... C0439798-C0019080 .. 4,878 ... N:Schleimhautblutungen . mucosal hemorrhages
98 ... C0332251 ........... 4,817 ... B:vorwiegend ....... predominantly
99 ... C0205447 ........... 4,798 ... D:die .............. the
100 ... C0750519 ........... 4,797 ... I:wegen ............ because of
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.
2.
Complete listing of: Barrier words (50),
UMLS translations (51), Zipf Distribution (52).
13. DISCUSSION.
1.
There is a perceived need, and a potentially enormous opportunity,
to share anatomic pathology records between institutions,
and even between different medical language domains.
Vast resources of data are currently available as text files
in pathology departments worldwide.
Possible uses for these shared resources include:
epidemiologic studies, outcome studies, and quality assurance studies.
These unstructured text files often contain
highly specific information about an individual patient,
which is both difficult to extract in a standardized fashion
and potentially open to attack by hostile persons
in search of confidential patient data.
2.
The serious issues of patient and provider confidentiality
can be addressed by removing specific identifiers,
and by Sweeney's k-anonymity test.
The Johns Hopkins Autopsy Resource (JHAR)
is already an example of how this can be achieved,
for over 50,000 reports in the English language domain.
Further anonymization can be achieved by standardization
of terminology and translation into a coded language.
3.
GUAR is an example of over 12,000 reports in the German language domain,
with linkages to the JHAR.
It provides semiautomated translation into English and UMLS,
and an interface across language domains,
including English, German, UMLS, and XML.
GUAR serves as a model for decentralized autopsy databases worldwide,
across institutions and languages,
with common codes and a bilingual output.
4.
Such coded registers can serve as a resource for
building medical ontologies, and posting them on the Internet,
for better understanding of medical knowledge.
Initially, ontologies can be built from textbooks
or textbook-summaries.
They can then be fine-tuned through examination of autopsy registers.
5.
GUAR can serve as a tool for broader understanding of medical relationships,
and a wider forum for medical knowledge dissemination and discussion,
through such Internet ontologies.
GUAR is a model for inexpensive international collaborative projects.
14. REFERENCES.
1.
Peery TM.
The autopsy data bank.
A proposal for pathologists to contribute
to the health care of the nation.
Am J Clin Pathol 69 (Suppl): 258-259, 1978.
2.
Carter JR, Nash NP, Cechner RL, Platt RD.
Proposal for a national autopsy data bank.
A potential major contribution of pathologists
to the health care of the nation.
Am J Clin Pathol. 76 (Suppl): 597-617, 1981.
3.
Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype Internet autopsy database.
1625 consecutive fetal and neonatal
autopsy facesheets spanning 20 years.
Arch Pathol Lab Med. 1996 Aug;120(8):782-785.
4.
Moore GW, Berman JJ.
Anatomic Pathology Data Mining. Chapter 4.
In: Cios KJ. Medical Data Mining and Knowledge Discovery. 2001.
Published within the series: "Studies in Fuzziness and Soft Computing",
Physica-Verlag Heidelberg, a Springer-Verlag Company.
Heidelberg: Springer-Verlag.
ISBN: 3-7908-1340-0, 502 pages.
5.
Shared Pathology Informatics Network.
http://grants.nih.gov/grants/guide/rfa-files/RFA-CA-01-006.html
6.
The Johns Hopkins Autopsy Resource:
http://www.netautopsy.org/
7.
Berman JJ, Moore GW, Hutchins GM.
Maintaining patient confidentiality
in the public domain Internet Autopsy Database (IAD).
JAMIA (Suppl). 1996;20:328-332.
Proc AMIA Annu Fall Symp. 1996;20:328-332.
8.
Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
Integrated anatomic pathology reporting system
using natural language diagnoses.
Modern Pathol 1:44-50, 1988.
9.
U. S. Code of Federal Regulations. 1995. 45 CFR Subtitle A
(10-1-95 Edition), part 46.101 (b) (4).
U. S. Department of Health and Human Services. Office of the Secretary.
The complete Common Rule document (45CFR46), at URL:
http://www.uaf.edu/oar/irb/45cfr46.html
or at URL:
http://ohrp.osophs.dhhs.gov/humansubjects/guidance/45cfr46.htm
10.
U. S. Code of Federal Regulations. 1999. 45 CFR Parts 160 - 164.
Standards for Privacy of Individually Identifiable Health Information;
Proposed Rule.
Department of Health and Human Services. Office of the Secretary.
Fed Regist. 1999 Nov 3;64(212):59917-59966.
http://aspe.hhs.gov/admnsimp/
11.
National Cancer Institute's Confidentiality Brochure, at URL:
http://www-cdp.ims.nci.nih.gov/policy.html
12.
Sweeney L.
Computational Disclosure Control: A Primer on Data Privacy Protection.
PhD Thesis. Massachusetts Institute of Technology. Spring, 2001. Draft.
http://www.swiss.ai.mit.edu/classes/6.805/articles/privacy/sweeney-thesis-draft.pdf
13.
Sweeney L.
Privacy and medical-records research.
N Engl J Med. 1998 Apr 9;338(15):1077.
PMID: 9537887; UI: 98181820.
14.
Sweeney L.
Guaranteeing anonymity when sharing medical data, the Datafly System.
Proc AMIA Annu Fall Symp. 1997;:51-55.
PMID: 9357587; UI: 98020458.
15.
Goethe University Autopsy Resource.
http://www.medparse.com/guaruprt.htm
16.
Giere W, Moore GW.
Xmed-ED. EDV-gestützte Übersetzungen medizinischer Texte
aus dem Englischen ins Deutsche.
Messe-Exponate der Uni-Frankfurt, 1994.
In: Kirsten W, Klar R, eds.
Dokumentation und Informationsaufbereitung für den Arzt.
Beiträge zur Medizinischen Informatik von Wolfgang Giere.
Darmstadt: Epsilon Verlag, 1996.
ISBN 3-9803214-7, 437 pages.
17.
Moore GW, Hutchins GM.
The persistent importance of autopsies.
Mayo Clin Proc. 2000 Jun;75(6):557-8.
18.
U. S. National Library of Medicine.
Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/
19.
U. S. National Library of Medicine.
UMLS Knowledge Sources. Twelfth Edition.
Unified Medical Language System.
U. S. Department of Health and Human Services.
National Institutes of Health.
National Library of Medicine. 2001.
See also: Ninth Edition, 1998.
20.
Hahn U, Romacker M, Schulz S.
How knowledge drives understanding --
matching medical ontologies with the needs
of medical language processing.
Artif Intell Med 1999; 15:25-51.
21.
Moore GW, Hutchins GM, Boitnott JK, Miller RE, Polacsek RA.
Word root translation of 45,564 autopsy reports into MeSH titles.
Proc Annu Symp Comput Appl Med Care. 1987;11:.
Washington DC, November 1-4, 1987.
22.
Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases identified
on first encounter using the barrier word method.
In: Scherrer JR, Cote RA, and Mandil SH, eds.,
Computerized Natural Medical Language Processing
for Knowledge Representation.
Amsterdam: North-Holland. 1989;:29-39.
ISBN 0-444-87356-2, 296 pages.
23.
Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Automated translation of German to English medical text.
Am J Med. 1986 Jul;81(1):103-111.
24.
Giere W.
Foundations of clinical data automation in cooperative programs.
5th Ann Symp Comp Applic Med Care (Heffernan HG, ed).
Washington, DC. 1981;5:1142-1148.
25.
Giere W, Moore GW.
Translating English into German using VA File Manager.
M Computing, 1:16-23, 1993.
26.
Wingert F.
Medical Linguistics: Automated Indexing into SNOMED.
CRC, Critical Reviews in Medical Informatics. 1988;1:333-403.
27.
Wingert F, Rothman D, Cote RA.
Automated Indexing into SNOMED and ICD.
In: Scherrer JR, Cote RA, and Mandil SH, eds.,
Computerized Natural Medical Language Processing
for Knowledge Representation.
Amsterdam: North-Holland. 1989;:201-239.
ISBN 0-444-87356-2, 296 pages.
28.
Moore GW, Polacsek RA, Erozan YS, de la Monte SM,
Miller RE, Hutchins GM, Riede UN.
Multilingual translation techniques in the analysis
of narrative medical text.
Comput Methods Programs Biomed. 1986 Mar;22(1):35-42.
29.
Zipf GK.
On the Economy of Words.
Chapter 2 in: Human Behavior and the Principle of Least Effort:
An Introduction to Human Ecology.
Cambridge, MA: Addison-Wesley Press, Inc. 1949;:19-55.
30.
Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.
31.
Nelson SJ, Cole WG, Tuttle MS, Olson NE, Sherertz DD.
Recognizing new medical knowledge computationally.
Proc Annu Symp Comput Appl Med Care. 1993;17:409-413.
32.
Wilbur WJ.
Overview of Books at NCBI.
http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link
33.
The Johns Hopkins Autopsy Resource:
Parsing formulas for the nine-million word
Johns Hopkins anatomic pathology files.
http://www.netautopsy.org/vhpsapsx.htm
34.
Sample sentences parsed from English to UMLS.
http://www.medparse.com
35.
Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: The MIT Press. 1965.
36.
Smith B.
Mereotopology: A Theory of Parts and Boundaries.
Data and Knowledge Engineering. 1996;20:287-303.
37.
Quine WVO.
Ontological Relativity and Other Essays.
New York: Columbia University Press. 1969.
38.
Stewart I.
Flatterland: Like Flatland, Only More So.
Cambridge, MA: Perseus Publishing. 2001.
ISBN 0-7382-0442-0, 301 pages.
39.
U. S. Defense Advanced Research Projects Agency (DARPA).
Agent Markup Language.
http://www.daml.org
40.
U. S. Defense Advanced Research Projects Agency (DARPA).
Ontology Inference Layer.
http://www.ontoknowledge.org/oil
41.
Vigorita VJ, Moore GW, Hutchins GM.
Absence of correlation between coronary arterial atherosclerosis
and severity or duration of diabetes mellitus of adult onset.
Am J Cardiol. 1980;46:535-542.
42.
Moore GW, Hutchins GM.
Consistency versus completeness in medical decision making:
Application to 155 patients autopsied after
coronary artery bypass graft surgery.
Proc 6th Annu Symp Comput Appl Med Care. 1982;6:805-811.
50.
Complete listing of German Barrier Words:
http://www.medparse.com/guarbarr.htm
51.
Complete listing of German UMLS translations:
http://www.medparse.com/guarumls.htm
52.
Complete listing of German Zipf distribution:
http://www.medparse.com/guarzipf.htm
53.
Perl Script for Goethe University Autopsy Register.
http://www.medparse.com/guarsrch.txt
Last Updated: October 7, 2001, by G. William Moore, MD, PhD.