GOETHE UNIVERSITY AUTOPSY REGISTER:
ANONYMIZED BILINGUAL DATABASE.

http://www.netautopsy.org/apep01gu.htm
Short Version: http://www.netautopsy.org/apsp01gu.htm



W. Giere, MD [1],
G. William Moore, MD, PhD [2,3,4],
Grover M. Hutchins, MD [3].

From: Center for Medical Informatics, J. W. von Goethe University, Frankfurt, Germany [1].
Pathology and Laboratory Medicine Service (113), Baltimore VA Maryland Health Care System, Baltimore, MD [2].
Department of Pathology, The Johns Hopkins Medical Institutions, Baltimore, MD [3].
Department of Pathology, University of Maryland School of Medicine, Baltimore, MD [4].



TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. MATERIALS AND METHODS.
4. BUILDING THE UMLS TERM LEXICON.
5. BUILDING THE ZIPF DISTRIBUTION.
6. BARRIER WORD METHOD.
7. BUILDING THE PARSING TABLE.
8. SYNTACTIC PROCESSING.
9. WORD ORDER REARRANGEMENT.
10. USER INTERFACE.
11. BUILDING MEDICAL ONTOLOGIES.
12. RESULTS: DISTRIBUTION OF WORDS AND UMLS CODES.
13. DISCUSSION.
14. REFERENCES.



1. ABSTRACT.


Background: There is a growing need for public inventories of anatomic pathology data, coded in standard nomenclatures, that could be used for epidemiologic investigations, outcome studies, and quality improvement projects; and combined with data from other institutions for comparative studies.

Design: The Goethe University Autopsy Register (GUAR) consists of 12,447 autopsy summaries, spanning 15 years, serving the Goethe University Medical Clinics. Reports are free-text with controlled orthography and grammar.

Results: The GUAR text file was 34.2 MB, with a total of 2,996,154 words. There were 35,145 distinct words, ranging in frequency from 144,291 occurrences of `der' (the) to one occurrence apiece of 14,512 words. GUAR autopsy summaries are posted on the Internet, with a bilingual query engine and linkage to The Johns Hopkins Autopsy Resource (JHAR) in English.

Conclusions: GUAR is an example of autopsy summaries originally in German, with an interface across language domains, including English and the Unified Medical Language System (UMLS). This register could serve as a model for decentralized anatomic databases worldwide, with a potential for building medical ontologies, improving the dissemination of structured medical knowledge, and conducting cost-effective international collaborative projects.



2. INTRODUCTION.



      1. The idea of shared, multi-institutional anatomic pathology databases has been the dream of pathologists for a quarter-century (1,2,3). Such databases, coded in standard nomenclatures with structured grammars, could be used to inform interested persons regarding the range and frequency of diseases encountered by a particular institution, and could potentially be combined with data from other institutions for conducting comparative epidemiologic studies, outcome studies, and quality assurance studies (4,5).

      2. The Johns Hopkins Autopsy Resource (JHAR) is the first large anatomic pathology database posted on the Internet. Founded as an institutional database in 1980 and available publicly since 1995, the JHAR lists over 50,000 autopsy summaries, on patients born over a span of two centuries, with an estimated one million tissue blocks (3,6,7,8).

      3. At the same time, there is a need to protect the identities of patients and medical providers (7,9,10,11,12,13,14). The GUAR is anonymized, i.e., all links to exact patient identifiers have been eliminated. The JHAR is de-identified, i.e., all links to exact patient identifiers have been encrypted. With one exception, the JHAR satisfies the computational-disclosure definition of k-anonymous de-identification, for k=4 (6,12).

      4. The Goethe University Autopsy Register (GUAR) consists of 12,447 autopsy summaries, spanning 15 years, serving the Goethe University Medical Clinics (15,16,17).

      5. This report describes the translation of these reports into the Unified Medical Language System (UMLS) of the U. S. National Library of Medicine (USNLM) (18,19,20,21,22). The primary language of the UMLS is English, but a subset of UMLS has been translated into German by our laboratory (23,24,25,26,27).



3. MATERIALS AND METHODS.



      1. Format of GUAR reports. The initial file was a 34.2 MB text file, and consisted of free-text autopsy summaries spanning 15 years, with user-determined, German-language vocabulary, with spell-check control. Each autopsy summary consisted of four divisions: principal findings, individual diagnoses, microscopic notes, and comment.

      2. Processing GUAR reports. The autopsy summary report was initially dropped to all lower case, and each punctuation mark was buffered on either side with a blankspace (ASCII 32). For patient-confidentiality reasons, all numerals were tokenized. A period forming part of an abbreviation (e.g., li. niere = left kidney) was handled as an exceptional case in the term lexicon. After preprocessing, each complete sentence ended with a period (ASCII 46), followed by a blankspace. Unpaired punctuation marks ( , ; : ) were handled grammatically as commas. Paired punctuation marks ( [] () [] <> `' ) were handled grammatically as parentheses.

      3. Rules for consistent medical German: Free-text autopsy summaries employed user-determined, German-language vocabulary, with spell-check control. Umlauts and sz-ligatures were expressed in expanded form only (i.e., ä always expressed as ae, etc.). German words were listed case-insensitively in the lexicon. The system includes a pretty-print dictionary for displaying case-sensitive words (German nouns beginning with upper-case), as well as the usual appearance of umlauts, sz-ligatures, etc. The text consisted of short sentences, grammatically correct, ending with a period (16).
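The preprocessing rules described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original processing code: the function name `preprocess`, the placeholder token `X` for numerals, the `ss` expansion of the sz-ligature, and the exact punctuation set are all choices made for the example. Abbreviation periods (e.g., li. niere), which the source handles in the term lexicon, are not treated here.

```python
import re

# Expanded forms for umlauts and the sz-ligature (the "ss" spelling is an assumption).
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def preprocess(text: str, numeral_token: str = "X") -> str:
    """Lower-case, expand umlauts, mask numerals, and pad punctuation with blanks."""
    text = text.lower()
    for char, expanded in EXPANSIONS.items():
        text = text.replace(char, expanded)
    # Replace each numeral (digits with optional decimal part) by a placeholder token.
    # The placeholder "X" is hypothetical; the source only says numerals were tokenized.
    text = re.sub(r"\d+(?:[.,]\d+)*", numeral_token, text)
    # Buffer each punctuation mark with a blank space on either side.
    text = re.sub(r"([.,;:()\[\]<>`'])", r" \1 ", text)
    # Collapse the runs of whitespace created by the padding.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Größe der Milz: 12,5 cm."))
```

After this step every token, including each punctuation mark, is separated by a single blank, which is the form assumed by the later lexicon lookup and parsing stages.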



4. BUILDING THE UMLS TERM LEXICON.



      1. Preparation of the UMLS Vocabulary. The Unified Medical Language System (UMLS) of the U. S. National Library of Medicine (USNLM) is by far the world's largest medical concept system (20), and is the best tool for research studies in controlled medical vocabularies. UMLS serves as an indexing tool for PubMed, a collection of over eleven million medical citations available on the Internet to the general public. UMLS provides a uniform, integrated distribution format from over one hundred biomedical vocabularies and classifications. The 2001 UMLS Metathesaurus contains over one million biomedical concepts and over two million different concept-names. Each Concept Unique Identifier (CUI) consists of C followed by seven decimal digits, with an accompanying synonym-name. Different synonym-names for the same concept share the same CUI. UMLS is updated annually and made available cost-free to registered users. English is the primary language of UMLS.

      2. A German-to-UMLS Lexicon was prepared as follows. An uncopyrighted list of English-language autopsy-related words was harvested from The Johns Hopkins Autopsy Resource (JHAR) (6), which is keyed to the UMLS.

      3. This list was computer-transliterated into German by substitution of common suffixes and prefixes, according to Wingert, e.g., hematology ==> Haematologie (26,27,28).

      4. Each word is assigned a part-of-speech, a UMLS code, and an English translation.

      5. Newly-encountered words are default-assigned to NOUN, except for words having common ADJECTIVAL endings, such as:
-ische -ischem -ischen -ischer -isches,
-ale -alem -alen -aler -ales,
-aere -aerem -aeren -aerer -aeres,
-oese -oesem -oesen -oeser -oeses, etc.
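The default part-of-speech rule above can be illustrated with a short Python sketch. The function name `default_part_of_speech` is hypothetical, and the suffix list covers only the four ending families shown above (the source's list ends in "etc.", so the real rule set is larger).

```python
# Adjectival stems taken from the list above; each combines with the
# inflectional endings -e, -em, -en, -er, -es.
ADJECTIVE_STEMS = ("isch", "al", "aer", "oes")
INFLECTIONS = ("e", "em", "en", "er", "es")
ADJECTIVE_SUFFIXES = tuple(
    stem + inflection for stem in ADJECTIVE_STEMS for inflection in INFLECTIONS
)

def default_part_of_speech(word: str) -> str:
    """Return 'A' for a word with a common adjectival ending, otherwise 'N'."""
    return "A" if word.lower().endswith(ADJECTIVE_SUFFIXES) else "N"

for word in ("chronische", "nervoesen", "leber", "pankreas"):
    print(word, default_part_of_speech(word))
```

A suffix rule of this kind will occasionally mislabel a noun that happens to end in one of the listed endings, so the assignment is best treated as a default that a human editor can override.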




5. BUILDING THE ZIPF DISTRIBUTION.



      1. It is extremely helpful to prepare a descending-order frequency distribution, or Zipf distribution, of single words occurring in the source text document. The vertical axis (Y-axis) of a Zipf distribution is the frequency, f, of each word, and the horizontal axis (X-axis) is the RANK, r, for that word.

      2. Rank-one is assigned to the most-frequent word (English: of, and, the; German: der, des, im, und); rank-two is assigned to the second-most-frequent word, etc.

      3. Zipf (29) observed in English-language humanities literature, and others subsequently observed in large healthcare text databases in English (8) and German (24), that there is an approximate inverse relationship between frequency and rank, namely, f = k / r, for some constant, k.

      4. The immediate consequence of Zipf's Law is that fewer than one hundred distinct words, ranked one through 100 (left X-axis), typically account for over half the words in a given, large free-text document; and conversely there are thousands of words (right X-axis) that occur rarely or only once.

      5. Left-Zipf-words, also known as barrier words or stop words, have almost no recall value in a computerized indexing system. On the other hand, right-Zipf-words are highly specific for indexing purposes.

      6. This understanding allows the lexicon builder to prioritize certain groups of words for attachment to UMLS codes.
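A descending-order frequency distribution of this kind can be built in a few lines of Python. The sketch below is illustrative only; the function names `zipf_distribution` and `coverage` and the toy sentence are assumptions made for the example, not part of the GUAR system.

```python
from collections import Counter

def zipf_distribution(tokens):
    """Return (rank, word, frequency) triples in descending order of frequency."""
    counts = Counter(tokens)
    # Ties are broken alphabetically so that the ranking is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

def coverage(distribution, top_n):
    """Fraction of all word occurrences accounted for by the top_n ranked words."""
    total = sum(freq for _, _, freq in distribution)
    return sum(freq for _, _, freq in distribution[:top_n]) / total

tokens = "der befund der leber und der milz und der niere".split()
dist = zipf_distribution(tokens)
print(dist[:2])           # the two most frequent words, with ranks 1 and 2
print(coverage(dist, 2))  # share of the text covered by those two words
```

Under Zipf's Law the product of rank and frequency stays roughly constant, so multiplying the two columns of the output gives a quick check of how closely a given text follows f = k / r.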



6. BARRIER WORD METHOD.



      1. In the Barrier Word Method, short, low-information words serve as BARRIERS between indexable, multiple-word pathology concepts in free-text (22,30,31,32). In the following examples, barrier words are in lower case, and KEYWORDS ARE IN UPPER CASE:
CHRONISCHE PANCREATITIS , vorwiegend im bereich des CAPUT PANCREATIS .
multiple FETTGEWEBSNEKROSEN des CAPUT PANCREATIS und CORPUS PANCREATIS .
SCHLEIMBILDENDES ADENOCARCINOM , stellenweise mit SIEGELRINGZELLEN .
CARCINOMA SOLIDUM SCIRRHOSUM der MAMMA .
COLON SIGMOIDEUM : DIVERTIKEL mit CHRONISCHER DIVERTICULITIS .

2. BARRIER WORDS:
 RANK     UMLS-CUI             FREQUENCY     GERMAN             ENGLISH
    1 ... D:C0205447 ........... 144,291 ... der .............. the
    2 ... PD:C0456627-C0205447 .  87,031 ... des .............. of the
    3 ... Z:C0475207 ...........  72,404 ... cm ............... cm
    4 ... PD:C0332285-C0205447 .  65,237 ... im ............... in the
    5 ... C:C0332287 ...........  63,377 ... und .............. and
    6 ... P:C0332287 ...........  37,173 ... mit .............. with
    7 ... N:C0441949 ...........  34,900 ... X ................ X
    8 ... P:C0205105 ...........  28,725 ... bis .............. until
    9 ... P:C0231290 ...........  28,493 ... nach ............. after
   10 ... A:C0205172 ...........  18,485 ... mehrere .......... more
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.

3. MULTIPLE-WORD TERMS DISCOVERED BY THE BARRIER WORD METHOD:
 RANK     FREQUENCY     GERMAN                                         ENGLISH
    1 ...    12,391 ... AN:mikroskopische Untersuchungen ............. microscopic studies
    2 ...    10,598 ... AN:akute Blutstauung ......................... acute blood congestion
    3 ...     9,167 ... AN:linken Herzventrikels ..................... left cardiac ventricle
    4 ...     7,301 ... AN:rechten Herzventrikels .................... right cardiac ventricle
    5 ...     5,881 ... AN:allgemeine Atherosklerose ................. general atherosclerosis
    6 ...     5,491 ... AN:disseminierte Herzmuskelschwielen ......... disseminated cardiac muscle scars
    7 ...     5,379 ... AN:klinischen Angaben ........................ clinical data
    8 ...     5,318 ... AN:pathologischen Befund ..................... pathologic finding
    9 ...     5,222 ... AN:truebe Schwellung ......................... cloudy swelling
   10 ...     3,403 ... AN:katarrhalische Tracheobronchitis .......... catarrhal tracheobronchitis
   11 ...     3,175 ... AN:Gastromalacia acida ....................... acidic gastromalacia
   12 ...     3,072 ... AN:linker Ventrikel .......................... left ventricle
   13 ...     2,852 ... AN:coronare Atherosklerose ................... coronary atherosclerosis
   14 ...     2,731 ... AN:aryepiglottischen Falten .................. aryepiglottic folds
   15 ...     2,291 ... AN:flache Nierenrindennarben ................. flat renal cortical scars
   16 ...     2,203 ... AN:katarrhalische Bronchitis ................. catarrhal bronchitis
   17 ...     2,051 ... AN:mittelgradige allgemeine Atherosklerose ... moderate-grade general atherosclerosis
   18 ...     2,043 ... AN:eitrige Bronchiolitis ..................... purulent bronchiolitis
   19 ...     1,995 ... AN:strangfoermige Pleuraverwachsungen ........ necklace-like pleural growths
   20 ...     1,920 ... AN:cerebrale Atherosklerose .................. cerebral atherosclerosis
Barrier words include: all punctuation, all numerals, and nearly all one-letter words and two-letter words, articles, prepositions, and common verbs and modifiers. A medical concept, or KEYWORD TERM, is a SEQUENCE OF KEYWORDS UNINTERRUPTED BY BARRIER WORDS.
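The definition of a keyword term as a sequence of keywords uninterrupted by barrier words translates directly into a small routine. The Python sketch below is a minimal illustration: the barrier list is a tiny hand-picked sample chosen to fit the example sentences, and the names `is_barrier` and `keyword_terms` are hypothetical. It also treats every one- and two-letter token as a barrier, whereas the source says "nearly all".

```python
# A small illustrative barrier list; a production list would be far larger.
BARRIER_WORDS = {"der", "des", "und", "mit", "bis", "nach", "vorwiegend",
                 "bereich", "multiple", "stellenweise"}

def is_barrier(token: str) -> bool:
    """Barriers: listed words, punctuation, numerals, and one- or two-letter tokens."""
    return token in BARRIER_WORDS or len(token) <= 2 or not token.isalpha()

def keyword_terms(sentence: str):
    """Split a preprocessed (lower-case, blank-separated) sentence into keyword terms."""
    terms, current = [], []
    for token in sentence.split():
        if is_barrier(token):
            # A barrier closes the keyword sequence collected so far.
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms

print(keyword_terms("chronische pancreatitis , vorwiegend im bereich des caput pancreatis ."))
```

Counting the terms returned over a whole corpus yields a frequency table of multiple-word terms of the same kind as the one shown above.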



7. BUILDING THE PARSING TABLE.



      1. Basic German sentence patterns are common and obvious, and may be obtained from elementary grammar texts.

      2. A MUMPS program attempts to parse each initial sentence.

      3. Each failed parse is examined manually, in order to enrich the parsing table, or verify that the sentence was incorrectly formed.

      4. Sample Zipf Distribution of parsing formulas for English surgical pathology reports (33).

      5. Sample parser for selected sentences (34).



8. SYNTACTIC PROCESSING.



      1. Chomsky's Generative grammar (33,35). Each word/phrase is assigned a part-of-speech, or choice of parts-of-speech. Therefore, each sentence points to a sequence of parts-of-speech. Each grammatically correct sentence pattern corresponds to a part-of-speech sequence, listed in the parsing table. A long sentence is parsed iteratively.

      2. For example, this German sentence: ` Hypertrophie und Dilatation des linken Herzventrikels ' ( English: Hypertrophy and dilatation of-the left cardiac-ventricle ) finds the following parts-of-speech and UMLS codes in the German Lexicon:
hypertrophie::N:C0020564
und::C:C0332287
dilatation::N:C0012356
des::PD:C0456627-C0205447
linken::A:C0208973
herzventrikels::N:C0018827
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.

      3. The part-of-speech-sequence, or parsandum, for the example is: `NCNDAN'. This parsandum is known to the parser, so the sentence parses successfully (23,24). The final UMLS-coded sentence is:
N:C0020564 C:C0332287 N:C0012356 P-D:C0456627-C0205447 A:C0208973 N:C0018827

      4. Final sentence translated into English:
Hypertrophy and dilatation of-one left cardiac-ventricle.
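The lookup-and-match step in this example can be sketched in Python. The lexicon entries are copied from the example above; everything else is an assumption for illustration: the function names, the single-entry parsing table, and in particular the rule that a contraction tagged PD contributes only its final letter to the parsandum, which is adopted here solely so that the sketch reproduces the `NCNDAN' given in the text.

```python
# Lexicon entries copied from the example: word -> (part-of-speech, UMLS code).
LEXICON = {
    "hypertrophie": ("N", "C0020564"),
    "und": ("C", "C0332287"),
    "dilatation": ("N", "C0012356"),
    "des": ("PD", "C0456627-C0205447"),
    "linken": ("A", "C0208973"),
    "herzventrikels": ("N", "C0018827"),
}

# Parsing table: the set of known part-of-speech sequences (one entry here).
PARSING_TABLE = {"NCNDAN"}

def parsandum(words):
    """Build the part-of-speech sequence; a contraction such as PD contributes
    only its last letter (an assumption made to match the example)."""
    return "".join(LEXICON[word][0][-1] for word in words)

def parse(sentence: str):
    """Return the UMLS-coded sentence, or None when a word or pattern is unknown."""
    words = sentence.lower().split()
    if any(word not in LEXICON for word in words):
        return None
    if parsandum(words) not in PARSING_TABLE:
        return None
    return " ".join(f"{LEXICON[w][0]}:{LEXICON[w][1]}" for w in words)

print(parse("Hypertrophie und Dilatation des linken Herzventrikels"))
```

A return value of None corresponds to a failed parse, which in the workflow described earlier would be examined manually to decide whether the parsing table needs a new entry.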



9. WORD ORDER REARRANGEMENT.



      1. If necessary, stereotypic German word order is converted into stereotypic English word order (23,25). A common non-English grammatical construction found in German autopsy reports is illustrated in the following sentence:
Sigmadivertikulose mit mehreren bis 1-cm tiefen Divertikeln
(English: Sigmoid-colon-diverticulosis with multiple up-to 1-cm deep diverticula )
      2. The parsandum for the German sentence is: `NPAPZAN'

      3. The conventional English word-order for this sentence would be: Sigmoid-colon-diverticulosis with multiple diverticula up-to 1-cm deep. Thus, the parsandum for the sentence in stereotypic English word order would be: `NPANPZA'.

      4. The German parsandum is transformed into its corresponding English parsandum with the following formula:
¹[ ²N ³P ⁴A ⁶P ⁷Z ⁸A ⁵N ⁹]
The preceding superscript gives the new position to move each part-of-speech.

      5. This parser has been in long-term use at the Goethe University Center for Medical Informatics since 1990 (16).
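The rearrangement formula can be read as a permutation: each part-of-speech carries the position it should occupy in the output. The Python sketch below applies such a permutation to the example sentence. It is an illustration only; the function name `rearrange` is hypothetical, and the position list is the formula's numbers 2 through 8 renumbered as 1 through 7, on the assumption that positions 1 and 9 belong to the enclosing brackets.

```python
def rearrange(tokens, new_positions):
    """Move tokens[i] to the 1-based position new_positions[i]."""
    if sorted(new_positions) != list(range(1, len(tokens) + 1)):
        raise ValueError("new_positions must be a permutation of 1..n")
    result = [None] * len(tokens)
    for token, position in zip(tokens, new_positions):
        result[position - 1] = token
    return result

# German parsandum NPAPZAN -> English parsandum NPANPZA.
# Formula positions 2,3,4,6,7,8,5 shifted down by one (brackets dropped).
GERMAN_TO_ENGLISH = [1, 2, 3, 5, 6, 7, 4]

german = ["Sigmadivertikulose", "mit", "mehreren", "bis", "1-cm", "tiefen", "Divertikeln"]
print(" ".join(rearrange(german, GERMAN_TO_ENGLISH)))
print("".join(rearrange(list("NPAPZAN"), GERMAN_TO_ENGLISH)))
```

Applying the same permutation to the part-of-speech letters and to the words keeps the two in step, so the reordered words can then be translated one by one.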



10. USER INTERFACE.



      1. The interface assumes a slight knowledge of German medical terminology, or a willingness to learn. Most German medical words are Greco-Latin derivatives, and differ from their English counterparts by only a few spelling conventions (26,27,28).

      2. Choice of English or German input word.

      3. For German input word, browser goes directly to autopsy text containing those German words.

      4. For English input word, browser shows corresponding German words, and prompts the user for a German search word.

      5. Bilingual display, UMLS codes if desired.

      6. Bilateral cross linkage to The Johns Hopkins Autopsy Resource (6).



11. BUILDING MEDICAL ONTOLOGIES.



      1. An ONTOLOGY is a (Platonic) description of essential reality, i.e., what actually is, as opposed to what one can see (observation, accident), or what one can know (epistemology) (36). The term ontology was coined by two German philosophers, Göckel and Lorhard, in 1613, and first appeared in English in 1721. Quine (37) views ontology as the metaphysical commitments or presuppositions embodied in the different natural sciences. For example, the belief that a cancer can metastasize would be an ONTOLOGICAL COMMITMENT. In the philosophy and practice of science, ontology goes under various names: essence, reality, Mind of God, nature, gold standard, or mathiverse (38). In medical informatics, ontology has come to mean a structured list of concepts, typically prepared by an expert or panel of experts.

      2. With the ease of posting structured lists on the Internet, and with Extensible Markup Language (XML) as an emerging standard for such lists, it is likely that the next decade will witness an explosion of public medical ontologies, both amateur and professional.

      3. The importance of ontologies has been recognized by the U. S. Defense Advanced Research Projects Agency (DARPA), the original sponsor of the Internet, which has proposed guidelines for a formal ontology AGENT MARKUP LANGUAGE, that employs the ONTOLOGY INFERENCE LAYER (39,40).

      4. A simple ontology is illustrated by the observation at autopsy that CHRONIC-PASSIVE-CONGESTION-LIVER (CPCL = C0700148-C0721399) often accompanies HEART HYPERTROPHY (HH = C0795691-C0333959). In an approximate sense, HH causes CPCL (41). Thus, one might expect a hypothetical collection, say, of 10,000 autopsy cases to distribute as follows:
              HH = HEART HYPERTROPHY (C0795691-C0333959)
 ___________________________________________
|            |  HH (-) |   HH (+) |  Total  |
|____________|_________|__________|_________|
|  CPCL (-)  |   7,000 |    1,000 |  8,000  |
|____________|_________|__________|_________|
|  CPCL (+)  |       0 |    2,000 |  2,000  |
|____________|_________|__________|_________|
|   Total    |   7,000 |    3,000 | 10,000  |
|____________|_________|__________|_________|
 CPCL = CHRONIC-PASSIVE-CONGESTION-LIVER (C0700148-C0721399).


      5. That is, most autopsies are negative for both features; HH anticipates CPCL in some autopsies; but there should be only rare autopsies with CPCL but without HH. Therefore, in the language of propositional logic, CPCL+ IMPLIES HH+.

      6. Such correlations (2x2 CONTINGENCY TABLES) could be edited for redundancy and nonsense correlations (42).

      7. As necessary, a collection of such 2x2 contingency tables could be ORDERED BY IMPORTANCE, based upon the frequencies of autopsy cases appearing in the lower right corner of the table.
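The 2x2 table and the implication test can be sketched in Python as follows. This is a hypothetical illustration, not the authors' implementation: the representation of each autopsy case as a set of codes, the function names, and the `tolerance` parameter (the number of exceptions allowed before the implication is rejected) are all assumptions. The case list simply reproduces the hypothetical counts of the table above.

```python
from collections import Counter

def contingency_table(cases, row_feature, col_feature):
    """Count cases by presence (True) or absence (False) of two coded features.

    Each case is a set of codes; result keys are (row_present, col_present)."""
    table = Counter({(r, c): 0 for r in (False, True) for c in (False, True)})
    for codes in cases:
        table[(row_feature in codes, col_feature in codes)] += 1
    return table

def implies(table, tolerance=0):
    """Row feature implies column feature when row(+)/column(-) cases
    do not exceed the tolerance."""
    return table[(True, False)] <= tolerance

CPCL, HH = "C0700148-C0721399", "C0795691-C0333959"
# Hypothetical collection of 10,000 cases matching the table above.
cases = [set()] * 7000 + [{HH}] * 1000 + [{HH, CPCL}] * 2000

table = contingency_table(cases, CPCL, HH)
print(table[(True, True)], implies(table))
```

Ordering a collection of such tables by importance then amounts to sorting them on the `(True, True)` cell, which holds the count from the lower right corner.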



12. RESULTS: DISTRIBUTION OF WORDS AND UMLS CODES.



      1. ZIPF DISTRIBUTION OF WORDS IN THE GUAR. The GUAR contained a total of 2,996,154 words. There were 35,145 distinct words, ranging in frequency from 144,291 occurrences of `der' to one occurrence apiece of 14,512 single words. In some cases, the German word corresponds to a compound concept in UMLS, as for example, CARCINOMMETASTASEN (carcinomatous-metastases, C0007097-C0027627). The hundred most frequent words are as follows:
 RANK     UMLS-CUI           FREQUENCY     GERMAN               ENGLISH
    1 ... C0205447 ........... 144,291 ... D:der .............. the
    2 ... C0456627-C0205447 ..  87,031 ... PD:des ............. of the
    3 ... C0475207 ...........  72,404 ... Z:cm ............... cm
    4 ... C0332285-C0205447 ..  65,237 ... PD:im .............. in the
    5 ... C0332287 ...........  63,377 ... C:und .............. and
    6 ... C0332517 ...........  38,481 ... N:Durchmesser ...... diameter
    7 ... C0549177 ...........  38,103 ... A:grosse ........... large
    8 ... C0332287 ...........  37,173 ... P:mit .............. with
    9 ... C0441949 ...........  34,900 ... N:X ................ X
   10 ... C0019080-C0700148 ..  33,576 ... N:Blutstauung ...... blood congestion
   11 ... C0439801 ...........  32,314 ... A:geringe .......... limited
   12 ... C0205105 ...........  28,725 ... P:bis .............. until
   13 ... C0231290 ...........  28,493 ... P:nach ............. after
   14 ... C0205090 ...........  27,177 ... A:rechten .......... right
   15 ... C0205091 ...........  26,941 ... A:linken ........... left
   16 ... C0721399 ...........  24,547 ... N:Leber ............ liver
   17 ... C0004153 ...........  23,484 ... N:Atherosklerose ... atherosclerosis
   18 ... C0205090 ...........  20,306 ... B:rechts ........... right
   19 ... C0205091 ...........  19,368 ... B:links ............ left
   20 ... C0012359 ...........  18,854 ... N:Dilatation ....... dilatation
   21 ... C0205172 ...........  18,485 ... A:mehrere .......... more
   22 ... C0348080 ...........  18,045 ... N:Zustand .......... condition
   23 ... C0439203 ...........  16,860 ... P:in ............... in
   24 ... C0018827 ...........  16,678 ... N:Herzventrikels ... cardiac ventricle
   25 ... C0443224 ...........  15,692 ... A:frische .......... fresh
   26 ... C0522501 ...........  15,552 ... A:ausgepraegte ..... striking
   27 ... C0024109 ...........  15,100 ... N:Lunge ............ lung
   28 ... C0439539 ...........  14,581 ... A:schwere .......... heavy
   29 ... C0430007 ...........  13,018 ... N:Untersuchungen ... investigations
   30 ... C0034063 ...........  12,987 ... N:Lungenoedem ...... pulmonary edema
   31 ... C0401925-C0243095 ..  12,729 ... N:Hauptbefund ...... principal finding
   32 ... C0238767 ...........  12,654 ... B:beiderseits ...... bilaterally
   33 ... C0237401-C0348026 ..  12,427 ... N:Einzeldiagnosen .. individual diagnoses
   34 ... C0205288 ...........  12,420 ... A:mikroskopische ... microscopic
   35 ... C0205178 ...........  12,233 ... A:akute ............ acute
   36 ... C0496927 ...........  12,162 ... N:Niere ............ kidney
   37 ... C0022646 ...........  12,142 ... N:Nieren ........... kidneys
   38 ... C0205081 ...........  12,099 ... A:mittelgradige .... moderate grade
   39 ... C0205246 ...........  12,046 ... A:allgemeine ....... general
   40 ... C0238767 ...........  11,765 ... A:beider ........... both
   41 ... C0812414 ...........  11,569 ... N:Milz ............. spleen
   42 ... C0205147 ...........  11,145 ... N:Bereich .......... region
   43 ... C0795691 ...........  10,993 ... N:Herz ............. heart
   44 ... C0549177 ...........  10,819 ... A:grosses .......... large
   45 ... C0333959 ...........  10,611 ... N:Hypertrophie ..... hypertrophy
   46 ... C0205191 ...........  10,409 ... A:chronische ....... chronic
   47 ... C0027361 ...........  10,238 ... N:Seite ............ page
   48 ... C0205160 ...........  10,151 ... B:nicht ............ not
   49 ... C0238767 ...........   9,980 ... B:beidseits ........ bilaterally
   50 ... C0439239 ...........   9,689 ... Z:ml ............... ml
   51 ... C0027061-C0008767 ..   9,295 ... N:Herzmuskelschwielen . cardiac muscle scars
   52 ... C0549177 ...........   9,094 ... A:grosser .......... large
   53 ... C0154054 ...........   9,084 ... N:Lymphknoten ...... lymph nodes
   54 ... C0018964-C0014448 ..   9,072 ... N:HE ............... hematoxylin-eosin
   55 ... C0024109 ...........   8,981 ... N:Lungen ........... lungs
   56 ... C0439751 ...........   8,804 ... B:ganz ............. entirely
   57 ... C0205210 ...........   8,357 ... A:klinischen ....... clinical
   58 ... C0001779 ...........   8,339 ... N:Alter ............ age
   59 ... C0242356 ...........   8,285 ... N:Angaben .......... data
   60 ... C0700321 ...........   8,006 ... A:kleine ........... small
   61 ... C0019080 ...........   7,980 ... N:Blutungen ........ hemorrhages
   62 ... C0004144 ...........   7,865 ... N:Atelektasen ...... atelectases
   63 ... C0236177 ...........   7,818 ... A:katarrhalische ... catarrhal
   64 ... C0205221 ...........   7,728 ... A:disseminierte .... disseminated
   65 ... C0018827 ...........   7,617 ... N:Herzventrikel .... cardiac ventricle
   66 ... C0549177 ...........   7,514 ... A:grossen .......... large
   67 ... C0016059 ...........   7,477 ... N:Fibrose .......... fibrosis
   68 ... C0003483 ...........   7,387 ... N:Aorta ............ aorta
   69 ... C0443224 ...........   7,210 ... A:frischer ......... fresher
   70 ... C0205094-C0441995 ..   7,165 ... N:Vorderwand ....... anterior wall
   71 ... C0006285 ...........   7,096 ... N:Bronchopneumonie . bronchopneumonia
   72 ... C0442504 ...........   7,066 ... V:stellen .......... place
   73 ... C0029456 ...........   7,011 ... N:Osteoporose ...... osteoporosis
   74 ... C0038999 ...........   6,814 ... N:Schwellung ....... swelling
   75 ... C0243095 ...........   6,765 ... N:Befund ........... finding
   76 ... C0205294 ...........   6,703 ... A:multiple ......... multiple
   77 ... C0205082 ...........   6,646 ... A:hochgradige ...... high grade
   78 ... C0149514 ...........   6,277 ... N:Bronchitis ....... bronchitis
   79 ... C0439126 ...........   6,268 ... D:eines ............ a
   80 ... C0205234 ...........   6,239 ... A:herdfoermige ..... focal
   81 ... C0205041 ...........   6,188 ... A:coronare ......... coronary
   82 ... C0007097-C0027627 ..   6,156 ... N:Carcinommetastasen . carcinomatous metastases
   83 ... C0443176 ...........   6,124 ... A:umschriebene ..... circumscribed
   84 ... C0012359 ...........   6,077 ... N:Ektasie .......... ectasia
   85 ... C0439665 ...........   6,059 ... A:eitrige .......... purulent
   86 ... C0332288 ...........   5,968 ... P:ohne ............. without
   87 ... C0024109-C0003850 ..   5,968 ... N:Pulmonalarteriensklerose . pulmonary arteriosclerosis
   88 ... C0750873 ...........   5,804 ... N:Colon ............ colon
   89 ... C0205091 ...........   5,695 ... A:linker ........... left
   90 ... C0302132 ...........   5,693 ... A:ausgedehnte ...... bulging
   91 ... C0205469 ...........   5,664 ... A:pathologischen ... pathologic
   92 ... C0813176 ...........   5,646 ... N:Pankreas ......... pancreas
   93 ... C0005682 ...........   5,481 ... N:Harnblase ........ urinary bladder
   94 ... C0205406 ...........   5,381 ... A:truebe ........... cloudy
   95 ... C0004372 ...........   5,199 ... N:Autolyse ......... autolysis
   96 ... C0205447 ...........   5,176 ... D:den .............. the
   97 ... C0439798-C0019080 ..   4,878 ... N:Schleimhautblutungen . mucosal hemorrhages
   98 ... C0332251 ...........   4,817 ... B:vorwiegend ....... predominantly
   99 ... C0205447 ...........   4,798 ... D:die .............. the
  100 ... C0750519 ...........   4,797 ... I:wegen ............ because of
A=adjective, B=adverb, C=conjunction, D=determiner,
H=helpingverb, I=complementizer, N=noun, P=preposition,
Q=pronoun, V=mainverb, Z=number.

      2. Complete listing of: Barrier words (50), UMLS translations (51), Zipf Distribution (52).



13. DISCUSSION.



      1. There is a perceived need, and a potentially enormous opportunity, to share anatomic pathology records between institutions, and even between different medical language domains. Vast resources of data are currently available as text files in pathology departments worldwide. Possible uses for these shared resources include: epidemiologic studies, outcome studies, and quality assurance studies. These unstructured text files often contain highly specific information about an individual patient, which is both difficult to extract in a standardized fashion, and is potentially open to attack by hostile persons in search of confidential patient data.

      2. The serious issues of patient and provider confidentiality can be addressed by removing specific identifiers, and by Sweeney's k-anonymity test. The Johns Hopkins Autopsy Resource (JHAR) is already an example of how this can be achieved, for over 50,000 reports in the English language domain. Further anonymization can be achieved by standardization of terminology and translation into a coded language.

      3. GUAR is an example of over 12,000 reports in the German language domain, with linkages to the JHAR. It offers semiautomated translation into English and UMLS, and an interface across language domains, including English, German, UMLS, and XML. As a decentralized database with common codes and a bilingual output, GUAR serves as a model for decentralized autopsy databases worldwide, across institutions and languages.

      4. Such coded registers can serve as a resource for building medical ontologies, and posting them on the Internet, for better understanding of medical knowledge. Initially, ontologies can be built from textbooks or textbook-summaries, and then fine-tuned through examination of autopsy registers.

      5. GUAR can serve as a tool for broader understanding of medical relationships, and a wider forum for medical knowledge dissemination and discussion, through such Internet ontologies. GUAR is a model for inexpensive international collaborative projects.



14. REFERENCES.



      1. Peery TM.
The autopsy data bank. A proposal for pathologists to contribute to the health care of the nation.
Am J Clin Pathol 69 (Suppl): 258-259, 1978.

      2. Carter JR, Nash NP, Cechner RL, Platt RD.
Proposal for a national autopsy data bank. A potential major contribution of pathologists to the health care of the nation.
Am J Clin Pathol. 76 (Suppl): 597-617, 1981.

      3. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years.
Arch Pathol Lab Med. 1996 Aug;120(8):782-785.

      4. Moore GW, Berman JJ.
Anatomic Pathology Data Mining. Chapter 4.
In: Cios KJ. Medical Data Mining and Knowledge Discovery. 2001.
Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company. Heidelberg: Springer-Verlag.
ISBN: 3-7908-1340-0, 502 pages.

      5. Shared Pathology Informatics Network.
http://grants.nih.gov/grants/guide/rfa-files/RFA-CA-01-006.html

      6. The Johns Hopkins Autopsy Resource:
http://www.netautopsy.org/

      7. Berman JJ, Moore GW, Hutchins GM.
Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD).
JAMIA (Suppl). 1996;20:328-332. Proc AMIA Annu Fall Symp. 1996;20:328-332.

      8. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
Integrated anatomic pathology reporting system using natural language diagnoses.
Modern Pathol 1:44-50, 1988.

      9. U. S. Code of Federal Regulations. 1995. 45 CFR Subtitle A (10-1-95 Edition), part 46.101 (b) (4).
U. S. Department of Health and Human Services. Office of the Secretary.
The complete Common Rule document (45CFR46), at URL:
http://www.uaf.edu/oar/irb/45cfr46.html
or at URL:
http://ohrp.osophs.dhhs.gov/humansubjects/guidance/45cfr46.htm

      10. U. S. Code of Federal Regulations. 1999. 45 CFR Parts 160 - 164. Standards for Privacy of Individually Identifiable Health Information; Proposed Rule.
Department of Health and Human Services. Office of the Secretary.
Fed Regist. 1999 Nov 3;64(212):59917-59966. http://aspe.hhs.gov/admnsimp/

      11. National Cancer Institute's Confidentiality Brochure, at URL:
http://www-cdp.ims.nci.nih.gov/policy.html

      12. Sweeney L.
Computational Disclosure Control: A Primer on Data Privacy Protection.
PhD Thesis. Massachusetts Institute of Technology. Spring, 2001. Draft.
http://www.swiss.ai.mit.edu/classes/6.805/articles/privacy/sweeney-thesis-draft.pdf

      13. Sweeney L.
Privacy and medical-records research.
N Engl J Med. 1998 Apr 9;338(15):1077.
PMID: 9537887; UI: 98181820.

      14. Sweeney L.
Guaranteeing anonymity when sharing medical data, the Datafly System.
Proc AMIA Annu Fall Symp. 1997;:51-55.
PMID: 9357587; UI: 98020458.

      15. Goethe University Autopsy Resource.
http://www.medparse.com/guaruprt.htm

      16. Giere W, Moore GW.
Xmed-ED. EDV-gestützte Übersetzungen medizinischer Texte aus dem Englischen ins Deutsche.
Messe-Exponate der Uni-Frankfurt, 1994.
In: Kirsten W, Klar R, eds. Dokumentation und Informationsaufbereitung für den Arzt. Beiträge zur Medizinischen Informatik von Wolfgang Giere.
Darmstadt: Epsilon Verlag, 1996.
ISBN 3-9803214-7, 437 pages.

      17. Moore GW, Hutchins GM.
The persistent importance of autopsies.
Mayo Clin Proc. 2000 Jun;75(6):557-8.

      18. U. S. National Library of Medicine.
Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/

      19. U. S. National Library of Medicine.
UMLS Knowledge Sources. Twelfth Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2001. See also: Ninth Edition, 1998.

      20. Hahn U, Romacker M, Schulz S.
How knowledge drives understanding -- matching medical ontologies with the needs of medical language processing.
Artif Intell Med 1999; 15:25-51.

      21. Moore GW, Hutchins GM, Boitnott JK, Miller RE, Polacsek RA.
Word root translation of 45,564 autopsy reports into MeSH titles.
Proc Annu Symp Comput Appl Med Care. 1987;11:. Washington DC, November 1-4, 1987.

      22. Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the barrier word method.
In: Scherrer JR, Cote RA, and Mandil SH, eds., Computerized Natural Medical Language Processing for Knowledge Representation.
Amsterdam: North-Holland. 1989;:29-39.
ISBN 0-444-87356-2, 296 pages.

      23. Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Automated translation of German to English medical text.
Am J Med. 1986 Jul;81(1):103-111.

      24. Giere W.
Foundations of clinical data automation in cooperative programs.
5th Ann Symp Comp Applic Med Care (Heffernan HG, ed). Washington, DC, 1981, pp. 1142-1148.
Fifth Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

      25. Giere W, Moore GW.
Translating English into German using VA File Manager.
M Computing, 1:16-23, 1993.

      26. Wingert F.
Medical Linguistics: Automated Indexing into SNOMED.
CRC, Critical Reviews in Medical Informatics. 1988;1:333-403.

      27. Wingert F, Rothman D, Cote RA.
Automated Indexing into SNOMED and ICD.
In: Scherrer JR, Cote RA, and Mandil SH, eds., Computerized Natural Medical Language Processing for Knowledge Representation.
Amsterdam: North-Holland. 1989;:201-239.
ISBN 0-444-87356-2, 296 pages.

      28. Moore GW, Polacsek RA, Erozan YS, de la Monte SM, Miller RE, Hutchins GM, Riede UN.
Multilingual translation techniques in the analysis of narrative medical text.
Comput Methods Programs Biomed. 1986 Mar;22(1):35-42.

      29. Zipf GK.
On the Economy of Words.
Chapter 2 in: Human Behavior and The Principle of Least Effort. An Introduction to Human Ecology.
Cambridge, MA: Addison-Wesley Press, Inc. 1949;:19-55.

      30. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.

      31. Nelson SJ, Cole WG, Tuttle MS, Olson NE, Sherertz DD.
Recognizing new medical knowledge computationally.
Proc Annu Symp Comput Appl Med Care. 1993;17:409-413.

      32. Wilbur WJ.
Overview of Books at NCBI.
http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

      33. The Johns Hopkins Autopsy Resource: Parsing formulas for the nine-million word Johns Hopkins anatomic pathology files.
http://www.netautopsy.org/vhpsapsx.htm

      34. Sample sentences parsed from English to UMLS.
http://www.medparse.com

      35. Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: The MIT Press. 1965.

      36. Smith B.
Mereotopology: A Theory of Parts and Boundaries.
Data and Knowledge Engineering. 1996;20:287-303.

      37. Quine WVO.
Ontological Relativity and Other Essays.
New York: Columbia University Press. 1969;:.

      38. Stewart I.
Flatterland. Like Flatland. Only More So.
Cambridge, MA: Perseus Publishing. 2001.
ISBN 0-7382-0442-0, 301 pages.

      39. U. S. Defense Advanced Research Projects Agency (DARPA). Agent Markup Language.
http://www.daml.org

      40. U. S. Defense Advanced Research Projects Agency (DARPA). Ontology Inference Layer.
http://www.ontoknowledge.org/oil

      41. Vigorita VJ, Moore GW, Hutchins GM.
Absence of correlation between coronary arterial atherosclerosis and severity or duration of diabetes mellitus of adult onset.
Am J Cardiol. 1980;46:535-542.

      42. Moore GW, Hutchins GM.
Consistency versus completeness in medical decision making: Application to 155 patients autopsied after coronary artery bypass graft surgery.
Proc 6th Annu Symp Comput Appl Med Care. 1982;6:805-811.

      50. Complete listing of German Barrier Words:
http://www.medparse.com/guarbarr.htm

      51. Complete listing of German UMLS translations:
http://www.medparse.com/guarumls.htm

      52. Complete listing of German Zipf distribution:
http://www.medparse.com/guarzipf.htm

      53. Perl Script for Goethe University Autopsy Register.
http://www.medparse.com/guarsrch.txt



Last Updated: October 7, 2001, by G. William Moore, MD, PhD.