AUTOMATIC INDEXING OF A
PATHOLOGY IMAGE ARCHIVE
USING UMLS.

G. William Moore, MD, PhD [1,2,3].
David S. Brenner, MD [2].
Jules J. Berman, PhD, MD [3,4].



Pathology and Laboratory Medicine Service, Veterans Affairs Maryland Health Care System, Baltimore, Maryland [1].
Department of Pathology, University of Maryland School of Medicine, Baltimore, Maryland [2].
Department of Pathology, The Johns Hopkins Medical Institutions, Baltimore, Maryland [3].
Resource Development Branch, National Cancer Institute, National Institutes of Health, Bethesda, Maryland [4].

TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. DESIGN.
4. MATERIALS AND METHODS.
5. UNIFIED MEDICAL LANGUAGE SYSTEM.
6. SAMPLE UMLS RECORD.
7. REDUNDANT INDEXING OF SUBCONCEPTS.
8. BARRIER WORD METHOD.
9. BARRIER WORD METHOD: SAMPLE TEXT.
10. AMBIGUITIES IN UMLS.
11. RESULTS.
12. DISCUSSION.
13. CONCLUSION.
14. REFERENCES.
15. ZIPF DISTRIBUTION OF UMLS CUIS.


1. ABSTRACT.



      Background: The value of any large image archive resides in the ability to select and retrieve images based upon features of interest in the images. Images can be automatically encoded, from descriptive text (image-legends), into concept codes of the Unified Medical Language System (UMLS) of the U.S. National Library of Medicine. The technique permits powerful image categorization and retrieval, and is generalizable to image archives of enormous size.

      Design: A collection of 5,465 pathology image-legends was encoded into UMLS concept codes, via a computer translation program that parses and maps plain-text image-legends into lists of UMLS terms. Indexing software was written in M-language (formerly, MUMPS), and display software was written in the Practical Extraction and Reporting Language (PERL).

      Results: The program assigned an average of 15.6 index-terms per image-legend, ranging from five terms in the least-indexed legend to 58 terms in the most-indexed legend. It assigned 3,016 distinct UMLS concepts across the entire image-legend text-file. Of these 3,016 concepts, 875 were each assigned to a single image-legend; the remaining 2,141 concepts were each assigned to multiple image-legends. In a manual survey of image-legends, 3.1% of the UMLS concepts that should have been assigned by the system were not (false-negative rate).

      Conclusion: Since the image-legends contain image descriptors (e.g., eosinophilic, small blue, green, electron microscopy), as well as pathologic terms (i.e., body-site and disease names), this UMLS index can be used to retrieve images by a wide variety of concepts. Since UMLS includes over one million internal links among synonymous and related terms, image retrieval via a UMLS-encoded index may succeed, even when a chosen query term is not included in the image-legend.

     


2. INTRODUCTION.




  • GROWING INTEREST IN IMAGE CAPTURE IN PATHOLOGY.

  • RELATIVELY LITTLE ATTENTION TOWARD IMAGE RETRIEVAL.

  • LARGE IMAGE ARCHIVES BECOME IMAGE CEMETERIES: BURIED IMAGES.

  • INDEXING BY SINGLE DIAGNOSTIC TERM MAY IMPEDE RETRIEVAL BY ALTERNATE TERMINOLOGY.

  • INFORMATION RETRIEVAL: MOST FUNDAMENTAL PROBLEM FACING ANY ARCHIVIST.

  • NO VALUE IN ARCHIVING UNRETRIEVABLE IMAGES.



3. DESIGN.




  • PATHOLOGY IMAGES INDEXED USING UNIFIED MEDICAL LANGUAGE SYSTEM (UMLS).

  • ENCODE IMAGES UNDER ALL PATHOLOGIC CONCEPTS IN IMAGE LEGEND-TEXTS.

  • IMAGES LOADED INTO JOHNS HOPKINS AUTOPSY RESOURCE IMAGE ARCHIVE (JHAR-IA).

  • www.netautopsy.org

  • CLICK ON: 5000 IMAGES.



4. MATERIALS AND METHODS.




  • 6,241 LEGEND-TEXTS FROM ELECTRONIC FASCICLES OF AFIP.

  • 5,465 NON-COPYRIGHTED IMAGES COMPRESSED 1:10 AS JPEG FILES.

  • INDEXING SOFTWARE WRITTEN IN: M-LANGUAGE (FORMERLY, MUMPS).

  • DISPLAY SOFTWARE WRITTEN IN:
    PRACTICAL EXTRACTION AND REPORTING LANGUAGE (PERL).



5. UNIFIED MEDICAL LANGUAGE SYSTEM (UMLS).




  • UNIFIED MEDICAL LANGUAGE SYSTEM (UMLS): DEVELOPED BY U.S. NATIONAL LIBRARY OF MEDICINE (USNLM) IN 1986.

  • PURPOSE: AID DEVELOPMENT OF SYSTEMS TO RETRIEVE ELECTRONIC BIOMEDICAL INFORMATION.

  • http://www.nlm.nih.gov/research/umls/

  • LAST UPDATED: March 19, 1999.

  • SIZE: 96,412,092 BYTES.

  • CONCEPT UNIQUE IDENTIFIERS (CUIs): 625,530, MAX=C0700344.

  • SYNONYMS: 1,362,823.

  • LANGUAGE: PRIMARILY ENGLISH.

  • PARTIAL TRANSLATIONS: GERMAN, FRENCH, SPANISH, ITALIAN, RUSSIAN, DUTCH, PORTUGUESE, HUNGARIAN, FINNISH, SWEDISH, NORWEGIAN, DANISH.

  • OVER 50 SOURCE-VOCABULARIES.



6. SAMPLE UMLS RECORD: ADRENAL GLAND (C0001625).


     C0001625|ENG|P|L0001625|PF|S0011239|Adrenal Glands|
     C0001625|ENG|P|L0001625|VC|S0352314|ADRENAL GLANDS|
     C0001625|ENG|P|L0001625|VC|S0354521|Adrenal glands|
     C0001625|ENG|P|L0001625|VO|S0354515|Adrenal gland, NOS|
     C0001625|ENG|P|L0001625|VO|S0799809|Adrenal gland <1>|
     C0001625|ENG|P|L0001625|VS|S0002402|Adrenal gland|
     C0001625|ENG|P|L0001625|VS|S0354508|Adrenal Gland|
     C0001625|ENG|P|L0001625|VS|S0414419|adrenal gland|
     C0001625|ENG|P|L0001625|VW|S0044868|Glands, Adrenal|
     C0001625|ENG|P|L0001625|VWS|S0044829|Gland, Adrenal|
     C0001625|ENG|S|L0579081|PF|S0740979|Suprarenal gland|
     C0001625|ENG|S|L0847296|PF|S0892764|Glandula suprarenalis|
    


  • EACH INDIVIDUAL RECORD DELINEATED BY NEWLINE BREAK.

  • WITHIN EACH RECORD, SEVEN FIELDS, VARIABLE IN LENGTH, SEPARATED BY VERTICAL PIPE, |, ASCII 124.

  • FIELD 1: CONCEPT UNIQUE IDENTIFIER, CUI (C0001625).

  • FIELD 2: LANGUAGE DESIGNATION: ENG, GER, FRE, etc.

  • FIELD 4: LEXICAL UNIQUE IDENTIFIER, LUI, i.e., L0001625, L0579081, L0847296.

  • FIELD 6: STRING UNIQUE IDENTIFIER, SUI, e.g., S0011239, S0352314, S0354521, ...

  • FIELD 7: TEXT FIELD, MATCHED TO A STRING IN IMAGE-LEGEND-TEXT.
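
The record layout described above can be read with a few lines of code. The following is a minimal Python sketch, not the authors' M-language software: the names UmlsRecord, parse_record and build_lookup are hypothetical, fields 3 and 5 are carried along unnamed because the poster does not describe them, and upper-casing the text field before lookup is an assumption about how case variants might be folded together.

```python
from typing import Dict, Iterable, NamedTuple, Set


class UmlsRecord(NamedTuple):
    cui: str       # field 1: concept unique identifier
    language: str  # field 2: language designation (ENG, GER, FRE, ...)
    field3: str    # field 3: not described on this poster
    lui: str       # field 4: lexical unique identifier
    field5: str    # field 5: not described on this poster
    sui: str       # field 6: string unique identifier
    text: str      # field 7: text matched against the image-legend


def parse_record(line: str) -> UmlsRecord:
    """Split one pipe-delimited record into its seven fields."""
    fields = line.strip().split("|")
    # Each record ends with a trailing pipe, which yields one empty element.
    if fields[-1] == "":
        fields.pop()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return UmlsRecord(*fields)


def build_lookup(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each upper-cased English string to the CUIs it names."""
    lookup: Dict[str, Set[str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = parse_record(line)
        if record.language == "ENG":
            lookup.setdefault(record.text.upper(), set()).add(record.cui)
    return lookup


# Three of the twelve sample records shown above.
SAMPLE = """\
C0001625|ENG|P|L0001625|PF|S0011239|Adrenal Glands|
C0001625|ENG|P|L0001625|VS|S0002402|Adrenal gland|
C0001625|ENG|S|L0579081|PF|S0740979|Suprarenal gland|
"""

lookup = build_lookup(SAMPLE.splitlines())
print(lookup["SUPRARENAL GLAND"])
```

A set of CUIs is kept per string, rather than a single CUI, so that a string naming more than one concept is not silently overwritten.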



7. REDUNDANT INDEXING OF SUBCONCEPTS.



    CELLULAR BLUE NEVUS REDUNDANTLY INDEXED AS:


  • CELLULAR BLUE NEVUS (C0334448).

  • BLUE NEVUS (C0206736).

  • CELL (C0007634).

  • BLUE (C0332584).

  • NEVUS (C0027960).
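
One way to obtain this kind of redundant indexing is to look up every contiguous run of words in a phrase, longest first. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the term table holds just the five codes listed above, and the entry mapping the word CELLULAR to CELL (C0007634) is assumed to exist as a variant string.

```python
from typing import Dict, Iterator, List

# Toy term table built from the codes listed above.
TERMS: Dict[str, str] = {
    "CELLULAR BLUE NEVUS": "C0334448",
    "BLUE NEVUS": "C0206736",
    "CELLULAR": "C0007634",  # assumption: listed as a variant of CELL
    "BLUE": "C0332584",
    "NEVUS": "C0027960",
}


def subphrases(words: List[str]) -> Iterator[str]:
    """Yield every contiguous run of words, longest runs first."""
    count = len(words)
    for length in range(count, 0, -1):
        for start in range(count - length + 1):
            yield " ".join(words[start:start + length])


def index_phrase(phrase: str, terms: Dict[str, str] = TERMS) -> List[str]:
    """Return the code of every subphrase found in the term table."""
    codes: List[str] = []
    for candidate in subphrases(phrase.upper().split()):
        code = terms.get(candidate)
        if code is not None and code not in codes:
            codes.append(code)
    return codes


print(index_phrase("cellular blue nevus"))
```

A phrase of n words has n(n+1)/2 contiguous subphrases, so the number of lookups stays small for phrases of ordinary length.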



8. BARRIER WORD METHOD.




  • NATURAL-LANGUAGE MEDICAL TEXT: SEQUENCE OF MEDICAL CONCEPTS SEPARATED BY GRAMMATICAL OBJECTS.

  • GRAMMATICAL OBJECTS, OR BARRIER WORDS: NUMERALS, PUNCTUATION, SINGLE LETTERS, ARTICLES, PREPOSITIONS, COMMON VERBS AND MODIFIERS.

  • MEDICAL CONCEPTS, OR KEYWORDS: ONE-WORD OR MULTIPLE-WORD TERMS CONSISTING OF MEDICALLY SIGNIFICANT WORDS.
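
The splitting step can be sketched in a few lines of Python. This is an illustration, not the authors' M-language program: the barrier list below is a small hypothetical sample rather than the list actually used, and the function name keyword_phrases is invented for the example.

```python
import re
from typing import List, Set

# Hypothetical barrier list: a few articles, prepositions and common verbs.
BARRIER_WORDS: Set[str] = {
    "THIS", "IS", "AN", "THE", "OF", "BECAUSE", "HAS", "FROM", "INTO",
    "THAT", "ELSEWHERE", "AND", "WITH", "IN",
}


def keyword_phrases(text: str, barriers: Set[str] = BARRIER_WORDS) -> List[str]:
    """Return the runs of consecutive non-barrier words found in text."""
    # Tokens are alphabetic words, numerals, or single punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", text)
    phrases: List[str] = []
    current: List[str] = []
    for token in tokens:
        word = token.upper()
        # Numerals, punctuation, single letters and listed words are barriers.
        is_barrier = (not word.isalpha()) or len(word) == 1 or word in barriers
        if is_barrier:
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


legend = ("This lesion is an early compound nevus, because a nest "
          "has migrated from the epidermis into the dermis.")
print(keyword_phrases(legend))
```

Each returned phrase is a candidate for matching against the text field of the UMLS records; a multiple-word phrase such as EARLY COMPOUND NEVUS may match as a whole or through its parts.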



9. BARRIER WORD METHOD: SAMPLE TEXT.


    LENTIGINOUS COMPOUND NEVUS . this LESION is an EARLY COMPOUND NEVUS , because a NEST has MIGRATED from the EPIDERMIS into the DERMIS ( lower right of c ) . elsewhere , the HISTOLOGY is that of a SIMPLE LENTIGO .


  • barrier words displayed in lower case.

  • KEYWORDS DISPLAYED IN UPPER CASE.



    LEGEND NAME       UMLS CODE   UMLS NAME
    LENTIGINOUS       C0023321    Lentigo, NOS
    COMPOUND NEVUS    C0259781    Compound Nevus
    LESION            C0012634    Lesion, NOS
    EARLY             C0205085    Early
    COMPOUND NEVUS    C0259781    Compound Nevus
    NEST              C0205234    Focal
    MIGRATED          C0232902    Migration, NOS
    EPIDERMIS         C0014520    Epidermis, NOS
    DERMIS            C0011646    Dermis, NOS
    LOWER             C0205104    Inferior
    RIGHT             C0205090    Right
    HISTOLOGY         C0019638    Histologic
    SIMPLE LENTIGO    C0302255    Lentigo Simplex






10. AMBIGUITIES IN UMLS.




  • SIMILAR ENGLISH WORDS WITH DIFFERENT MEANING, DEPENDING ON CONTEXT.


  • IRIS (C0022077) AS PART OF EYE.

  • IRIS (C0331686) AS FLOWER.


  • IRIS (C0331686) AS FLOWER IS RETIRED.

  • ADNEXA WITHOUT NEARBY DISAMBIGUATING WORD:


  • SKIN ADNEXA (C0221943)

  • UTERINE ADNEXA (C0001575)

  • OCULAR ADNEXA (C0229243)
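
The poster does not say how such cases were handled. One possible approach, sketched here purely as an assumption, is to look for a disambiguating word within a few words of the ambiguous term and, when none is found, to report every candidate code and flag the term as ambiguous. The function name resolve_adnexa and the window of three words are hypothetical choices.

```python
from typing import Dict, List, Tuple

# Disambiguating word -> code, taken from the three ADNEXA concepts above.
ADNEXA_SENSES: Dict[str, str] = {
    "SKIN": "C0221943",
    "UTERINE": "C0001575",
    "OCULAR": "C0229243",
}


def resolve_adnexa(words: List[str], position: int,
                   window: int = 3) -> Tuple[List[str], bool]:
    """Resolve the word ADNEXA at words[position].

    Returns (codes, ambiguous). If exactly one sense is supported by a
    nearby word, that single code is returned and ambiguous is False.
    Otherwise all candidate codes are returned and ambiguous is True.
    """
    start = max(0, position - window)
    nearby = words[start:position] + words[position + 1:position + 1 + window]
    hits = {ADNEXA_SENSES[word] for word in nearby if word in ADNEXA_SENSES}
    if len(hits) == 1:
        return sorted(hits), False
    return sorted(ADNEXA_SENSES.values()), True


print(resolve_adnexa("TUMOR OF THE OCULAR ADNEXA".split(), 4))
print(resolve_adnexa("TUMOR INVOLVING ADNEXA".split(), 2))
```

Flagging the unresolved case, rather than guessing, leaves room for manual review of the affected image-legends.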



11. RESULTS.




  • 5,465 SEPARATE IMAGE-LEGEND-TEXTS WERE ASSIGNED UMLS-CODES.

  • EVERY IMAGE ASSIGNED THE UMLS CODES FOR PHOTOGRAPHY (C0441468) AND PATHOLOGY (C0030664).

  • PAPILLARY (C0205312): 166 IMAGE-LEGENDS (THYROID NEOPLASM, BREAST NEOPLASM, URINARY TRACT NEOPLASM, PAPILLARY FEATURE).

  • IRREGULAR (C0205271): 136 IMAGE-LEGENDS (GENERAL CONCEPT, NUCLEAR FEATURE, TUMOR BOUNDARY, CELLULAR DISTRIBUTION, ...).

  • SUCH MODIFIERS ARE NOT VERY SPECIFIC AS INDEXING TERMS.

  • MODIFIERS OCCUR AT HIGHER FREQUENCY THAN DIAGNOSES.

  • SPECIFIC TERMS: LOW FREQUENCY.

  • MALIGNANT MENINGIOMA (C0259785): NINE IMAGE-LEGENDS.

  • ROSAI DORFMAN DISEASE (C0019625): EIGHT IMAGE-LEGENDS.

  • DERMOID CYST (C0011649): EIGHT IMAGE-LEGENDS.

  • CHONDROBLASTOMA (C0008441): EIGHT IMAGE-LEGENDS.



12. DISCUSSION.




  • IMAGES ARE CONSTITUTIVELY NON-HIERARCHICAL.

  • EXAMPLE: MEDULLARY CARCINOMA OF THYROID (C0238462) JUSTIFIABLY FILED UNDER:


  • THYROID GLAND (C0040132), ORGAN OF ORIGIN.

  • TUMOR (C0027651)

  • TUMOR, MALIGNANT (C0006826)

  • C CELLS (C0229579), FROM WHICH IT ARISES.

  • MULTIPLE ENDOCRINE NEOPLASIA TYPE I SYNDROME (C0025267).


  • MUST BE DISTINGUISHED FROM: MEDULLARY CARCINOMA OF BREAST (C0206693).


  • ALPHABETIC SIMILARITY

  • NO PATHOGENETIC RELATIONSHIP



13. CONCLUSION.




  • NO HIERARCHICAL WAY OF ORGANIZING IMAGES.

  • IMAGE RETRIEVAL MECHANISM BY PATHOLOGIC CONCEPTS.

  • IMAGE RETRIEVAL SYSTEM USING ALL PATHOLOGY CONCEPTS WITHIN THE IMAGE IS ACHIEVABLE.

  • IMAGES AUTOMATICALLY UMLS-ENCODED FROM PRE-EXISTING TEXT DESCRIBING THE IMAGES.

  • AUTOMATIC UMLS ENCODING CAN USE THE ENTIRE UMLS NOMENCLATURE, OVER 600,000 DISTINCT CONCEPTS.

  • IMAGES STORED IN NON-HIERARCHICAL, NON-ORDERED FASHION.

  • ENCAPSULATION OF UMLS TERMS WITH IMAGES PERMITS LARGE MERGED IMAGE ARCHIVES.

  • IMAGE RETRIEVAL VIA UMLS-ENCODED INDEX MAY SUCCEED, EVEN WHEN A CHOSEN QUERY TERM IS NOT INCLUDED IN IMAGE-LEGEND.
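
The last point can be illustrated with a small inverted index from concept code to images. This Python sketch is not the authors' PERL display software: the image file names are hypothetical, the synonym table holds only a handful of strings, and the function names are invented for the example. A legend that mentions only the adrenal gland is still found by the query term suprarenal gland, because both strings carry the code C0001625.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Query string -> concept code (a tiny stand-in for the UMLS synonym file).
SYNONYMS: Dict[str, str] = {
    "ADRENAL GLAND": "C0001625",
    "SUPRARENAL GLAND": "C0001625",
    "GLANDULA SUPRARENALIS": "C0001625",
    "THYROID GLAND": "C0040132",
    "TUMOR": "C0027651",
}

# Hypothetical image files with the codes assigned from their legends.
IMAGE_CODES: Dict[str, Set[str]] = {
    "image_0001.jpg": {"C0001625", "C0027651"},
    "image_0002.jpg": {"C0040132", "C0027651"},
}


def build_inverted_index(image_codes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Turn image -> codes into code -> images."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image, codes in image_codes.items():
        for code in codes:
            index[code].add(image)
    return index


def retrieve(query: str, index: Dict[str, Set[str]],
             synonyms: Dict[str, str] = SYNONYMS) -> List[str]:
    """Return the images indexed under the concept named by query."""
    code = synonyms.get(query.upper())
    if code is None:
        return []
    return sorted(index.get(code, set()))


index = build_inverted_index(IMAGE_CODES)
print(retrieve("suprarenal gland", index))
print(retrieve("tumor", index))
```

Because the index is keyed by concept code rather than by legend wording, merging two archives amounts to merging their code-to-image tables.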



14. REFERENCES.



    1. UMLS Knowledge Sources. 9th edition. 1998. DOCUMENTATION. National Institutes of Health. National Library of Medicine. Bethesda, Maryland 20854.

    2. College of American Pathologists. Systematized Nomenclature of Human and Veterinary Medicine (SNOMED International). College of American Pathologists, Northfield, IL, 1993.

    3. Berman JJ, Moore GW.
    SNOMED-encoded surgical pathology databases: A tool for epidemiologic investigation.
    Mod Pathol. 1996 Sep;9(9):944-950.

    4. Silverberg SG.
    SNOMED-encoded surgical pathology databases: 's no big deal - or is it?
    Mod Pathol. 1996 Sep;9(9):953-954.

    5. Moore GW, Berman JJ.
    Automatic SNOMED coding.
    Proc Annu Symp Comput Appl Med Care. 1994;18:225-229.

    6. Moore GW, Berman JJ.
    Performance analysis of manual and automated systematized nomenclature of medicine (SNOMED) coding.
    Am J Clin Pathol. 1994 Mar;101(3):253-256.

    7. Berman JJ, Moore GW.
    Object-oriented controlled-vocabulary translator using TRANSOFT + HyperPAD.
    Proc Annu Symp Comput Appl Med Care. 1991;15:973-975.

    8. Berman JJ, Moore GW, Donnelly WH, Massey JK, Craig B.
    A SNOMED analysis of three years accessioned cases (40,124) of a surgical pathology department: implications for pathology-based demographic studies.
    Proc Annu Symp Comput Appl Med Care. 1994;18:188-192.

    9. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
    A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years.
    Arch Pathol Lab Med. 1996 Aug;120(8):782-785.

    10. Berman JJ, Moore GW, Hutchins GM.
    Internet autopsy database.
    Hum Pathol. 1997 Apr;28(4):393-394.

    11. Moore GW, Miller RE, Hutchins GM. Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the barrier word method. In: Scherrer JR, Côté RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. Amsterdam: North-Holland; pp 29-39, 1989.

    12. Murphy GF, Elder DA. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Non-Melanocytic Tumors of the Skin, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    13. Elder DA, Murphy GF. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Melanocytic Tumors of the Skin, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    14. Murphy WM, Beckwith JB, Farrow GM. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Kidney, Bladder and Related Urinary Structures, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    15. Rosai J, Carcangiu ML, DeLellis RA. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Thyroid Gland, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    16. DeLellis RA. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Parathyroid Gland, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    17. Kurman RJ, Norris HJ, Wilkinson EJ. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Cervix, Vagina, and Vulva, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    18. Silverberg SG, Kurman RJ. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Uterine Corpus and Gestational Trophoblastic Disease, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    19. Rosen PP, Oberman HA. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Mammary Gland, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    20. Burger PC, Scheithauer BW. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Central Nervous System, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    21. McLean EW, Burnier MN, Zimmerman LE, Jakobiec FA. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Eye and Ocular Adnexa, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    22. Colby TV, Koss MN, Travis WD. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Lower Respiratory Tract, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    23. Brunning RD, McKenna RW. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Bone Marrow, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.

    24. Fechner RE, Mills SE. Armed Forces Institute of Pathology Atlas of Tumor Pathology. Tumors of the Bones and Joints, Electronic Fascicle version 2.0. Washington, D.C. Armed Forces Institute of Pathology.



15. ZIPF DISTRIBUTION OF UMLS CUIS.



    RANK FREQUENCY UMLS-CODE UMLS-NAME
    1 5465 C0030664 PATHOLOGY
    2 5465 C0441468 PHOTOGRAPH
    3 2016 C0007634 CELL
    4 1812 C0441469 PICTURE
    5 1140 C0027651 NEOPLASM
    6 1102 C0024109 LUNG
    7 644 C0012634 DISEASE
    8 617 C0030705 PATIENT
    9 581 C0205165 SMALL
    10 569 C0205234 FOCAL
    11 549 C0205164 LARGE
    12 528 C0233426 APPEAR
    13 522 C0006141 BREAST
    14 487 C0205397 OBSERVE
    15 466 C0038128 STAIN
    16 458 C0150312 PRESENT
    17 421 C0015392 EYE
    18 413 C0445247 SAME
    19 408 C0010834 CYTOPLASM
    20 407 C0205392 SOME
    21 401 C0205182 ATYPICAL
    22 387 C0022646 KIDNEY
    23 375 C0205402 PROMINENT
    24 365 C0449774 PATTERN
    25 347 C0205091 LEFT
    26 341 C0040132 THYROID
    27 336 C0205160 NEGATIVE
    28 324 C0205090 RIGHT
    29 323 C0042149 UTERUS
    30 320 C0449470 TYPE
    31 316 C0007097 CANCER
    32 300 C0205172 MANY
    33 300 C0370003 SPECIMEN
    34 294 C0014609 EPITHELIUM
    35 292 C0262950 BONE
    36 285 C0332285 ARISING FROM
    37 285 C0444186 SMEAR
    38 279 C0005953 BONE MARROW
    39 275 C0017542 GIEMSA STAIN
    40 272 C0018964 HEMATOXYLIN
    41 270 C0205428 AFFECTING
    42 269 C0007874 CERVIX
    43 268 C0431085 TUMOR CELLS
    44 262 C0042591 VESSEL
    45 258 C0014448 EOSIN
    46 253 C0205250 ELEVATED
    47 249 C0024264 LYMPHOCYTE
    48 247 C0205308 OLD
    49 236 C0439508 YEAR
    50 234 C0392746 WELL

    FREQUENCY DISTRIBUTION OF THE 50 MOST FREQUENT UMLS CONCEPTS IN AFIP LEGEND-TEXTS.
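
A rank-frequency table of this kind can be produced by counting, for each concept code, the number of image-legends to which it was assigned. The Python sketch below rests on two assumptions that the poster does not state: that FREQUENCY counts legends rather than individual occurrences, and that ties are ordered by code. The three toy legends are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(legend_codes: Iterable[Iterable[str]]) -> List[Tuple[int, int, str]]:
    """Return (rank, frequency, code) rows, most frequent code first."""
    counts: Counter = Counter()
    for codes in legend_codes:
        # Assumption: a code is counted once per legend, however often it occurs.
        counts.update(set(codes))
    # Sort by descending frequency, then by code to break ties.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, frequency, code)
            for rank, (code, frequency) in enumerate(ordered, start=1)]


# Three invented legends, each already reduced to its list of codes.
toy_legends = [
    ["C0030664", "C0441468", "C0007634"],
    ["C0030664", "C0441468"],
    ["C0030664", "C0441468", "C0007634", "C0024109"],
]

for rank, frequency, code in rank_frequency(toy_legends):
    print(rank, frequency, code)
```

In the toy data, as in the table above, the two codes assigned to every legend share the top frequency, and more specific codes fall off quickly in rank.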