UMLS CONCORDANCE FOR HUMAN EMBRYOLOGY.

Gladys L. G. Alonsozana, MD [1,2],
G. William Moore, MD, PhD [1,2,3],
Grover M. Hutchins, MD [3].
From: Department of Pathology, Baltimore VA Medical Center [1], Baltimore, MD.
Department of Pathology, University of Maryland School of Medicine [2], Baltimore, MD.
Department of Pathology, The Johns Hopkins Medical Institutions [3], Baltimore, MD.

U. S. Government Work, published in:
the Johns Hopkins Autopsy Resource,
www.netautopsy.org





TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. DESIGN.
4. UNIFIED MEDICAL LANGUAGE SYSTEM.
5. REDUNDANT INDEXING OF SUBCONCEPTS.
6. BARRIER WORD METHOD.
7. BARRIER WORD METHOD: SAMPLE TEXT.
8. AMBIGUITIES IN UMLS.
9. RESULTS.
10. DISCUSSION.
11. FUTURE DIRECTIONS: SOURCES FOR SYNONYMS.
12. REFERENCES.
13. ZIPF DISTRIBUTION OF UMLS CUIS.
14. ZIPF DISTRIBUTION OF UNMATCHED WORDS.
15. ZIPF DISTRIBUTION OF COLLOCATIONS.


1. ABSTRACT.


Background. The Unified Medical Language System (UMLS) of the U. S. National Library of Medicine is the world's largest system of medical concepts, with over 700,000 concept unique identifiers (CUIs) and over 1.5 million synonyms in the year 2000 edition. Human embryology concepts are employed in describing the pathogenesis and morphology of many neoplasms and congenital malformations in pathology. We sought to determine the inclusiveness of the UMLS for concepts in human embryology.

Design. The complete text of Streeter's Developmental Horizons in Human Embryos, as well as related texts, was optically scanned and converted into plain-text files. A distribution of single words, as well as multiple-word terms (collocations), was obtained by the Barrier Word Method, a method employed in automated Medline indexing. Exact matches were made to the synonym field in the UMLS Metathesaurus, and additional approximate matches were obtained manually.

Results. The input text was 1.26 MB, consisting of 110,314 words, of which 9,087 were distinct. There were 5,323 (4.8%) misspellings, i.e., optical mistranslations. Words ranged in frequency from 10,394 occurrences of 'the' to 4,776 words occurring only once, an average of 110,314/9,087 = 12.1 occurrences per distinct word. There were 401 collocations with exact or approximate UMLS matches. Among correctly spelled words, there were 48,758 (46.4%) exact matches to a UMLS synonym and 46,250 (44.0%) additional, approximate matches to UMLS CUIs, leaving 9.5% of concepts unmatched.

Conclusion. Results suggest that the UMLS is a highly inclusive concept system for human embryology, with 90.5% of concepts in classic embryology references matched exactly or approximately. However, the UMLS is synonym-poor, and many synonyms must be added manually to accommodate embryology free text. With that enrichment, the UMLS appears to be a sufficiently rich concept system for inter-institutional exchange of embryology data.


2. INTRODUCTION.




  • UNIFIED MEDICAL LANGUAGE SYSTEM METATHESAURUS (UMLS-M) OF THE U. S. NATIONAL LIBRARY OF MEDICINE (USNLM).

  • MOST COMPREHENSIVE, PUBLICLY-AVAILABLE LIST OF STANDARDIZED MEDICAL TERMINOLOGY IN THE WORLD.

  • WHAT IS THE CONCORDANCE RATE FOR CLASSICAL EMBRYOLOGY TEXT?



3. DESIGN.




  • G. L. STREETER: DEVELOPMENTAL HORIZONS IN HUMAN EMBRYOS.
    HEUSER & CORNER: STAGE 9.
    O'RAHILLY: STAGES 1-8.

  • COMPUTER-ENCODED INTO UMLS, WITH ENRICHED SYNONYM LIST.

  • CONCORDANCE: MEDICALLY-SIGNIFICANT TERM, PRESENT IN THE TEXTS, ALSO CAPTURED BY ENCODING PROGRAM.

  • FALSE NEGATIVE: UMLS CONCEPT NOT PRESENT.

  • AMBIGUOUS TERMS AND COMPOUND TERMS CONTAINING SUBCONCEPTS INDEXED REDUNDANTLY.

  • BY DESIGN, ENCODER CAPTURED NO FALSE POSITIVES.



4. UNIFIED MEDICAL LANGUAGE SYSTEM (UMLS).




  • UNIFIED MEDICAL LANGUAGE SYSTEM (UMLS): DEVELOPED BY U.S. NATIONAL LIBRARY OF MEDICINE (USNLM) IN 1986.

  • PURPOSE: AID DEVELOPMENT OF SYSTEMS TO RETRIEVE ELECTRONIC BIOMEDICAL INFORMATION.

  • http://www.nlm.nih.gov/research/umls/

  • LAST UPDATED: January 1, 2000.

  • METATHESAURUS SIZE: 113,699,627 BYTES.

  • CONCEPT UNIQUE IDENTIFIERS (CUIs): 729,248, MAX=C0813178, RETIRED=83,930.

  • SYNONYMS: 1,598,176.

  • OVER 50 SOURCE-VOCABULARIES.

  • OVER 20 PARTIAL TRANSLATIONS INTO FOREIGN LANGUAGES.
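
The three CUI figures above appear to be related by simple subtraction. The following is a minimal check, assuming (this is not stated in the source) that CUIs are numbered consecutively from C0000001, so that the numeric part of the highest CUI counts every identifier ever issued. The helper name `cui_number` is illustrative only.

```python
def cui_number(cui: str) -> int:
    """Return the numeric part of a CUI such as 'C0813178'."""
    return int(cui[1:])

max_cui = "C0813178"   # highest CUI reported for the 2000 edition
retired = 83_930       # retired CUIs reported for the 2000 edition

# Assumption: every number from 1 to the maximum was issued exactly once.
active = cui_number(max_cui) - retired
print(f"{active:,}")   # 729,248
```

Under that assumption the result agrees with the 729,248 CUIs reported on the slide.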



5. REDUNDANT INDEXING OF SUBCONCEPTS.



    CELLULAR BLUE NEVUS REDUNDANTLY INDEXED AS:


  • CELLULAR BLUE NEVUS (C0334448).

  • BLUE NEVUS (C0206736).

  • CELL (C0007634).

  • BLUE (C0332584).

  • NEVUS (C0027960).
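
One way to picture redundant indexing is to look up every contiguous sub-phrase of a term, longest first. The sketch below is illustrative and is not the authors' encoder: the lookup table is a toy dictionary holding only the five CUIs listed above, and the mapping of the word "cellular" to CELL is an assumed lexical variant.

```python
# Toy synonym table. The CUIs are those listed above; treating "cellular"
# as a variant of CELL is an assumption made for this illustration.
SYNONYMS = {
    "cellular blue nevus": "C0334448",
    "blue nevus": "C0206736",
    "cellular": "C0007634",
    "cell": "C0007634",
    "blue": "C0332584",
    "nevus": "C0027960",
}

def redundant_index(term, synonyms=SYNONYMS):
    """Return (phrase, CUI) pairs for every sub-phrase found in the table."""
    words = term.lower().split()
    hits = []
    # Try the longest sub-phrases first, then progressively shorter ones.
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            phrase = " ".join(words[start:start + length])
            cui = synonyms.get(phrase)
            if cui is not None and (phrase, cui) not in hits:
                hits.append((phrase, cui))
    return hits

for phrase, cui in redundant_index("CELLULAR BLUE NEVUS"):
    print(f"{phrase.upper():>20}  {cui}")
```

The compound term is indexed under its own CUI and also under each subconcept, so a search for any of the five concepts retrieves it.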



6. BARRIER WORD METHOD.




  • NATURAL-LANGUAGE MEDICAL TEXT: SEQUENCE OF MEDICAL CONCEPTS SEPARATED BY GRAMMATICAL OBJECTS.

  • THE GRAMMATICAL OBJECTS, OR BARRIER WORDS: NUMERALS, PUNCTUATION, SINGLE LETTERS, ARTICLES, PREPOSITIONS, AND COMMON VERBS AND MODIFIERS.

  • MEDICAL CONCEPTS, OR KEYWORDS: ONE-WORD OR MULTIPLE-WORD TERMS CONSISTING OF MEDICALLY SIGNIFICANT WORDS.
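
The three points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the barrier list is a small hand-picked set, the function name `keyword_phrases` is hypothetical, and the tokenizer simply treats any non-letter character as a barrier.

```python
import re

# Illustrative barrier list only (articles, prepositions, common verbs and
# modifiers). The list actually used by the authors is not given in the text.
BARRIER_WORDS = {
    "the", "which", "heretofore", "have", "been", "relatively", "simple",
    "of", "intervening", "between", "and", "its", "at", "this", "are",
}

def keyword_phrases(text, barriers=BARRIER_WORDS):
    """Split text into keyword phrases, breaking at every barrier word."""
    # Tokens are runs of letters, or single non-letter characters
    # (numerals and punctuation), which always act as barriers.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    phrases, current = [], []
    for token in tokens:
        is_barrier = (
            not token.isalpha()           # numerals and punctuation
            or len(token) == 1            # single letters
            or token.lower() in barriers  # listed barrier words
        )
        if is_barrier:
            if current:                   # a barrier closes the open phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token.upper())
    if current:
        phrases.append(" ".join(current))
    return phrases

sample = ("the pharyngeal pouches, which heretofore have been relatively "
          "simple lateral expansions of gut epithelium intervening between "
          "the aortic arches")
print(keyword_phrases(sample))
```

With this barrier list the sample yields four keyword phrases: PHARYNGEAL POUCHES, LATERAL EXPANSIONS, GUT EPITHELIUM, and AORTIC ARCHES. Each phrase would then be looked up in the synonym list as a whole and, where useful, word by word.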



7. BARRIER WORD METHOD: SAMPLE TEXT.


    GUT TRACT and its DERIVATIVES . at this same time the PHARYNGEAL POUCHES , which heretofore have been relatively simple LATERAL EXPANSIONS of GUT EPITHELIUM intervening between the AORTIC ARCHES , are taking the form of SPECIALIZED STRUCTURES . one can RECOGNIZE the beginning TRANSFORMATION into an AUDITORY TUBE and TYMPANUM , also the PRIMORDIA of the THYMUS , LATERAL THYROID , and SUPERIOR and INFERIOR PARATHYROID GLANDS.


  • barrier words in lower case.

  • KEYWORDS IN UPPER CASE.


                   TEXT NAME             UMLS CUI
                   GUT TRACT    C0699818 C0332208*
                         and             C0332287*
                         its             C0027344*
                 DERIVATIVES             C0243070
                          at             C0332285*
                        this             C0205435*
                        same             C0445243
                        time             C0040213
                         the             C0205435*
          PHARYNGEAL POUCHES             C0231067*
                       which             C0043237*
                  heretofore             C0332152*
                        have             C0605770*
                        been             C0392148*
                  relatively             C0205345*
                      simple             C0205347
          LATERAL EXPANSIONS    C0205091 C0205229*
                          of             C0456627
              GUT EPITHELIUM    C0699818 C0014603
                 intervening             C0205102
                     between             C0205102
                         the             C0205435*
               AORTIC ARCHES             C0442005
                         are             C0392148*
                      taking      
                         the             C0205435*
                        form             C0376315
                          of             C0456627
      SPECIALIZED STRUCTURES    C0205548 C0678594*
                         one             C0205429
                         can             C0808716
                   RECOGNIZE             C0524637*
                         the             C0205435*
                   beginning             C0439657
              TRANSFORMATION             C0040682
                        into             C0332285
                          an             C0205447*
               AUDITORY TUBE    C0439822 C0175730
                         and             C0332287*
                    TYMPANUM             C0242251
                        also             C0332287*
                         the             C0205435*
                   PRIMORDIA             C0678727*
                          of             C0456627
                         the             C0205435*
                      THYMUS             C0496916
             LATERAL THYROID    C0205091 C0795756
                         and             C0332287*
                    SUPERIOR             C0205103
                         and             C0332287*
     INFERIOR PARATHYROID GLANDS C0678975 C0030518 C0225352
    






8. AMBIGUITIES IN UMLS.




  • ADNEXA WITHOUT NEARBY DISAMBIGUATING WORD:


  • SKIN ADNEXA (C0221943)

  • UTERINE ADNEXA (C0001575)

  • OCULAR ADNEXA (C0229243)
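
A simple way to handle this case is to look for a qualifying word near ADNEXA and, when none is found, to index the term under all three candidates, in keeping with the redundant indexing of ambiguous terms described in the design. The sketch below is hypothetical: the function name, the qualifier words, and the two-word window are assumptions, and only the three CUIs come from the list above.

```python
# Candidate senses of ADNEXA, keyed by an assumed disambiguating word.
ADNEXA_SENSES = {
    "skin": "C0221943",
    "uterine": "C0001575",
    "ocular": "C0229243",
}

def index_adnexa(words, position, window=2):
    """Return the CUI(s) for the word ADNEXA found at words[position]."""
    lo = max(0, position - window)
    nearby = [w.lower() for w in words[lo:position + window + 1]]
    matched = [cui for qualifier, cui in ADNEXA_SENSES.items()
               if qualifier in nearby]
    # No disambiguating word nearby: keep every candidate sense.
    return matched or list(ADNEXA_SENSES.values())

print(index_adnexa("the uterine adnexa were normal".split(), 2))
print(index_adnexa("the adnexa were normal".split(), 1))
```

The first call resolves to the single uterine sense; the second returns all three CUIs, which avoids a false negative at the cost of two extra index entries.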



9. RESULTS.




  • INPUT TEXT: 1.26 MB.

  • 110,314 WORDS, 9,087 DISTINCT WORDS.

  • 5,323 (4.8%) MISSPELLINGS (OPTICAL MISTRANSLATIONS).

  • WORDS RANGED IN FREQUENCY FROM 10,394 OCCURRENCES OF 'THE', TO 4,776 WORDS OCCURRING ONLY ONCE.

  • AVERAGE: 12.1 = 110,314/9,087 OCCURRENCES PER WORD.

  • 401 COLLOCATIONS WITH EXACT OR APPROXIMATE UMLS MATCHES.

  • AMONG CORRECTLY SPELLED WORDS: 48,758 (46.4%) EXACT MATCHES TO A UMLS SYNONYM; 46,250 (44.0%) ADDITIONAL, APPROXIMATE MATCHES TO UMLS CUIS.

  • 9.5% UNMATCHED CONCEPTS.
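
The percentages above can be reproduced from the raw counts. The following worked check assumes that "correctly spelled words" means all words minus the misspellings; that denominator is an inference from the figures, not something the text states explicitly.

```python
total_words = 110_314
distinct_words = 9_087
misspelled = 5_323
exact = 48_758
approximate = 46_250

# Assumed denominator for the match rates: words that were not misspelled.
correct = total_words - misspelled          # 104,991
unmatched = correct - exact - approximate   # 9,983

def pct(part, whole):
    """Percentage of whole represented by part (unrounded)."""
    return 100 * part / whole

print(f"average occurrences per distinct word: {total_words / distinct_words:.2f}")
print(f"misspellings:        {pct(misspelled, total_words):.2f}%")
print(f"exact matches:       {pct(exact, correct):.2f}%")
print(f"approximate matches: {pct(approximate, correct):.2f}%")
print(f"unmatched:           {pct(unmatched, correct):.2f}%")
```

Under this assumption each computed value agrees with the reported figure to within rounding, and the exact and approximate matches together account for about 90.5% of correctly spelled words.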



10. DISCUSSION.




  • 9.5% UNMATCHED CONCEPTS.

  • FALSE-NEGATIVE UMLS CONCEPTS TENDED TO BE DESCRIPTIVE TERMS IN EMBRYOLOGY THAT CHARACTERIZE MICROSCOPIC FINDINGS.

  • UMLS: NEARLY-COMPREHENSIVE METATHESAURUS FOR EMBRYOLOGY TEXT.



11. FUTURE DIRECTIONS: SOURCES FOR SYNONYMS.




  • LEXICAL VARIANTS: NUCLEI ==> CELL NUCLEUS.

  • OBVIOUS SYNONYMS: CLUSTER ==> AGGREGATE.

  • OBVIOUS MISSPELLINGS: WILM'S ==> WILMS'.
    BRONCHITS ==> BRONCHITIS.

  • OBVIOUS CONTRACTIONS: ADDISON ==> ADDISON'S DISEASE.
    CUSHING ==> CUSHING'S DISEASE.
    SQUAMOUS ==> SQUAMOUS CELL.

  • COMPOUNDS: WITHOUT ==> NEGATIVE-WITH.
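
The mappings above amount to a local synonym table applied before the exact-match lookup. The following is a minimal sketch of that idea, assuming a plain dictionary keyed by the free-text form; the names `LOCAL_SYNONYMS` and `normalize` are illustrative, and the entries are exactly the examples listed above.

```python
# Free-text form -> preferred form to look up in the UMLS synonym list.
LOCAL_SYNONYMS = {
    "NUCLEI": "CELL NUCLEUS",          # lexical variant
    "CLUSTER": "AGGREGATE",            # obvious synonym
    "WILM'S": "WILMS'",                # obvious misspelling
    "BRONCHITS": "BRONCHITIS",         # obvious misspelling
    "ADDISON": "ADDISON'S DISEASE",    # obvious contraction
    "CUSHING": "CUSHING'S DISEASE",    # obvious contraction
    "SQUAMOUS": "SQUAMOUS CELL",       # obvious contraction
    "WITHOUT": "NEGATIVE-WITH",        # compound
}

def normalize(term):
    """Map a free-text term to its preferred form, if one is listed."""
    key = term.strip().upper()
    return LOCAL_SYNONYMS.get(key, key)

print(normalize("nuclei"))   # CELL NUCLEUS
print(normalize("thymus"))   # THYMUS (unchanged, no local entry)
```

Keeping the table separate from the Metathesaurus lets the added synonyms be reviewed and reused without altering the UMLS files themselves.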



12. REFERENCES.



  • 1. UMLS Knowledge Sources. 11th edition. 2000. Documentation. National Institutes of Health. National Library of Medicine. Bethesda, Maryland 20854.

  • 2. College of American Pathologists. Systematized Nomenclature of Human and Veterinary Medicine (SNOMED International). College of American Pathologists, Northfield, IL, 1993.

  • 3. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
    A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years.
    Arch Pathol Lab Med. 1996 Aug;120(8):782-785.

  • 4. Moore GW, Miller RE, Hutchins GM. Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the barrier word method.
    In: Scherrer JR, Côté RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. Amsterdam: North-Holland; pp 29-39, 1989.

  • 5. Streeter GL.
    Developmental Horizons in Human Embryos. Age Groups XI to XXIII.
    Carnegie Institution of Washington, Lord Baltimore Press, Washington, D.C., 1951.

  • 6. Streeter GL.
    Developmental horizons in human embryos.
    Description of age group XI, 13 to 20 somites, and age group XII, 21 to 29 somites.
    Contrib Embryol Carnegie Inst Wash. 1942; 30:211-245.

  • 7. Streeter GL.
    Developmental horizons in human embryos. Description of age group XIII, embryos about 4 or 5 millimeters long, and age group XIV, period of indentation of the lens vesicle.
    Contrib Embryol Carnegie Inst Wash. 1945; 31:27-63.

  • 8. Streeter GL.
    Developmental horizons in human embryos. Description of age groups XV, XVI, XVII, and XVIII, being the third issue of a survey of the Carnegie Collection.
    Contrib Embryol Carnegie Inst Wash. 1948; 32:133-203.

  • 9. Heuser CH, Corner GW.
    Developmental horizons in human embryos. Description of age group X, 4 to 12 somites.
    Contrib Embryol Carnegie Inst Wash. 1957; 36:29-39.

  • 10. O'Rahilly R.
    Developmental Stages in Human Embryos, Including a Survey of the Carnegie Collection. Part A. Embryos of the First Three Weeks (Stages 1 to 9).
    Washington, D.C., Carnegie Institution of Washington, 1973.

  • 11. The Johns Hopkins Autopsy Resource: http://www.netautopsy.org/

  • 12. Netembryo. http://www.netembryo.org/



13. ZIPF DISTRIBUTION OF UMLS CUIS.



       RANK  FREQUENCY                WORD        UMLS CUI 
          1     10,394                 the        C0205435*
          2      5,441                  of        C0456627
          3      3,574                  in        C0439203
          4      3,123                 and        C0332287*
          5      1,982                  to        C0332285*
          6      1,947                  is        C0441912
          7      1,042                that        C0205435*
          8        959                 are        C0392148*
          9        947                  by        C0336807
         10        919                  as        C0003818
         11        919                  be        C0014121
         12        903                  it        C0027361*
         13        756                this        C0205435*
         14        695                from        C0332285*
         15        602               which        C0043237*
         16        597                  mm        C0439266
         17        560                with        C0332287
         18        545                  at        C0332285*
         19        541             embryos        C0013935
         20        534               cells        C0007625
         21        533                  no        C0205160*
         22        515                 age        C0001774
         23        505              embryo        C0013932
         24        438               group        C0441832
         25        422               stage        C0684248
         26        417                 one        C0205429
         27        404                  an        C0205447*
         28        403                 for        C0521117
         29        389                 its        C0027344*
         30        388                  or        C0332270*
         31        372                 has        C0605674
         32        348                  on        C0332285*
         33        339                 not        C0205160*
         34        330                been        C0392148*
         35        302               these        C0205392*
         36        296                form        C0376315
         37        295                more        C0205171
         38        294                 can        C0808716
         39        294                 fig        C0349932
         40        288               human        C0020102
         41        287                have        C0605770*
         42        269           embryonic        C0521444
         43        269               plate        C0005971
         44        268           specimens        C0370003*
         45        247              figure        C0441469*
         46        240                into        C0332285
         47        238               their        C0027361*
         48        234                 was        C0392148*
         49        233           primitive        C0033153*
         50        230               shown        C0332265*
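
A rank-frequency table of this kind can be produced with a word counter. The sketch below is an illustration, not the program used for the table above: the tokenizer keeps only runs of letters, and ties are broken alphabetically, which is an assumption that happens to match the order of tied rows in the table (for example "as" before "be").

```python
import re
from collections import Counter

def rank_frequency(text, top=50):
    """Return (rank, frequency, word) rows, most frequent word first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, freq, word)
            for rank, (word, freq) in enumerate(ranked[:top], start=1)]

demo = "the embryo and the yolk sac of the embryo"
for rank, freq, word in rank_frequency(demo, top=3):
    print(f"{rank:>4} {freq:>10,} {word:>19}")
```

On the short demonstration string the first two rows are "the" (3 occurrences) and "embryo" (2 occurrences); running the same function over the full scanned text would give the RANK, FREQUENCY, and WORD columns, with the CUI column added by a separate lookup.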
    




14. ZIPF DISTRIBUTION OF UNMATCHED WORDS.



       RANK  FREQUENCY                WORD
          1        101               stalk      
          2         52                 way      
          3         48                anat      
          4         48                germ      
          5         46                 pit      
          6         44                wash      
          7         43          prechordal      
          8         43               until      
          9         42               could      
         10         42               pairs      
         11         39               order      
         12         37            presumed      
         13         36                ones      
         14         36               taken      
         15         35                bars      
         16         34                cord      
         17         33             profile      
         18         32                come      
         19         32               shell      
         20         31                free      
         21         30             chordal      
         22         30             example      
         23         29             details      
         24         29             passage      
         25         29               polar      
         26         28           neuropore      
         27         28             sharply      
         28         27            consists      
         29         27                 how      
         30         27       intercellular      
         31         27             lacunae      
         32         27               takes      
         33         27               tubal      
         34         26            epiblast      
         35         26             instead      
         36         25              detail      
         37         25               folds      
         38         25                just      
         39         25                meso      
         40         25              partly      
         41         24          gelatinous      
         42         24              manner      
         43         24               owing      
         44         23               cited      
         45         23         conspicuous      
         46         23               quite      
         47         22               field      
         48         22          particular      
         49         22        particularly      
         50         22              proper      
    




15. ZIPF DISTRIBUTION OF COLLOCATIONS.



       RANK  FREQUENCY                   TERM       UMLS CUI 
          1      2,613                 of the       C0332285*
          2      1,037                 in the       C0332285*
          3        346               from the       C0332285*
          4        267              age group       C0596048
          5        147               has been       C0392148*
          6        141                of this       C0332285*
          7        127                for the       C0521125*
          8        123               yolk sac       C0042893
          9        110         embryonic disc       C0231003
         10        100       primitive streak       C0033153
         11         93               into the       C0332285*
         12         78              have been       C0392148*
         13         78               there is       C0332287*
         14         69            through the       C0332273*
         15         59       chorionic cavity       C0230966
         16         58            between the       C0205103*
         17         55               of these       C0332285*
         18         49              there are       C0332287*
         19         47             age groups       C0027362
         20         38         nervous system       C0027763
         21         36            neural tube       C0231024
         22         33          sinus venosus       C0231084
         23         32              the other       C0205394*
         24         30 central nervous system       C0007679
         25         29        chorionic villi       C0008508
         26         29             within the       C0332285*
         27         28          blood vessels       C0005847
         28         27                stage 5       C0441777
         29         27         zona pellucida       C0043519
         30         26            referred to       C0205543
         31         24            in addition       C0332287*
         32         24               over the       C0205136*
         33         22       cloacal membrane       C0231056
         34         22        vascular system       C0489903
         35         20              along the       C0205428*
         36         19                 due to       C0678226
         37         19         in addition to       C0332287
         38         18                site of       C0449643
         39         18                stage 2       C0441767
         40         18                stage 3       C0441771
         41         17              about the       C0475806*
         42         17              under the       C0542339*
         43         16        amniotic cavity       C0230976
         44         16             as well as       C0332287*
         45         16            rather than       C0489693*
         46         15            blood cells       C0005773
         47         14           chick embryo       C0008046
         48         14             germ cells       C0017471
         49         14               in vitro       C0021135
         50         14                of that       C0332285*
    




    Last Revised: 10/23/2000 by G. William Moore.