WEB-BASED FREE-TEXT QUERY SYSTEM
FOR SURGICAL PATHOLOGY REPORTS WITH
AUTOMATIC CASE DE-IDENTIFICATION.


Robert E. Miller, MD [1].
John K. Boitnott, MD [1].
G. William Moore, MD, PhD [1,2,3].
From: Departments of Pathology, The Johns Hopkins Medical Institutions, Baltimore, MD [1];
Baltimore VA Maryland Health Care System, Baltimore, MD [2];
and University of Maryland School of Medicine, Baltimore, Maryland [3].

U. S. Government Work, published in:
the Johns Hopkins Autopsy Resource,
www.netautopsy.org



TABLE OF CONTENTS.


1. ABSTRACT.
2. INTRODUCTION.
3. PATIENT DEMOGRAPHICS.
4. ANNUAL DISTRIBUTION OF CASES.
5. DISTRIBUTION OF ORGAN SYSTEMS.
6. DE-IDENTIFIERS.
7. FUTURE DIRECTIONS.
8. REFERENCES.



1. ABSTRACT.

Background. There is increasing interest in inter-institutional sharing of free-text surgical pathology reports. Before such reports can be shared, however, the proper names (providers, institutions) that sometimes appear in their native text must be de-identified.

Design. Free-text surgical pathology reports at The Johns Hopkins Hospital are indexed and available to hospital staff. Proper names in the free-text database were identified either from available lists of persons, places, and institutions, or by their proximity to keywords such as 'Dr.' or 'hospital'. The free text was parsed, and each proper name was replaced with a suitable token before display on the web-based query system.

Results. On June 1, 2000, the Johns Hopkins surgical pathology database index contained 159,071 patients, 361,957 surgical pathology cases, and 694,443 surgical pathology specimens. Age/sex demographics were complete for 99.3% of patients, with 60.1% females and 39.2% males, and a predominance of patients in the fourth (15.3%), fifth (14.2%), sixth (14.7%), and seventh (16.0%) decades. Race/ethnicity data were available on 77.3% of patients; of all patients, 50.6% were recorded as white and 24.1% as African-American. In the most recent complete year (1999), there were 29,596 cases and 61,531 specimens. Organ systems represented in the web-indexed database included: gastrointestinal, 28.7%; lymphoreticular, 15.1%; gynecologic, 14.0%; bone, 7.1%; and breast, 5.8%. Proper names were found in 23,911 (6.6%) cases and were tokenized by the de-identification system.

Conclusion. This study demonstrates that the proper names (providers, institutions) sometimes included in free-text surgical pathology reports can be de-identified automatically.



2. INTRODUCTION.




  • INCREASING INTEREST IN PUBLIC INDEXES OF SURGICAL PATHOLOGY REPORTS.

  • MOST SURGICAL PATHOLOGY REPORTS EXIST AS UNEDITED FREE-TEXT.

  • SCRUBBING REPORTS OF PROPER NAMES.

  • AUTOMATED TRANSLATION OF SURGICAL PATHOLOGY REPORTS INTO STANDARDIZED LANGUAGES, SUCH AS UMLS.

  • FREE-TEXT REPORTS INDEXED AND AVAILABLE TO JH HOSPITAL STAFF.

  • PROPER NAMES IDENTIFIED FROM LISTS OF PERSONS, PLACES, AND INSTITUTIONS.

  • PROPER NAMES IDENTIFIED BY PROXIMITY TO KEYWORDS, SUCH AS 'DR.' OR 'HOSPITAL'.

  • PROPER NAMES SUBSTITUTED WITH SUITABLE TOKEN.

  • DISPLAY ON THE WEB-BASED QUERY SYSTEM.
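
The name-finding steps listed above can be illustrated with a minimal Python sketch. It is not the Johns Hopkins implementation, which this document does not describe in detail. The name list `KNOWN_NAMES`, the token `[NAME]`, and both regular expressions are assumptions made only for illustration.

```python
import re

# Hypothetical name list. The source says only that lists of persons,
# places, and institutions were used; these entries are invented.
KNOWN_NAMES = {"SMITH", "JONES"}
TOKEN = "[NAME]"  # hypothetical substitution token

# Keyword proximity: a capitalized word right after the title "Dr." ...
AFTER_TITLE = re.compile(r"\b(Dr\.?)\s+([A-Z][A-Za-z'-]+)")
# ... or one or more capitalized words right before the keyword "Hospital".
BEFORE_HOSPITAL = re.compile(r"\b((?:[A-Z][A-Za-z'-]+\s+)+)(Hospital)\b")
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")


def deidentify(text: str) -> str:
    """Replace suspected proper names in free text with TOKEN."""
    # Pass 1: names found by proximity to keywords (keyword itself is kept).
    text = AFTER_TITLE.sub(lambda m: f"{m.group(1)} {TOKEN}", text)
    text = BEFORE_HOSPITAL.sub(lambda m: f"{TOKEN} {m.group(2)}", text)

    # Pass 2: names found by lookup in the name list (case-insensitive).
    def scrub(m: re.Match) -> str:
        word = m.group(0)
        return TOKEN if word.upper() in KNOWN_NAMES else word

    return WORD.sub(scrub, text)


if __name__ == "__main__":
    report = ("Slides reviewed by Dr. Quimby at Mercy General Hospital; "
              "patient referred by Smith.")
    print(deidentify(report))
```

Patterns this simple over-scrub: a sentence beginning "The Hospital", for example, would have "The" tokenized. That is one reason the name lists need ongoing maintenance.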




3. PATIENT DEMOGRAPHICS.



  • JUNE 1, 2000: DEMOGRAPHIC AND LINGUISTIC CONTENT
    OF JOHNS HOPKINS SURGICAL PATHOLOGY (JH-SP) DATABASE.
                      FEMALE       MALE    UNKNOWN     TOTAL  % PATIENTS
        0-9 years      3,763      5,906         13     9,682        6.1%
      10-19 years      6,409      2,841          7     9,257        5.8%
      20-29 years     17,318      3,341         16    20,675       13.0%
      30-39 years     18,743      5,618         13    24,374       15.3%
      40-49 years     15,149      7,405         26    22,580       14.2%
      50-59 years     12,269     11,057         18    23,344       14.7%
      60-69 years     10,873     14,501         26    25,400       16.0%
      70-79 years      8,198      9,377         14    17,589       11.1%
      80-89 years      2,650      2,239          9     4,898        3.1%
      90-99 years        225        121          0       346        0.2%
    100-109 years          5          2          0         7        0.0%
      Unknown age                              919       919        0.6%
            Total     95,602     62,408      1,061   159,071      100.1%
    % of patients      60.1%      39.2%       0.7%
    


  • RACE/ETHNICITY DATA AVAILABLE ON 122,946/159,071 = 77.3% OF PATIENTS.
  • BLACK: 38,341 (31.2% OF PATIENTS WITH DATA; 24.1% OF ALL PATIENTS).
  • WHITE: 80,524 (65.5% OF PATIENTS WITH DATA; 50.6% OF ALL PATIENTS).
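
The race/ethnicity percentages on this page (31.2% and 65.5%) and those in the Abstract (24.1% and 50.6%) differ only in the denominator: patients with recorded data versus all patients. The short check below recomputes both from the counts given above. The helper name `pct` is an assumption for illustration, and the "remaining" count is simple subtraction, not a figure reported in the source.

```python
# Counts taken from this page.
TOTAL_PATIENTS = 159_071
WITH_RACE_DATA = 122_946
BLACK = 38_341
WHITE = 80_524


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print("Data available:", pct(WITH_RACE_DATA, TOTAL_PATIENTS))
    for label, n in (("Black", BLACK), ("White", WHITE)):
        print(label,
              "| of patients with data:", pct(n, WITH_RACE_DATA),
              "| of all patients:", pct(n, TOTAL_PATIENTS))
    # Patients with recorded data who fall in neither listed group
    # (derived by subtraction; not broken down in the source).
    print("Remaining with data:", WITH_RACE_DATA - BLACK - WHITE)
```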



4. ANNUAL DISTRIBUTION OF CASES.


     YEAR     CASES  SPECIMENS
     1984    14,942     22,112
     1985    18,969     29,846
     1986    19,046     31,835
     1987    19,051     33,133
     1988    19,705     35,437
     1989    20,253     37,166
     1990    21,052     39,274
     1991    21,645     40,766
     1992    22,003     43,660
     1993    21,006     43,180
     1994    21,351     43,967
     1995    22,139     44,974
     1996    23,174     47,576
     1997    26,502     54,824
     1998    27,528     57,197
     1999    29,596     61,531
     2000    13,995     27,965  (partial year; index as of June 1, 2000)
    Total   361,957    694,443
    




5. DISTRIBUTION OF ORGAN SYSTEMS.



    ORGAN SYSTEMS REPRESENTED AMONG THE 361,957 CASES.
         ORGAN SYSTEM        CASES  PERCENT
     Gastrointestinal      103,819    28.7%
      Lymphoreticular       54,597    15.1%
          Gynecologic       50,579    14.0%
                 Bone       25,578     7.1%
               Breast       20,939     5.8%
         Dermatologic       20,747     5.7%
            Obstetric       19,167     5.3%
        Genitourinary       18,916     5.2%
                Blood       17,787     4.9%
               Marrow       16,576     4.6%
                Heart       14,490     4.0%
                 Lung        8,015     2.2%
                  CNS        6,320     1.7%
        Neuromuscular        4,789     1.3%
            Endocrine        3,288     0.9%
    




6. DE-IDENTIFIERS.




  • 23,911 (6.6%) OF 361,957 CASES CONTAINED PROPER NAMES.

  • THESE NAMES WERE TOKENIZED BY THE DE-IDENTIFICATION SOFTWARE.




7. FUTURE DIRECTIONS.




  • DEMONSTRATION THAT FREE-TEXT SURGICAL PATHOLOGY REPORTS CAN BE DE-IDENTIFIED.

  • ALL NAMES AND ALL MISSPELLINGS MUST BE CAPTURED AND INCLUDED IN THE DICTIONARY; THIS CALLS FOR A "DICTIONARY POLICEPERSON".

  • EPONYMOUS DISEASES ARE MANAGED AS MULTIPLE-WORD TERMS, SO THAT THE TERM IS KEPT WHILE THE NAME ALONE IS TOKENIZED:
    DR. BARRETT (NAME) VS. BARRETT ESOPHAGUS (TERM).
    DR. CLARK (NAME) VS. CLARK LEVEL (TERM).

  • NO PROTECTION AGAINST A VERY UNUSUAL COMBINATION OF DISEASES THAT MIGHT BE KNOWN PUBLICLY.
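
The eponym handling listed above (a surname is scrubbed when it stands alone but kept when it is part of a multiple-word disease term) could be sketched as follows. This is an illustrative assumption, not the authors' code: the lists `EPONYM_TERMS` and `KNOWN_NAMES`, the token `[NAME]`, and the function names are all hypothetical.

```python
import re

# Hypothetical dictionaries built from the two examples on this page.
EPONYM_TERMS = ["BARRETT ESOPHAGUS", "CLARK LEVEL"]
KNOWN_NAMES = {"BARRETT", "CLARK"}
TOKEN = "[NAME]"  # hypothetical substitution token

# Match whole multiple-word terms, longest first, ignoring case.
EPONYM_RE = re.compile(
    r"\b(?:" + "|".join(
        re.escape(t) for t in sorted(EPONYM_TERMS, key=len, reverse=True)
    ) + r")\b",
    re.IGNORECASE,
)
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def _scrub_words(segment: str) -> str:
    """Tokenize any single word that appears in the name list."""
    return WORD_RE.sub(
        lambda m: TOKEN if m.group(0).upper() in KNOWN_NAMES else m.group(0),
        segment,
    )


def scrub_names(text: str) -> str:
    """Scrub names, but copy multiple-word eponym terms through unchanged."""
    out, pos = [], 0
    for m in EPONYM_RE.finditer(text):
        out.append(_scrub_words(text[pos:m.start()]))  # text before the term
        out.append(m.group(0))                         # the protected term
        pos = m.end()
    out.append(_scrub_words(text[pos:]))               # text after last term
    return "".join(out)


if __name__ == "__main__":
    print(scrub_names("Barrett esophagus, biopsy reviewed with Dr. Barrett."))
    print(scrub_names("Clark level III melanoma; discussed with Dr. Clark."))
```

Variant spellings such as possessive forms are not matched by this sketch and would need their own dictionary entries, in line with the point above that all names and misspellings must be captured in the dictionary.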




8. REFERENCES.



          1. Giere W.
    Foundations of clinical data automation in cooperative programs.
    Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

          2. Hutchins GM, Berman JJ, Moore GW, Hanzlick R, the Autopsy Committee of the College of American Pathologists.
    Practice Guidelines for Autopsy Pathology.
    Arch Pathol Lab Med. 1999;123:1085-1092.

          3. Joseph DM, Wong RL.
    Correction of misspellings and typographical errors in a free-text medical English information storage and retrieval system.
    Methods Inf Med. 1979 Oct;18(4):228-234.

          4. Manning CD, Schuetze H.
    Foundations of Statistical Natural Language Processing.
    Cambridge, MA: The MIT Press. ISBN: 0262133601. 2000.

          5. Masarie FE Jr, Miller RA, Bouhaddou O, Giuse NB, Warner HR.
    An Interlingua for Electronic Interchange of Medical Information: Using Frames to Map Between Clinical Vocabularies.
    Comput Biomed Res. 1991;24(4):379-400.

          6. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
    Integrated anatomic pathology reporting system using natural language diagnoses.
    Modern Pathol 1988;1:44-50.

          7. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
    A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years.
    Arch Pathol Lab Med. 1996;120:782-785.

          8. Moore GW, Berman JJ.
    Anatomic Pathology Data Mining.
    In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery. Heidelberg: Springer Verlag. 2000 (in press).

          9. Taylor M, Saltz J, Nichols JH.
    Design of an Integrated Clinical Data Warehouse.
    J Assn Lab Automation. 2000 (in press).

          10. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
    Barrier word method for detecting molecular biology multiple word terms.
    Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.

          11. U. S. National Library of Medicine.
    Unified Medical Language System.
    http://www.nlm.nih.gov/research/umls/

          12. U. S. National Library of Medicine.
    UMLS Knowledge Sources. Eleventh Edition. Unified Medical Language System.
    U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2000.

          13. Wilbur WJ.
    Overview of Books at NCBI.
    http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

          14. Wong RL, Gaynon P.
    An automated parsing routine for diagnostic statements of surgical pathology reports.
    Methods Inf Med. 1971 Jul;10(3):168-175.

          15. Wong RL, Reno JD, Hain TC, Platt RC, Gaynon PS, Joseph DM.
    Profile of a dictionary compiled from scanning over one million words of surgical pathology narrative text.
    Comput Biomed Res. 1980 Aug;13(4):382-398.



    Last Revised: 10/24/2000 by G. William Moore.