WEB-BASED FREE-TEXT QUERY SYSTEM
FOR SURGICAL PATHOLOGY REPORTS WITH
AUTOMATIC CASE DE-IDENTIFICATION.


Robert E. Miller, MD [1].
John K. Boitnott, MD [1].
Lawrence A. Brown, MD [2,3].
G. William Moore, MD, PhD [1,2,3].
From: Departments of Pathology, The Johns Hopkins Medical Institutions [1], Baltimore, MD;
Baltimore VA Maryland Health Care System [2], Baltimore, MD;
and University of Maryland School of Medicine [3], Baltimore, MD.

U. S. Government Work, published in:
the Johns Hopkins Autopsy Resource,
www.netautopsy.org



TABLE OF CONTENTS.


1. INTRODUCTION.
2. PATIENT DEMOGRAPHICS.
3. ANNUAL DISTRIBUTION OF CASES.
4. DISTRIBUTION OF ORGAN SYSTEMS.
5. DE-IDENTIFIERS.
6. RAW WORD COUNTS.
7. UMLS PART-OF-SPEECH LIST.
8. DISCOVERY METHODS.
9. BACKUS NAUR PARSING MODEL.
10. FREQUENCY DISTRIBUTION OF BARRIER WORDS.
11. FREQUENCY DISTRIBUTION OF COLLOCATIONS.
12. ZIPF GRAMMAR.
13. ZIPF DISTRIBUTION OF BACKUS NAUR FORMS.
14. DATABASE DEFINITION.
15. SET THEORY DEFINITION.
16. TRUTH TABLE.
17. n-POSTING.
18. ALGORITHM.
19. RESULTS.
20. DISCUSSION.
21. DISCUSSION.
22. REFERENCES.



1. INTRODUCTION.




  • INCREASING INTEREST IN PUBLIC INDEXES OF SURGICAL PATHOLOGY REPORTS.

  • MOST SURGICAL PATHOLOGY REPORTS EXIST AS UNEDITED FREE-TEXT.

  • SCRUBBING REPORTS OF PROPER NAMES.

  • AUTOMATED TRANSLATION OF SURGICAL PATHOLOGY REPORTS INTO STANDARDIZED LANGUAGES, SUCH AS UMLS.

  • FREE-TEXT REPORTS INDEXED AND AVAILABLE TO JH HOSPITAL STAFF.

  • PROPER NAMES IDENTIFIED FROM LISTS OF PERSONS, PLACES, AND INSTITUTIONS.

  • PROPER NAMES IDENTIFIED BY PROXIMITY TO KEYWORDS, SUCH AS 'DR.' OR 'HOSPITAL'.

  • PROPER NAMES SUBSTITUTED WITH SUITABLE TOKEN.

  • DISPLAY ON THE WEB-BASED QUERY SYSTEM.




    2. PATIENT DEMOGRAPHICS.



  • JUNE 1, 2000: DEMOGRAPHIC AND LINGUISTIC CONTENT
    OF JOHNS HOPKINS SURGICAL PATHOLOGY (JH-SP) DATABASE.
                      FEMALE       MALE    UNKNOWN     TOTAL  % PATIENTS
        0-9 years      3,763      5,906         13     9,682        6.1%
      10-19 years      6,409      2,841          7     9,257        5.8%
      20-29 years     17,318      3,341         16    20,675       13.0%
      30-39 years     18,743      5,618         13    24,374       15.3%
      40-49 years     15,149      7,405         26    22,580       14.2%
      50-59 years     12,269     11,057         18    23,344       14.7%
      60-69 years     10,873     14,501         26    25,400       16.0%
      70-79 years      8,198      9,377         14    17,589       11.1%
      80-89 years      2,650      2,239          9     4,898        3.1%
      90-99 years        225        121          0       346        0.2%
    100-109 years          5          2          0         7        0.0%
      Unknown age                              919       919        0.6%
            Total     95,602     62,408      1,061   159,071      100.1%
                       60.1%      39.2%       0.7%
    


  • RACE/ETHNICITY DATA AVAILABLE ON 77.3% = 122,946/159,071 PATIENTS.
  • BLACK: 38,341 (31.2%).
  • WHITE: 80,524 (65.5%).



    3. ANNUAL DISTRIBUTION OF CASES.


     YEAR     CASES  SPECIMENS
     1984    14,942     22,112
     1985    18,969     29,846
     1986    19,046     31,835
     1987    19,051     33,133
     1988    19,705     35,437
     1989    20,253     37,166
     1990    21,052     39,274
     1991    21,645     40,766
     1992    22,003     43,660
     1993    21,006     43,180
     1994    21,351     43,967
     1995    22,139     44,974
     1996    23,174     47,576
     1997    26,502     54,824
     1998    27,528     57,197
     1999    29,596     61,531
     2000    13,995     27,965
    Total   361,957    694,443
    




    4. DISTRIBUTION OF ORGAN SYSTEMS.



    ORGAN-SYSTEMS REPRESENTED AMONG THE 361,957 CASES.
         ORGAN SYSTEM        CASES  PERCENT
     Gastrointestinal      103,819    28.7%
      Lymphoreticular       54,597    15.1%
          Gynecologic       50,579    14.0%
                 Bone       25,578     7.1%
               Breast       20,939     5.8%
         Dermatologic       20,747     5.7%
            Obstetric       19,167     5.3%
        Genitourinary       18,916     5.2%
                Blood       17,787     4.9%
               Marrow       16,576     4.6%
                Heart       14,490     4.0%
                 Lung        8,015     2.2%
                  CNS        6,320     1.7%
        Neuromuscular        4,789     1.3%
            Endocrine        3,288     0.9%
    




    5. DE-IDENTIFIERS.




  • 23,911 (6.6%) CASES CONTAINING PROPER-NAMES.

  • TOKENIZED BY THE DE-IDENTIFICATION SOFTWARE.

  • ALL NAMES AND ALL MISSPELLINGS MUST BE CAPTURED AND INCLUDED IN THE DICTIONARY: DICTIONARY POLICEPERSON.

  • EPONYMOUS DISEASES ARE MANAGED AS MULTIPLE-WORD TERMS:
    DR. BARRETT; BARRETT ESOPHAGUS.
    DR. CLARK; CLARK LEVEL.

  • NO PROTECTION AGAINST A VERY UNUSUAL COMBINATION OF DISEASES THAT MIGHT BE KNOWN PUBLICLY.
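    The scrubbing rules above (name dictionary, keyword proximity, eponyms kept as multiple-word terms) can be sketched as follows. This is an illustration only, not the authors' software: the dictionary entries, the `[NAME]` token, and the function names are assumptions, and only a cue that precedes the name ('DR.') is handled.

```python
# Hypothetical name dictionary; the real system's lists are not shown in the source.
NAME_DICTIONARY = {"BARRETT", "CLARK", "SMITH"}
# Eponymous diseases are kept as multiple-word terms and must survive scrubbing.
EPONYMS = {("BARRETT", "ESOPHAGUS"), ("CLARK", "LEVEL")}
# Keywords that mark the following word as a proper name.
NAME_CUES = {"DR."}
TOKEN = "[NAME]"  # assumed substitution token


def split_punct(word):
    """Separate a word from its trailing punctuation."""
    core = word.rstrip(".,;:()")
    return core, word[len(core):]


def scrub(text):
    """Return the text in upper case with proper names replaced by TOKEN."""
    words = text.upper().split()
    out = []
    i = 0
    while i < len(words):
        core, tail = split_punct(words[i])
        if i + 1 < len(words):
            next_core, next_tail = split_punct(words[i + 1])
            # An eponym followed by its disease word is a medical term, not a name.
            if (core, next_core) in EPONYMS and not tail:
                out += [words[i], words[i + 1]]
                i += 2
                continue
            # A cue such as 'DR.' marks the next word as a name.
            if words[i] in NAME_CUES:
                out += [words[i], TOKEN + next_tail]
                i += 2
                continue
        # Otherwise fall back to the dictionary lookup.
        out.append(TOKEN + tail if core in NAME_DICTIONARY else words[i])
        i += 1
    return " ".join(out)
```

    For example, `scrub("Dr. Barrett reviewed the Barrett esophagus biopsy.")` keeps the disease term and tokenizes only the physician's name.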




    6. RAW WORD COUNTS.




  • 9,004,337 WORDS.

  • 27,139 DISTINCT WORDS.

  • 11,550 SINGLY OCCURRING WORDS (HAPAX LEGOMENA).

  • 15,589 MULTIPLY OCCURRING WORDS.

  • 222,175 OCCURRENCES OF WORD 'AND'.

  • MISSPELLINGS ASSUMED CONFINED TO THE 11,550 SINGLY OCCURRING WORDS.

  • MISSPELLING RATE OF 0.1% = 11,550/9,004,337.
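    The counts on this slide can be reproduced on any corpus with a few lines of code. The sketch below is an assumption about method (lower-casing and splitting on non-alphanumeric characters); the source does not say how words were tokenized.

```python
import re
from collections import Counter


def word_statistics(text):
    """Count total, distinct, singly occurring, and multiply occurring words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    hapax = sorted(word for word, n in counts.items() if n == 1)
    return {
        "total": len(words),
        "distinct": len(counts),
        "hapax": hapax,                          # hapax legomena
        "multiple": len(counts) - len(hapax),    # words seen more than once
    }


# Consistency check on the slide's own figures:
# singly occurring + multiply occurring = distinct words.
assert 11_550 + 15_589 == 27_139

stats = word_statistics("chronic inflammation and mild chronic inflamation")
```

    In the small sample, the misspelling 'inflamation' surfaces as a hapax legomenon, which is the reasoning behind bounding the misspelling rate by the hapax count.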




    7. UMLS PART-OF-SPEECH LIST.




  • EACH WORD ASSIGNED TO SYNTACTIC CATEGORY OR PART-OF-SPEECH:
  • A=Adjective;
  • B=adverB;
  • C=Conjunction (and, or, ...);
  • D=Determiner (the, this, ...);
  • H=Helpingverb;
  • I=Interrogative (who, which, why, how,..., including complementizers);
  • N=Noun;
  • P=Preposition (at, by, to, for, from,...);
  • R=pRonoun (he, she, it, we, they,...);
  • V=Verb.
    POS NAME        POS LETTER   POS DECIMAL   NO. OCCURRENCES
    Adjective       A                      1         2,187,808
    adverB          B                      2           204,278
    Conjunction     C                     16           262,589
    Determiner      D                     32           164,435
    Helpingverb     H                      4           174,048
    Interrogative   I                      8            18,338
    Noun            N                    128         4,458,102
    Preposition     P                    256           709,617
    pRonoun         R                    512            11,824
    mainVerb        V                   1024            21,205
    Adj or Adv      A|B                    3             4,586
    Adj or Noun     A|N                  129           115,075
    Adj, Noun, Vb   A|N|V               1153           114,981
    Adj or Verb     A|V                 1025           259,917
    Dtrmr or Int    D|I                   40            10,263
    Noun or Verb    N|V                 1152           275,683
    Unassigned                                          11,588
    TOTAL                                            9,004,337
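    The POS DECIMAL column behaves as a bit mask: each part of speech has its own power of two, and an ambiguous word carries the sum of its possible categories (for example A|N|V = 1 + 128 + 1024 = 1153). A minimal sketch of that encoding follows; the function names are illustrative, and the value 64 is left out because it does not appear in the table.

```python
# Bit values copied from the POS DECIMAL column.
POS_BITS = {
    "A": 1, "B": 2, "H": 4, "I": 8, "C": 16,
    "D": 32, "N": 128, "P": 256, "R": 512, "V": 1024,
}


def encode(letters):
    """Turn a POS string such as 'A|N|V' into its decimal code."""
    code = 0
    for letter in letters.split("|"):
        code |= POS_BITS[letter]
    return code


def decode(code):
    """Turn a decimal code back into its POS letters, in alphabetical order."""
    return "|".join(letter for letter in sorted(POS_BITS) if code & POS_BITS[letter])
```

    Storing the categories as one integer lets a single comparison or bitwise AND test whether a word can act as, say, a noun.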




    8. DISCOVERY METHODS.




  • ZIPF'S LAW: IN A LARGE FREE-TEXT CORPUS, A FEW HIGH-FREQUENCY WORDS ACCOUNT FOR A LARGE SHARE OF ALL WORD-OCCURRENCES.

  • WORD-RANK (r) INVERSELY PROPORTIONAL TO WORD-FREQUENCY (f): FOR CONSTANT, k, r = k/f.

  • APPROXIMATELY ONE HUNDRED BARRIER WORDS: OVER HALF OF ALL WORD-OCCURRENCES.

  • BARRIER WORDS OR STOP WORDS: COMMONLY-OCCURRING, INCLUDING ARTICLES, CONJUNCTIONS, INTERROGATIVES, COMPLEMENTIZERS, PREPOSITIONS, PRONOUNS, AUXILIARY VERBS.

  • BARRIER WORD METHOD: BARRIER WORDS ARE SEPARATORS, OR BARRIERS, BETWEEN MULTIPLE-WORD MEDICAL TERMS.

  • MULTIPLE-WORD TERMS, OR COLLOCATIONS, BOUNDED ON EITHER SIDE BY BARRIER WORDS.

  • TERMINAL ILEUM , CECUM , APPENDIX and COLON ( RIGHT HEMICOLECTOMY ) ; MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement of PERINEURAL SPACES . TUBULOVILLOUS ADENOMA and associated VASCULAR MALFORMATION in the TRANSVERSE COLON ; TUBULAR ADENOMA in the DESCENDING COLON . recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS and INFLAMED GRANULATION TISSUE in the SEROSA . multiple ADHESIONS and SEROSAL ABSCESSES with GRANULATION TISSUE , FOREIGN BODY GIANT CELLS , SCARRING , focal OSSIFICATION , and FAT NECROSIS . ISCHEMIC BOWEL DISEASE diffusely involving ILEAL MUCOSA , with focal TRANSMURAL NECROSIS and ACUTE INFLAMMATION .
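    The barrier word method illustrated by the example above (barrier words in lower case, candidate terms in upper case) can be sketched as follows. The barrier list here is a small hypothetical subset taken from the lower-case words of the example; the treatment of punctuation as a barrier is an assumption that matches how the example is laid out.

```python
import re

# Small illustrative subset; the source mentions roughly one hundred barrier words.
BARRIER_WORDS = {"and", "with", "extension", "through", "into", "involvement", "of"}


def extract_terms(text, barrier_words=BARRIER_WORDS):
    """Return the runs of non-barrier words found between barriers."""
    terms, current = [], []
    # Tokens are either alphabetic words or single punctuation characters.
    for token in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text):
        if token.isalpha() and token.lower() not in barrier_words:
            current.append(token.upper())
        else:
            # A barrier word or punctuation mark closes the current term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


terms = extract_terms(
    "MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through "
    "MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement of "
    "PERINEURAL SPACES ."
)
```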


  • ZIPF'S LAW FOR GRAMMAR: BACKUS NAUR FORM AKIN TO WORDS IN TEXT.

  • CANONICAL FORM: PREFERRED NOTATION, ENCAPSULATES ALL EQUIVALENT FORMS OF SAME CONCEPT.

  • ARBITRARY LEVELS OF RECURSION OF XML TAGS:
      <code-section>
        <c> ... <c> ... <c> ... </c></c></c>
      </code-section>
    


  • INTEGRATED CLINICAL DATA WAREHOUSE:
      <patient>
        <case>
          <specimen>
            <report-section>
              <code-section>
                <c> ... <c> ... <c> ... </c></c></c>
              </code-section>
            </report-section>
          </specimen>
        </case>
      </patient>
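    The nesting shown above can be read with any XML parser. The sketch below is illustrative only: the text inside the `<c>` elements is invented, and `max_depth` is a hypothetical helper that measures how many `<c>` tags are nested inside one another.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the element names on the slide.
SAMPLE = """<patient><case><specimen><report-section><code-section>
<c>colon <c>adenocarcinoma <c>moderately differentiated</c></c></c>
</code-section></report-section></specimen></case></patient>"""


def max_depth(element, tag="c"):
    """Largest number of `tag` elements found on any path below `element`."""
    deepest_child = max((max_depth(child, tag) for child in element), default=0)
    return deepest_child + (1 if element.tag == tag else 0)


depth = max_depth(ET.fromstring(SAMPLE))
```

    Because the function recurses on child elements, it places no fixed limit on how deeply the `<c>` tags may be nested, other than Python's recursion limit.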
    




    9. BACKUS NAUR PARSING MODEL.




  •   1. [] ==> [Nx]  
      2. [Nx] ==> [N]     Example:  HEMANGIOMA.
      3. [Nx] ==> [AN]    Example:  ACTINIC KERATOSIS.
      4. [Nx] ==> [NPN]   Example:  ADENOCARCINOMA OF COLON.
    


  • REVERSE BACKUS-NAUR-FORM PARSING: PARSER BEGINS WITH THE MORE COMPLEX EXPRESSION AND WORKS BACKWARD TO SIMPLER EXPRESSION.
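    One way to read 'begins with the more complex expression' is to try the longest noun-phrase pattern first. The sketch below does that for the three patterns listed on this slide; the greedy left-to-right scan and the function name are assumptions, not the authors' parser.

```python
# Right-hand sides of rules 2-4 on this slide, as part-of-speech letter strings.
NOUN_PHRASE_PATTERNS = ["N", "AN", "NPN"]


def segment(tags):
    """Split a POS string into noun phrases, trying the longest pattern first."""
    patterns = sorted(NOUN_PHRASE_PATTERNS, key=len, reverse=True)
    phrases = []
    i = 0
    while i < len(tags):
        for pattern in patterns:
            if tags.startswith(pattern, i):
                phrases.append(pattern)
                i += len(pattern)
                break
        else:
            # No pattern starts here; skip this tag.
            i += 1
    return phrases
```

    Trying `NPN` before `N` keeps 'ADENOCARCINOMA OF COLON' together as one phrase rather than splitting it into two separate nouns.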




    10. FREQUENCY DISTRIBUTION OF BARRIER WORDS.


    RANK    FREQUENCY   BARRIER WORD
       1      222,175   and
       2      196,153   of
       3      189,799   with
       4      107,039   for
       5      104,067   the
       6       82,104   note
       7       80,740   in
       8       78,549   right
       9       77,885   left
      10       70,923   is
      11       70,261   see
      12       67,917   are
      13       53,071   mild
      14       49,987   identified
      15       47,804   to
      16       41,467   consistent
      17       39,792   this
      18       30,352   present
      19       27,189   seen
      20       25,371   at
      21       25,097   there
      22       24,657   on
      23       24,284   or
      24       23,021   be
      25       21,243   associated
      26       19,515   was
      27       18,376   one
      28       16,122   but
      29       16,057   case
      30       16,057   from
      31       16,036   these
      32       15,672   show
      33       15,396   separate
      34       15,135   by
      35       13,776   as
      36       13,730   an
      37       13,542   has
      38       13,074   only
      39       12,615   shows
      40       11,735   portion
      41       11,487   involving
      42       10,803   two
      43       10,718   which
      44       10,448   features
      45       10,263   that
      46       10,119   low
      47       10,097   three
    




    11. FREQUENCY DISTRIBUTION
    OF MULTIPLE-WORD TERMS (COLLOCATIONS).


    RANK    FREQUENCY   COLLOCATION
       1       38,401   chronic inflammation
       2       20,328   lymph nodes
       3       18,428   diff quik
       4       16,104   soft tissue
       5       14,456   bone marrow
       6       13,104   non diagnostic
       7       13,021   diagnostic findings
       8       13,004   non diagnostic findings
       9       12,868   helicobacter pylori
      10       12,328   crypt distortion
      11       12,316   lymph node
      12       12,292   quik stain
      13       12,284   diff quik stain
      14       11,080   mild chronic
      15       10,229   epithelial changes
      16       10,004   fibroadipose tissue
      17        9,967   non specific
      18        9,052   left breast
      19        8,893   inflammatory disease
      20        8,741   gastroesophageal reflux
      21        8,234   gleason grade
      22        7,994   squamous metaplasia
      23        7,797   tubular adenoma
      24        7,237   reactive epithelial
      25        6,944   reactive epithelial changes
      26        6,793   active chronic
      27        6,714   granulation tissue
      28        6,634   seminal vesicles
      29        6,312   surgical margins
      30        6,199   lamina propria
      31        6,086   acute rejection
      32        6,038   fallopian tube
      33        6,019   cell metaplasia
      34        5,990   pelvic lymph
      35        5,932   chronic gastritis
      36        5,928   secretory endometrium
      37        5,790   hematopoietic elements
      38        5,648   bile reflux
      39        5,633   chronic cervicitis
      40        5,595   pelvic lymph nodes
      41        5,538   no helicobacter
      42        5,521   no helicobacter pylori
      43        5,422   anti inflammatory
      44        5,418   small bowel
      45        5,379   type indeterminate
      46        5,247   inflammatory drugs
      47        5,243   anti inflammatory drugs
      48        5,183   chronic inflammatory
      49        5,173   hernia sac
      50        5,003   antral mucosa
    




    12. ZIPF GRAMMAR.


      RANK  FREQUENCY       SENTENCE-PATTERN   EXAMPLE 
         1    423,177                    [N]   hemangioma
         2    106,034                 [N[N]]   liver [needle]
         3     98,958                   [AN]   left foot
         4     85,908                  [N|V]   scar
         5     79,741                 [NN|V]   skin scar
         6     62,042                  [AAN]   epidermal inclusion cyst
         7     50,461                [AN[N]]   laryngeal mass [biopsy]
         8     41,958                  [NCN]   decidua and villi
         9     38,689                [A|NPN]   negative for actinomyces
        10     26,745               [N[NPN]]   cervix [biopsy at 9:00] 
        11     22,097                [N[NN]]   cervix [biopsy 9:00]
        12     21,704                 [NPAN]   skin of left ear
        13     21,102                   [NN]   ear lobe
        14     20,638                  [BAN]   non diagnostic findings
        15     16,864               [AAN[N]]   left chest wall [biopsy]
        16     13,674                 [AAAN]   left axillary soft tissue
        17     12,798              [NCAN[N]]   skin , left flank [biopsy]
        18     12,692                [ANCAN]   soft tissue , inguinal region
        19     12,596               [ANPAAN]   fibrous plaque from left carotid artery
        20     12,507   [N[N]ANCA|VANCA|NPN]   leg [ bka ] old thrombus and calcified atherosclerotic plaque , negative for osteomyelitis 
        21     12,136         [BAANHA|VPAAN]   no helicobacter pylori organisms are identified on diff quik stain
        22     12,097              [AAAN[N]]   left true vocal cord [biopsy]
        23     10,555                [ANPAN]   soft tissue of right wrist 
        24     10,257              [A|NPNCN]   negative for fungi or afb
        25      9,952                  [ANN]   left ear lobe
        26      9,650                [N[AN]]   colon [biopsy left]
        27      9,533               [N[NCN]]   cervix [biopsy , 9:00]
        28      8,732               [ANN[N]]   left ear lobe [biopsy]
        29      8,239                  [NAN]   skin right ear
        30      7,937                    [A]   void
        31      7,550                  [NPN]   biopsy at 9:00
        32      6,862               [NCN[N]]   skin, face [biopsy]
        33      6,675                 [ANPN]   left head of femur
        34      6,121                [NCNCN]   placenta, membranes and cord 
        35      6,111               [AN[NN]]   left breast [core biopsy]
        36      6,096               [N[NPA]]   colon [biopsy of left]
        37      5,728                [N[NA]]   colon [biopsy right] 
        38      5,685              [N[NPAN]]   colon [biopsy of right colon] 
        39      5,650             [ANCAN[N]]   soft tissue , left chest [excision]
        40      5,422          [DNHA|VPDAAN]   this case was shown at the quality assurance conference
    




    13. ZIPF DISTRIBUTION OF BACKUS NAUR FORMS.


      RANK   FREQUENCY      BNF FORMULA   EXAMPLE
         1     689,478       [N] ==> []   [prostate]
         2     313,234      [AN] ==> []   [actinic keratosis]
         3     117,039     [AAN] ==> []   [hypertrophic actinic keratosis]
         4      86,762     [N|V] ==> []   [scar]
         5      80,127    [NN|V] ==> []   [skin scar]
         6      66,816     [NAN] ==> []   [skin soft tissue]
         7      60,129     [NCN] ==> []   [decidua and villi]
         8      55,728       [AN ==> [N   [actinic KERATOSIS
         9      52,777     [A|N] ==> []   [negative]
        10      47,375      [NN] ==> []   [granulation tissue]
        11      47,139       [A] ==> []   [void]
        12      42,661     [NPN] ==> []   [adenocarcinoma of colon]
        13      36,076    [AAAN] ==> []   [focal bowenoid actinic keratosis]
        14      31,946    [NPAN] ==> []   [skin with actinic keratosis]
        15      25,168     [BAN] ==> []   [focally invasive tumor]
        16      22,761    [NCAN] ==> []   [ulcer and acute inflammation]
        17      22,276     [ANN] ==> []   [exuberant granulation tissue]
        18      16,791       [NN ==> [N   [lung CARCINOMA
        19      15,577    [NAPN] ==> []   [carcinoma metastatic to lung]
        20      13,764     [NNN] ==> []   [liver gallbladder pancreas]
        21      13,212      [AAN ==> [N   [hypertrophic actinic KERATOSIS
        22      12,417     [BAN ==> [BN   [FOCALLY active GASTRITIS
        23      12,053      [NCN ==> [N   [decidua and VILLI
    




    14. DATABASE DEFINITION.



  • DATABASE DEFINITION: ROWS=PATIENTS; COLUMNS=FEATURES.

  • n > 2 PATIENTS.

  • (q+r) > 2 BINARY (POSITIVE/NEGATIVE) FEATURES.

  • SOME VARIABLES UNKNOWN OR EXCLUDED FOR CONFIDENTIALITY REASONS (MISSING-VALUES).

  • BINARY MEDICAL VARIABLES CONSECUTIVELY NUMBERED FROM 1 TO (q+r).

  • VARIABLES FROM 1 TO q ARE 'PUBLIC' (MNEMONIC: PUBLIQ).

  • VARIABLES FROM q+1 TO q+r ARE 'PRIVATE' (MNEMONIC: PRIVATE).

  • PUBLIC VARIABLES: FEATURES ORDINARILY ACCESSIBLE TO THE PUBLIC (GENDER, AGE, PHYSICAL STIGMATA, ETC.).

  • PRIVATE FEATURES OF PUBLIC PERSONS (e.g., U. S. PRESIDENT LYNDON B. JOHNSON'S CHOLECYSTECTOMY SCAR) CONCEALED AS MISSING VALUES ('MISSINGVALUIZED') ON A CASE-BY-CASE BASIS.



    15. SET THEORY DEFINITION.



  • SET THEORY DEFINITION: n > 2 PATIENTS;

  • VARIABLES FROM 1 TO q ARE 'PUBLIC'.

  • VARIABLES FROM q+1 TO q+r ARE 'PRIVATE'.

  • 'PUBLIC SET', Q = {1,-1,...,q,-q}.

  • 'PRIVATE SET', R = {q+1,-q-1,...,q+r,-q-r}.

  • 'TRUTH TABLE', T: SET OF ALL SETS t OF SIGNED FEATURE NUMBERS SUCH THAT k ∈ t OR -k ∈ t, BUT NOT BOTH, FOR EVERY FEATURE k FROM 1 TO q+r.

  • SUBTRUTH TABLE, S: SET OF ALL SUBSETS OF TRUTH TABLE ELEMENTS.



    16. TRUTH TABLE.



  • 'TRUTH TABLE', T: SET OF ALL SETS t OF SIGNED FEATURE NUMBERS SUCH THAT k ∈ t OR -k ∈ t, BUT NOT BOTH,
    FOR EVERY FEATURE k FROM 1 TO q+r.

  • FOR EACH TRUTH-TABLE-ELEMENT, t, EACH FEATURE, k, IS EITHER
    'TRUE' (+k ∈ t) OR 'FALSE' (-k ∈ t).

  • TRUTH TABLE IS THE SET OF 'COMPLETE PATIENT DESCRIPTIONS'.

  • EVERY PATIENT IS, IN PRINCIPLE, COMPLETELY DESCRIBED BY
    EXACTLY ONE TRUTH-TABLE-ELEMENT, t.

  • SUBTRUTH TABLE, S: SET OF ALL SUBSETS OF TRUTH TABLE ELEMENTS.

  • FOR EXAMPLE, FOR q=2 AND r=1, 'TRUTH TABLE', T:
     { 1, 2, 3} = f
     { 1, 2,-3} = c
     { 1,-2, 3} = e
     { 1,-2,-3} = b
     {-1, 2, 3} = g
     {-1, 2,-3} = d
     {-1,-2, 3} = h
     {-1,-2,-3} = a
    


  • Figure 1. Venn Diagram for a three-variable truth table.
                           ________________2___________
                   a       |                          |
                           |                          |
                           |                          |
           ________1_______|________                  |
           |               |       |                  |
           |               |       |                  |
           |       b       |   c   |       d          |
           |       ________|_______|______________    |
           |       |       |       |             |    |
           |       |       |       |             |    |
           |       |   e   |   f   |       g     |    |
           |       |       |       |             |    |
           |       |       |_______|_____________|____|
           |       |               |             |
           |       |               |             |
           |_______|_______________|             |
                   |                       h     |
                   |                             |
                   |                             |
                   |_______________3_____________|
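    The truth table for q=2 and r=1 can be generated mechanically, which also gives a check against Figure 1. The sketch below is illustrative: the function names are assumptions, and the region letters are read off the Venn diagram (region c lies inside sets 1 and 2 but outside set 3).

```python
from itertools import product


def public_set(q):
    """Q = {1,-1,...,q,-q}."""
    return frozenset(sign * k for k in range(1, q + 1) for sign in (1, -1))


def private_set(q, r):
    """R = {q+1,-q-1,...,q+r,-q-r}."""
    return frozenset(sign * k for k in range(q + 1, q + r + 1) for sign in (1, -1))


def truth_table(q, r):
    """Every complete patient description: exactly one of k, -k per feature."""
    features = range(1, q + r + 1)
    return [
        frozenset(sign * k for sign, k in zip(signs, features))
        for signs in product((1, -1), repeat=q + r)
    ]


# Region letters as they appear in Figure 1.
VENN_REGION = {
    frozenset({1, 2, 3}): "f",
    frozenset({1, 2, -3}): "c",
    frozenset({1, -2, 3}): "e",
    frozenset({1, -2, -3}): "b",
    frozenset({-1, 2, 3}): "g",
    frozenset({-1, 2, -3}): "d",
    frozenset({-1, -2, 3}): "h",
    frozenset({-1, -2, -3}): "a",
}

table = truth_table(2, 1)
```

    With q + r binary features the table has 2^(q+r) elements, here 8, one for each region of the diagram.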
    




    17. n-POSTING.



  • SUBTRUTH TABLE, S: SET OF ALL SUBSETS OF TRUTH TABLE ELEMENTS.

  • 'n-POSTING', P: ORDERED COLLECTION OF n > 2 'SINGLE-POSTINGS', ( P1 ,..., Ph ,..., Pn ).

  • SINGLE-POSTING, OR 'POSTING', EQUALS A MEMBER OF THE SUBTRUTH TABLE, S,
    AND REPRESENTS ONE PATIENT.

  • NOTATION: Ph ∈ P: Ph IS A POSTING IN P.

  • TWO POSTINGS, Ph AND Pi, WHERE h ≠ i, REPRESENT TWO DIFFERENT PATIENTS, BUT MAY HAVE THE SAME VALUE IN THE SUBTRUTH TABLE.

  • Ph ∈ P IS 'WEAKLY PRIVATE' IF AND ONLY IF THERE EXISTS A DISTINCT POSTING, Pi ∈ P, SUCH THAT ( Ph ∩ Q ) = ( Pi ∩ Q ) ( ∩ = set intersection) AND Ph ⊂ Pi ( ⊂ = subset).

  • Ph ∈ P IS 'STRONGLY PRIVATE' IF Ph = Pi FOR SOME DISTINCT POSTING Pi ∈ P.

  • ALGORITHM OPERATES BY TARGETING CERTAIN DATA-ELEMENTS AND 'MISSINGVALUIZING' THEM.

  • FOR WEAK PRIVACY: A PERSON IGNORANT OF A GIVEN PATIENT'S PRIVATE DATA-ELEMENTS CANNOT IDENTIFY THE PATIENT FROM INFORMATION AVAILABLE IN THE PUBLIC POSTING (THEOREM 3).

  • FOR STRONG PRIVACY: EVEN THE PATIENT HIMSELF/HERSELF CANNOT IDENTIFY HIS/HER OWN RECORD FROM THE PUBLIC POSTING (THEOREM 5).

  • SET THEORY PROOFS: CLAIM THAT A GIVEN PATIENT IS IDENTIFIED FOR A GIVEN POSTING; SHOW THAT ANOTHER POSTING COULD LIKEWISE REPRESENT THAT PATIENT.
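    The two privacy definitions can be tested directly on a small posting. The sketch below reads the weak-privacy condition as: a distinct posting Pi exists whose public part equals that of Ph and of which Ph is a subset. Treating 'subset' as allowing equality is an assumption, as are the function names and the sample data.

```python
def is_weakly_private(h, postings, public):
    """True if some other posting has the same public part and contains postings[h]."""
    ph = postings[h]
    return any(
        i != h and (ph & public) == (pi & public) and ph <= pi
        for i, pi in enumerate(postings)
    )


def is_strongly_private(h, postings):
    """True if some other posting is identical to postings[h]."""
    return any(i != h and postings[h] == pi for i, pi in enumerate(postings))


# Hypothetical 4-posting with q=2 public features and r=1 private feature.
Q = frozenset({1, -1, 2, -2})
P = [
    frozenset({1, -2}),      # private feature 3 is missing
    frozenset({1, -2, 3}),
    frozenset({1, -2}),
    frozenset({-1, 2, -3}),
]
```

    In this sample the first and third postings are identical, so each is both weakly and strongly private; the last posting has a public part shared by no other posting, so it is neither.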



    18. ALGORITHM.



  • FORM A RECTANGULAR MEDICAL DATABASE (PATIENTS=ROWS; VARIABLES=COLUMNS).

  • 1. ASSIGN PUBLIC/PRIVATE STATUS TO EACH VARIABLE.

  • 2. APPLY MEDICAL TAXONOMY TO THE DATABASE.

  • 3. INSTANTIATE MEDICALLY REDUNDANT VARIABLE VALUES.

  • 4. SORT PATIENTS IN DESCENDING ORDER OF NUMBER OF NONMISSING VALUES PER PATIENT; SORT VARIABLES IN DESCENDING ORDER OF NUMBER OF NONMISSING VALUES PER VARIABLE.

  • 5. ALL VARIABLES THAT ARE NONMISSING IN ONLY ONE PATIENT MUST BE MISSINGVALUIZED.

  • 6. START AT THE TOP (MOST VARIABLES NONMISSING) OF THE PATIENT-LIST; FOR EACH PATIENT, FIND SUBSET-PATIENT (LATER ON THE LIST). FOR STRONG PRIVACY, LOOK FOR AN EQUAL-PATIENT RATHER THAN SUBSET-PATIENT.

  • 7. MISSINGVALUIZE ADDITIONAL DATA-ELEMENTS, TO SATISFY SUBSET OR EQUALITY RELATION REQUIRED BY STEP 6.

  • 8. GO TO 3. CONTINUE TO EXHAUSTION.
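    Steps 4 and 5 of the algorithm above can be sketched on a toy database. This covers only the sorting of patients and the removal of variables that are nonmissing in a single patient, not the full loop; the representation (one dict per patient, with a missing value shown by an absent key) and the variable names are assumptions.

```python
from collections import Counter


def sort_patients(db):
    """Step 4 (patients only): most nonmissing values first."""
    return sorted(db, key=len, reverse=True)


def missingvaluize_singletons(db):
    """Step 5: drop every variable that is nonmissing in only one patient."""
    counts = Counter(variable for patient in db for variable in patient)
    return [
        {variable: value for variable, value in patient.items() if counts[variable] > 1}
        for patient in db
    ]


# Hypothetical three-patient database.
DB = [
    {"male": True, "gout": True},
    {"male": True, "gout": False, "scar": True},
    {"male": False},
]

scrubbed = missingvaluize_singletons(sort_patients(DB))
```

    Here 'scar' is recorded for one patient only, so it would single that patient out and is removed from the posting.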



    19. RESULTS.



  • PRIVACY DEFINITIONS IN THIS REPORT THUS SPECIFY AN ALGORITHM FOR REMOVING ('SCRUBBING') DATA-ELEMENTS FROM THE PUBLIC POSTING.

  • NO RECORD CAN BE MATCHED TO A SPECIFIC PATIENT, EVEN THOUGH MANY PRIVATE FACTS MAY BE INCLUDED ABOUT EACH PATIENT.

  • ALGORITHM SPECIFIES WHICH DATA-ELEMENTS MUST BE MISSINGVALUIZED IN ORDER TO DE-IDENTIFY THE DATABASE UNDER TWO PRIVACY DEFINITIONS.

  • WEAK PRIVACY: KNOWLEDGE OF PUBLIC INFORMATION ABOUT A PATIENT DOES NOT DISCLOSE THE IDENTITY OF THE POSTED RECORD (THEOREM 3).

  • STRONG PRIVACY: EVEN THE PATIENT CANNOT KNOW WITH CERTAINTY THAT A PARTICULAR RECORD BELONGS TO HIMSELF/HERSELF (THEOREM 5).



    20. DISCUSSION.



  • EVOLVING HIPAA GUIDELINES: CLEARLY STATED PARADOX BETWEEN PROTECTING A PATIENT'S CONFIDENTIAL MEDICAL RECORDS AND INHIBITING PROGRESS OF MEDICAL RESEARCH.

  • TWO GENERAL APPROACHES TO PROBLEM (SWEENEY, 1996, 1997, 1998): EITHER MISSINGVALUIZE (SCRUB) INDIVIDUAL DATA-ELEMENTS, OR ELSE INSERT ADDITIONAL, PHANTOM PATIENTS (DOPPELGANGERS).

  • SOCIAL VALUE OF PUBLIC MEDICAL RECORDS: SPECIAL INTEREST GROUPS; LOW-BUDGET RESEARCH; NON-STANDARD RESEARCH PARADIGMS.

  • DOPPELGANGER (DILUTION) APPROACH DESCRIBED IN THE CRYPTOGRAPHY LITERATURE.

  • HOWEVER, DOPPELGANGER SOLUTION IS SUBSTANTIALLY FRAUDULENT, POTENTIALLY MISLEADING TO THE LIKELY USERS OF A PUBLIC MEDICAL DATABASE (TISSUE-ARCHIVISTS, EPIDEMIOLOGISTS).



    21. DISCUSSION.



  • THIS REPORT INTRODUCES A SET THEORY DEFINITION AND AN ALGORITHM THAT IDENTIFIES WHICH DATA-ELEMENTS MUST BE MISSINGVALUIZED.

  • SET THEORY IS AN UNDERAPPRECIATED FORMALISM FOR EXAMINING CONFIDENTIALITY ISSUES IN MEDICINE.

  • PROBLEM WITH SCRUBBING: STATISTICAL METHODS DO NOT WORK WELL WITH MEDICAL DATABASES HAVING NUMEROUS MISSING VALUES [CIOS AND MOORE, 2000].

  • DIFFICULTIES IN UMLS-ENCODING A DATABASE.

  • MEDICAL LOGIC / TAXONOMY NECESSARY.

  • PUBLIC PERSONS (LBJ'S CHOLECYSTECTOMY) MUST BE SELECTIVELY MISSINGVALUIZED.

  • NEW PARADIGM: STRONG PRIVACY, WEAK PRIVACY.





    22. REFERENCES.



           1. U. S. Code of Federal Regulations, 45 CFR Subtitle A (10-1-95 Edition), part 46.101 (b) (4).

           2. U. S. Department of Health and Human Services. Standards for Privacy of Individually Identifiable Health Information.
    Fed Regist. 1999 Nov 3;64(212):59917-59966. http://aspe.hhs.gov/admnsimp/

           3. U. S. Government Documents: http://thomas.loc.gov

           4. Berman JJ, Moore GW, Hutchins GM.
    Maintaining patient confidentiality in the public domain Internet Autopsy Database (IAD).
    Proc AMIA Annu Fall Symp. 1996;:328-332.
    PMID: 8947682; UI: 97103310.

           5. Berman JJ, Moore GW, Hutchins GM.
    U. S. Senate Bill 422. The Genetic Confidentiality and Nondiscrimination Act of 1997.
    Diagn Mol Pathol. 1998 Aug;7(4):192-196.
    PMID: 9917128; UI: 99114200.

           6. Sweeney L.
    Privacy and medical-records research.
    N Engl J Med. 1998 Apr 9;338(15):1077.
    PMID: 9537887; UI: 98181820.

           7. Sweeney L.
    Guaranteeing anonymity when sharing medical data, the Datafly System.
    Proc AMIA Annu Fall Symp. 1997;:51-55.
    PMID: 9357587; UI: 98020458.

           8. Sweeney L.
    Replacing personally-identifying information in medical records, the Scrub system.
    Proc AMIA Annu Fall Symp. 1996;:333-337.
    PMID: 8947683; UI: 97103311.

           9. Carter JR, Nash NP, Cechner RL, Platt RD.
    Proposal for a national autopsy data bank. A potential major contribution of pathologists to the health care of the nation.
    Am J Clin Pathol. 76 (Suppl): 597-617, 1981.

           10. National Bioethics Advisory Commission (NBAC).
    http://bioethics.gov/general.html
    Executive Order 12975, October 3, 1995.
    Federal Register: October 5, 1995. v. 60.; no. 193. pp. 52063-52065

           11. U.S. National Library of Medicine.
    Unified Medical Language System.
    http://www.nlm.nih.gov/research/umls/

           12. Schneier B.
    Applied Cryptography, Second Edition. Protocols, Algorithms, and Source Code in C.
    New York: John Wiley & Sons, 1996.

          17. Johns Hopkins Autopsy Resource.
    http://www.netautopsy.org/

          18. Moore GW, Berman JJ.
    Anatomic Pathology Data Mining.
    In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery. Heidelberg: Springer Verlag. 2000 (in press).

          19. Suppes PV.
    Axiomatic Set Theory.
    New York: Dover Publications. 1972.

          20. Bernays P.
    Axiomatic Set Theory.
    New York: Dover Publications. 1968.

          21. Bundy A, ed.
    Artificial Intelligence Techniques: A Comprehensive Catalogue. Fourth, Revised Edition.
    Heidelberg: Springer Verlag. ISBN: 3540593233. 1997.

          22. Cios KJ, Moore GW. 2000.
    Medical Data Mining and Knowledge Discovery: An Overview.
    In: Medical Data Mining and Knowledge Discovery, Cios KJ (ed.), Heidelberg: Springer Verlag. to appear in November, 2000.

          23. Moore GW, Hutchins GM.
    Effort and demand logic in medical decision making.
    Metamedicine 1980;1:277-304.

          24. Moore GW, Hutchins GM, Miller RE.
    Token swap test of significance for serial medical data bases.
    Am J Med. 1986;80:182-190.

          25. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM. 1996.
    A prototype Internet autopsy database. 1625 consecutive fetal and neonatal autopsy facesheets spanning 20 years.
    Arch Pathol Lab Med. 1996;120:782-785.

          26. Saul JM.
    Legal Policy and Security Issues in the Handling of Medical Data.
    In: Cios KJ, (ed.). Medical Data Mining and Knowledge Discovery. Heidelberg: Springer Verlag. 2000.

          28. U. S. National Library of Medicine.
    UMLS Knowledge Sources. Eleventh Edition. Unified Medical Language System.
    U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2000.

          29. Giere W.
    Foundations of clinical data automation in cooperative programs.
    Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

          30. Hutchins GM, Berman JJ, Moore GW, Hanzlick R, the Autopsy Committee of the College of American Pathologists.
    Practice Guidelines for Autopsy Pathology.
    Arch Pathol Lab Med. 1999; 123:1085-1092.

          31. Joseph DM, Wong RL.
    Correction of misspellings and typographical errors in a free-text medical English information storage and retrieval system.
    Methods Inf Med. 1979 Oct;18(4):228-234.

          32. Manning CD, Schuetze H.
    Foundations of Statistical Natural Language Processing.
    Cambridge, MA: The MIT Press. ISBN: 0262133601. 2000.

          33. Masarie FE jr, Miller RA, Bouhaddou O, Guise NB, Warner HR.
    An Interlingua for Electronic Interchange of Medical Information: Using Frames to Map Between Clinical Vocabularies.
    Comp Biomed Res 1991; 24(4):379-400.

          34. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
    Integrated anatomic pathology reporting system using natural language diagnoses.
    Modern Pathol 1988;1:44-50.

          37. Taylor M, Saltz J, Nichols JH.
    Design of an Integrated Clinical Data Warehouse.
    J Assn Lab Automation. 2000. in press.

          38. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
    Barrier word method for detecting molecular biology multiple word terms.
    Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.

          41. Wilbur WJ.
    Overview of Books at NCBI.
    http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

          42. Wong RL, Gaynon P.
    An automated parsing routine for diagnostic statements of surgical pathology reports.
    Methods Inf Med. 1971 Jul;10(3):168-175.

          43. Wong RL, Reno JD, Hain TC, Platt RC, Gaynon PS, Joseph DM.
    Profile of a dictionary compiled from scanning over one million words of surgical pathology narrative text.
    Comput Biomed Res. 1980 Aug;13(4):382-398.

          44. Aitchison J.
    Teach Yourself Linguistics. Fifth Edition.
    Chicago: NTC/Contemporary Publishing Co. 2000. ISBN: 0844226688.

          45. Bundy A (Ed).
    Artificial Intelligence Techniques: A Comprehensive Catalogue. Fourth, Revised Edition.
    Heidelberg: Springer Verlag. 1997. ISBN: 3540593233.

          46. Chomsky N.
    Aspects of the Theory of Syntax.
    Cambridge, MA: The MIT Press. 1965.

          47. Giere W.
    Foundations of clinical data automation in cooperative programs.
    Proc 5th Annu Symp Comput Appl Med Care. 1981;5:1142-1148.

          48. Manning CD, Schuetze H.
    Foundations of Statistical Natural Language Processing.
    Cambridge, MA: The MIT Press. ISBN: 0262133601. 2000.

          49. Masarie FE Jr, Miller RA, Bouhaddou O, Giuse NB, Warner HR.
    An Interlingua for Electronic Interchange of Medical Information: Using Frames to Map Between Clinical Vocabularies.
    Comput Biomed Res. 1991;24(4):379-400.

          50. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
    Integrated anatomic pathology reporting system using natural language diagnoses.
    Modern Pathol 1988;1:44-50.

          51. Moore GW, Miller RE, Hutchins GM.
    Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method.
    In: Scherrer JR, Cote RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. North-Holland. 1989;29-39.

          52. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
    A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years.
    Arch Pathol Lab Med. 1996;120:782-785.

          53. Moore GW, Berman JJ.
    Anatomic Pathology Data Mining.
    In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery. Heidelberg: Springer Verlag. 2000 (in press).

          54. Nagao M.
    Machine Translation.
    In: Shapiro SC, ed. Encyclopedia of Artificial Intelligence. Volume 2. M-Z. New York: Wiley-Interscience. 1992:898-902.

          55. Nelson SJ, Olson NE, Fuller L, Tuttle MS, Cole WG, Sherertz DD.
    Identifying concepts in medical knowledge.
    Medinfo. 1995;8:33-36.

          56. Newmeyer FJ.
    Generative Linguistics: A Historical Perspective.
    London: Routledge. 1996.

          57. Salton G, Buckley C.
    Global text matching for information retrieval.
    Science. 1991;253:1012-1015.

          58. Taylor M, Saltz J, Nichols JH.
    Design of an Integrated Clinical Data Warehouse.
    J Assn Lab Automation. 2000 (in press).

          59. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
    Barrier word method for detecting molecular biology multiple word terms.
    Proc 12th Annu Symp Comput Appl Med Care. 1988;12:.

          60. Tymoczko T (ed.).
    New Directions in the Philosophy of Mathematics.
    Princeton, NJ: Princeton University Press. 1998.

          61. U.S. National Library of Medicine.
    Unified Medical Language System.
    http://www.nlm.nih.gov/research/umls/

          62. U.S. National Library of Medicine.
    UMLS Knowledge Sources. Eleventh Edition. Unified Medical Language System.
    U.S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2000.

          63. Wilbur WJ.
    Overview of Books at NCBI.
    http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

          64. Zipf GK.
    Human Behavior and the Principle of Least Effort. An Introduction to Human Ecology.
    Reading, MA: Addison-Wesley Press. 1949:19-55.



    Last Revised: 10/22/2000 by G. William Moore.