Submitted text should be terse and syntactically simple, and each
sentence should end with an unambiguous sentence-terminator. That is,
the sentence-terminator should not appear anywhere on the facesheet
except at the end of a sentence.
We recommend one of the following terminators:
period-space-space
period-carriagereturn-linefeed
semicolon-space
semicolon-carriagereturn-linefeed
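As an illustration only, the four recommended terminators could be
recognized with a short Python sketch such as the one below. The
function name split_sentences and the sample text are assumptions made
for this example; they are not part of the IAD software.

```python
import re

# The four recommended terminators. "\r\n" is carriage return + line feed.
TERMINATORS = [".  ", ".\r\n", "; ", ";\r\n"]

# Build one pattern that matches any of the terminators literally.
_TERMINATOR_RE = re.compile("|".join(re.escape(t) for t in TERMINATORS))


def split_sentences(facesheet_text):
    """Split free text on the recommended terminators, dropping empty pieces."""
    pieces = _TERMINATOR_RE.split(facesheet_text)
    return [p.strip() for p in pieces if p.strip()]


# Hypothetical sample text, not taken from a real facesheet.
sample = "Hypertension.  Massive cardiomegaly; Heart failure.\r\n"
print(split_sentences(sample))
```

Because a period followed by a single space is not a terminator in
this scheme, an abbreviation such as "A. fumigatus" does not split a
sentence.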
IAD records are rendered anonymous so that neither the investigator,
the IAD database administrator, nor the contributing institution
alone can trace the identity of patients included in the IAD.
First, the contributing institution strips or encodes patient
identifiers in its submitted records, so that the IAD database
administrator cannot know the identity of the patient. The IAD
database administrator then provides a new, encoded identifier for
the IAD. The resulting record is anonymous to the institution that
contributed the autopsy, as well as to the IAD database administrator
and to anyone retrieving the autopsy record from the IAD web page.
Anyone desiring further information, glass slides or tissue blocks
from a particular case would e-mail the IAD database administrator,
identifying the (doubly encoded) IAD autopsy record of interest and
his/her research objective. The database administrator then decodes
the published record number and restores the contributor record code
provided by the contributing institution. The database administrator
then forwards the institutionally coded record to the institution.
At this point, the institution may decide to do nothing, or to establish
a collaboration, with or without divulging the patient's identity,
according to its own internal procedures.
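The double-encoding workflow described above can be sketched in Python
as follows. This is a minimal illustration under stated assumptions:
the class names, the use of random hexadecimal codes, and the sample
identifiers are all hypothetical and do not describe the actual IAD
implementation.

```python
import secrets


class Institution:
    """Holds the only table linking contributor codes to patients."""

    def __init__(self):
        self._code_to_patient = {}

    def encode(self, patient_id):
        # Assumption: a random code stands in for the institution's own scheme.
        code = secrets.token_hex(8)
        self._code_to_patient[code] = patient_id
        return code

    def resolve(self, contributor_code):
        return self._code_to_patient[contributor_code]


class Administrator:
    """Holds the only table linking published IAD numbers to contributor codes."""

    def __init__(self):
        self._iad_to_contributor = {}

    def publish(self, institution_name, contributor_code):
        iad_id = secrets.token_hex(8)
        self._iad_to_contributor[iad_id] = (institution_name, contributor_code)
        return iad_id

    def decode(self, iad_id):
        # Restores the contributor code so the inquiry can be forwarded.
        return self._iad_to_contributor[iad_id]


institution = Institution()
administrator = Administrator()
contributor_code = institution.encode("MRN-0001")  # hypothetical patient identifier
iad_id = administrator.publish("Hospital A", contributor_code)
```

In this sketch the administrator can map iad_id back to the
contributor code but not to the patient, and the institution cannot
recognize iad_id without the administrator's table.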
TRANSLATION PROGRAM.
The computer translation program for converting free-text
English diagnoses into corresponding SNOMED diagnoses is based upon
the public-domain computer translation program, TRANSOFT [6,
M source code provided at IAD website]. In the initial processing,
the translator separates the free-text portion of the autopsy facesheet
into distinct sentences, using the separators described above, as well
as additional terms which often serve as concept separators in an
autopsy facesheet, as follows:
by
with
showing
through
demonstrating
consistent with
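A hedged Python sketch of splitting a sentence at these concept
separators is shown below. It is not the TRANSOFT source; the function
name split_concepts and the test phrases are assumptions. Longer
separators are tried first so that "consistent with" is not split at
"with" alone.

```python
import re

# Concept separators listed in the text.
CONCEPT_SEPARATORS = ["by", "with", "showing", "through",
                      "demonstrating", "consistent with"]

# Try longer separators first, and match whole words only.
_ordered = sorted(CONCEPT_SEPARATORS, key=len, reverse=True)
_SEPARATOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in _ordered) + r")\b",
    re.IGNORECASE,
)


def split_concepts(sentence):
    """Split one sentence into concept fragments at the separator words."""
    parts = (part.strip(" ,") for part in _SEPARATOR_RE.split(sentence))
    return [part for part in parts if part]


print(split_concepts("Bronchopneumonia with abscess formation"))
```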
Second, the translator expands text fragments which might
otherwise be lost in subsequent steps. Ordinarily, numerals
and one-letter and two-letter words are removed in subsequent
steps, so that essential numerals and words must be preserved
through prior expansion. For example, 'no', 'in', and '21' are
ordinarily removed, but may be preserved by the following substitutions:
no => negative
in situ => insitu
in vitro => invitro
21 trisomy => twentyonetrisomy
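The substitutions above might be applied as in the following Python
sketch. Only the four substitution pairs come from the text; the
function name expand and the whole-word matching rule are assumptions.

```python
import re

# Substitution pairs from the text. Multi-word phrases are listed first.
EXPANSIONS = [
    ("in situ", "insitu"),
    ("in vitro", "invitro"),
    ("21 trisomy", "twentyonetrisomy"),
    ("no", "negative"),
]


def expand(sentence):
    """Replace short but essential fragments with longer tokens."""
    for phrase, replacement in EXPANSIONS:
        # Word boundaries keep "no" from matching inside "normal".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        sentence = re.sub(pattern, replacement, sentence, flags=re.IGNORECASE)
    return sentence


print(expand("Carcinoma in situ, no invasion"))
```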
Third, the translator converts all letters in a sentence to lower case;
removes all punctuation, numerals, 1-letter and 2-letter words; and
removes all stop words, namely, articles, prepositions, conjunctions,
common modifiers, and other low-information words [6]. These three steps
leave behind a residual free-text, which can more readily be converted
into SNOMED-compatible terms.
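The third step can be sketched in Python as below. The stop-word set
here is a small illustrative assumption; the actual stop-word list is
described in reference [6] and is not reproduced in this paper.

```python
import re

# Illustrative stop words only; the real list is an assumption here.
STOP_WORDS = {"the", "and", "with", "for", "from", "this", "that"}


def normalize(sentence, stop_words=STOP_WORDS):
    """Lower-case, strip punctuation and numerals, drop short and stop words."""
    lowered = sentence.lower()
    # Replace everything except letters and whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [
        word
        for word in letters_only.split()
        if len(word) > 2 and word not in stop_words
    ]


print(normalize("Adenomatous polyp of the rectum, 2 cm."))
```

In this sketch "of" and "cm" are removed by the length rule, "2" by
the numeral rule, and "the" by the stop-word rule.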
Finally, the translator attempts a match between a single-word term
and a corresponding SNOMED-compatible term;
then, a match between a two-word term and a corresponding
SNOMED-compatible term; then, a match between a three-word term and
a corresponding SNOMED-compatible term; until no more matches are
possible. The largest successful match is used for translation.
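One possible reading of this matching step is a left-to-right search
that keeps the longest dictionary match at each position, as in the
Python sketch below. The scanning order, the function name translate,
and the tiny dictionary are assumptions; the dictionary entries are
drawn from the sample facesheet in Tables 1A and 1B.

```python
# Tiny illustrative dictionary: word tuples -> SNOMED-compatible terms.
DICTIONARY = {
    ("gallstones",): "Biliary calculus, NOS",
    ("congestion",): "Congestion, NOS",
    ("pulmonary", "congestion"): "Pulmonary congestion, NOS",
    ("focal",): "Focal",
    ("pulmonary", "emphysema"): "Pulmonary emphysema, NOS",
}


def translate(words, dictionary=DICTIONARY):
    """Return (matched terms, unmatched words), preferring the longest match."""
    terms, unmatched = [], []
    i = 0
    while i < len(words):
        best_n = 0
        # Try one-word, two-word, three-word ... candidates; keep the longest hit.
        for n in range(1, len(words) - i + 1):
            if tuple(words[i:i + n]) in dictionary:
                best_n = n
        if best_n:
            terms.append(dictionary[tuple(words[i:i + best_n])])
            i += best_n
        else:
            unmatched.append(words[i])
            i += 1
    return terms, unmatched


print(translate(["focal", "pulmonary", "emphysema"]))
```

Words returned in the second list correspond to the unmatched
material that would be queued for manual review.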
Large, unmatched autopsy facesheet sentences are placed on a list
for review by the database administrator, who performs a manual match,
and updates the translator dictionary. In many cases, the
database administrator can make an obvious match between
facesheet-free-text and SNOMED-compatible terms, such as inflectional
and adjectival forms (cyst, cysts, cystic); or synonyms and common
abbreviations (ALS, amyotrophic lateral sclerosis, Lou Gehrig's disease).
In addition, the database administrator can anticipate multiple-word
medical phrases which might occur in medical texts, using the barrier
word method [4,6,7].
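For readers unfamiliar with the barrier word method, the general idea
can be sketched as follows: low-information "barrier" words and
punctuation cut the text into runs of consecutive words, and runs of
two or more words become candidate phrases. The Python below is a
simplified illustration, not the published implementation; the barrier
list and the function name candidate_phrases are assumptions.

```python
import re

# Illustrative barrier words only; published lists are much longer.
BARRIER_WORDS = {"a", "an", "the", "of", "in", "and", "with", "was", "is", "by", "to"}


def candidate_phrases(text, barrier_words=BARRIER_WORDS):
    """Return runs of two or more words lying between barriers."""
    # Tokens are either words or single non-letter characters (punctuation, digits).
    tokens = re.findall(r"[a-z']+|[^\sa-z']", text.lower())
    phrases, current = [], []
    for token in tokens:
        is_barrier = token in barrier_words or not token[0].isalpha()
        if is_barrier:
            if len(current) > 1:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if len(current) > 1:
        phrases.append(" ".join(current))
    return phrases


text = "Evidence of amyotrophic lateral sclerosis and aspiration pneumonia in the lung."
print(candidate_phrases(text))
```

Candidate phrases produced this way would still need review before
being added to the translator dictionary.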
TABLE 1A. SAMPLE AUTOPSY FACESHEET FOR SUBMISSION TO IAD.
###123456123456^67^W^M^1985^2^NONE
CLINICAL HISTORY: Hypertension.
Massive cardiomegaly.
Heart failure.
ANATOMICAL DIAGNOSIS:
Hypertrophy and dilatation, left ventricular myocardium.
Generalized atherosclerosis, severe.
Abdominal visceral congestion.
Pulmonary congestion.
Pulmonic artery atherosclerosis.
Focal pulmonary emphysema.
Bronchopneumonia.
Gallstones.
Benign hyperplasia, prostate.
Adenomatous polyp, rectum.
Diverticula, colon.
TABLE 1B. SAMPLE AUTOPSY FACESHEET,
TRANSLATED INTO SNOMED-COMPATIBLE TERMS FOR INCLUSION IN IAD.
###54321^67^W^M^1985^2^NONE^
Hypertensive disease, NOS^ .....
Massive^ Cardiomegaly^ .....
Heart failure, NOS^ .....
Hypertrophy, NOS^ Dilatation, NOS^
Left^ Ventricle, NOS^
Myocardium, NOS^ .....
Generalized^
Atherosclerosis, NOS^
Severe^ .....
Abdominal viscera, NOS^
Congestion, NOS^ .....
Pulmonary congestion, NOS^ .....
Pulmonary artery, NOS^
Atherosclerosis, NOS^ .....
Focal^ Pulmonary emphysema, NOS^ .....
Bronchopneumonia, NOS^ .....
Biliary calculus, NOS^ .....
Benign^ Hyperplasia of prostate, NOS^ .....
Adenomatous polyp, NOS^
Rectum, NOS^ .....
Diverticulum, NOS^
Colon, NOS^ .....
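The header line of each sample facesheet begins with "###" and uses
"^" as a field delimiter. A hedged Python sketch for reading such a
line is given below. The field names are assumptions inferred from the
sample values (for example, 67 as age and 1985 as year of autopsy);
this section does not define them, and the meaning of the final field
("NONE" in the sample) is not assumed.

```python
from typing import NamedTuple


class Header(NamedTuple):
    # All field names are assumptions made for illustration.
    record_id: str
    age_years: str
    race: str
    sex: str
    autopsy_year: str
    location_code: str
    extra: str


def parse_header(line):
    """Parse a '###'-prefixed, '^'-delimited header line into named fields."""
    if not line.startswith("###"):
        raise ValueError("header lines are expected to start with '###'")
    # Table 1B shows a trailing '^', so it is stripped before splitting.
    fields = line[3:].rstrip("^").split("^")
    if len(fields) != 7:
        raise ValueError("expected 7 fields, found %d" % len(fields))
    return Header(*fields)


print(parse_header("###123456123456^67^W^M^1985^2^NONE"))
```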
RESULTS.
On July 20, 1996, the Internet Autopsy Database consisted of 49,351
autopsy facesheets from over a dozen academic medical institutions.
There were 99 files containing autopsy facesheets, comprising
59,455,676 bytes of data. In addition, there were 12 supplementary
files containing explanatory materials, translation tables, and
search demonstration software (Perl source code included).
Patients ranged in age from stillborn to 112 years old, with autopsy
dates ranging from 1889 to 1995. There were 956,272 sentence
terminators, 2,905,520 SNOMED-compatible terms and 11,333 distinct
(used once or more) SNOMED-compatible terms. A summary of these
statistics is given in Table 2.
TABLE 2. INTERNET AUTOPSY DATABASE, 7/20/96
Size of database in bytes                  59,455,676
No. of cases                                   49,351
No. of sentence terminators                   956,272
No. of SNOMED-compatible terms              2,905,520
No. of unique SNOMED-compatible terms          11,333
Number of patients in each decade:
 0 -  9 years                                  16,425
10 - 19 years                                   1,839
20 - 29 years                                   2,665
30 - 39 years                                   3,833
40 - 49 years                                   5,412
50 - 59 years                                   6,411
60 - 69 years                                   6,370
70 - 79 years                                   4,219
80 - 89 years                                   1,544
90 - 99 years                                     181
 > 99 years                                         9
age unknown                                       443
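The figures in Table 2 can be cross-checked with a few lines of
Python. The sketch below confirms that the decade counts sum to the
number of cases and derives the average number of SNOMED-compatible
terms per case. The function name decade_label is an assumption made
for this example.

```python
# Decade counts from Table 2 (second row read as 10 - 19 years).
DECADE_COUNTS = {
    "0 - 9 years": 16425, "10 - 19 years": 1839, "20 - 29 years": 2665,
    "30 - 39 years": 3833, "40 - 49 years": 5412, "50 - 59 years": 6411,
    "60 - 69 years": 6370, "70 - 79 years": 4219, "80 - 89 years": 1544,
    "90 - 99 years": 181, "> 99 years": 9, "age unknown": 443,
}
TOTAL_CASES = 49351
TOTAL_TERMS = 2905520


def decade_label(age):
    """Map an age in years (or None) to the Table 2 row it belongs to."""
    if age is None:
        return "age unknown"
    if age > 99:
        return "> 99 years"
    low = (age // 10) * 10
    return "%d - %d years" % (low, low + 9)


decade_total = sum(DECADE_COUNTS.values())
terms_per_case = TOTAL_TERMS / TOTAL_CASES
print(decade_total == TOTAL_CASES, round(terms_per_case, 2))
```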
DISCUSSION.
The importance of databases composed of anatomic pathology records
(surgical pathology report databases and autopsy databases) has been
discussed previously [4,7]. Public access to an autopsy database
extends beyond the role of the individual autopsy in patient care,
to quality assurance, research, and disease surveillance [7]. In
addition to studies that might derive wholly from the Internet
Autopsy Database, additional studies might also be conducted that
compare data from a private database with data from the public
database. In other words, hypotheses derived from a single autopsy
or from a series of autopsies could be compared with data collected
from a large number of similar cases. Since the patient's age, sex,
and year of autopsy are provided with each facesheet, the results on
a large, potentially biased autopsy sample could be age-adjusted and
sex-adjusted by standard epidemiologic methods.
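As one example of such a standard method, direct standardization
weights each stratum-specific rate by the share of that stratum in a
chosen standard population. The Python sketch below illustrates the
arithmetic only; the function name, the strata, and all numbers are
hypothetical and are not IAD data.

```python
def directly_standardized_rate(cases_by_stratum, totals_by_stratum, standard_weights):
    """Weighted average of stratum-specific rates, using standard-population weights."""
    total_weight = sum(standard_weights.values())
    rate = 0.0
    for stratum, weight in standard_weights.items():
        n = totals_by_stratum.get(stratum, 0)
        if n == 0:
            raise ValueError("no autopsies in stratum %r" % (stratum,))
        stratum_rate = cases_by_stratum.get(stratum, 0) / n
        rate += stratum_rate * (weight / total_weight)
    return rate


# Hypothetical example: two age strata with equal autopsy counts,
# but a standard population in which the older stratum is three times larger.
cases = {"under 60": 10, "60 and over": 30}
totals = {"under 60": 100, "60 and over": 100}
weights = {"under 60": 1, "60 and over": 3}
adjusted = directly_standardized_rate(cases, totals, weights)
print(adjusted)
```

Here the crude rate would be 40/200, whereas the adjusted rate gives
more weight to the older stratum, as the standard population dictates.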
What would be involved in tracing an autopsy record to an individual patient?
An autopsy "spy" might know that an individual of a certain age was
autopsied in a specific institution on a certain date. The spy
wishes to acquire additional, confidential information from the
autopsy database. Names of patients and institutions are omitted
from the database, and there is no way of knowing whether any
particular institution contributes to the IAD. Even if a particular
institution were a known contributor to the IAD, there is no way of
knowing whether the institution contributes all its autopsies to the
IAD or only selects certain types of autopsies. The spy would have
to query the database on the three known patient identifiers: the
first digit of the U.S. postal zip code or country code of the
institution, the age of the patient, and the year that the autopsy
was performed. Although this might reduce the possible matches to
a relatively small number, further inquiry would require the spy to
have specific pathologic information on the patient that could reduce
the size of the matching population. If the spy reached a point where
a reasonable guess might be made that an IAD record matches the patient,
the spy could never be certain of the match, because the database
contains no mechanism to confirm identity. In other words, a spy
who holds some confidential information of an individual's autopsy
record has a chance of acquiring additional autopsy-related confidential
information from the IAD, but the additional information obtained would
not be verifiable. The additional, unverifiable information would
consist only of a listing of SNOMED-compatible terms,
devoid of textual details.
One potential weakness in the confidentiality of the database lies
in the mechanism proposed to retrieve tissue from autopsies of
scientific interest. For instance, a researcher might wish to
embark on a molecular biologic study of tissue samples of a rare
neoplasm. Using the publicly available Internet Autopsy Database,
he notes that there are 22 autopsies in which this rare lesion was
found. Institutions maintain paraffinized tissue blocks of autopsy
material that may be suitable for molecular biology studies [8].
The researcher contacts the database administrator (email address
available at the IAD website), who forwards the researcher's message
to the contributing institution. The researcher might then contact
the institution and ask for the tissue blocks of interest, as well
as the autopsy report. An unscrupulous person might pose as a
researcher to obtain information under false pretenses. Under
current guidelines, inquiries to the IAD are all referred to the
database administrator, who then contacts the institution(s) that
contributed the autopsy facesheets of interest and gives them the
name and contact information pertaining to the researcher. The
institution then contacts the researcher at its own discretion.
Institutions that do not wish to pursue contact need not do so.
Institutions that contact the researcher must take any necessary
precautions to protect the confidentiality of their patients.
The IAD can be regarded as an experiment into a new era in which
patient data records are made available on the Internet. The
challenge in developing such databases is to protect patient
confidentiality, to attract contributors to the database, and to
provide data of value to the public.
TABLE 3. METHODS FOR MAINTAINING PATIENT CONFIDENTIALITY
1. Encode autopsy/patient identifiers, first by the contributing institution
and again by the IAD database administrator, so that each autopsy
appears with a doubly encoded identifier number that cannot be linked
to a patient by the IAD database administrator, the contributing
institution, or any user of the IAD.
2. Include autopsy data from a worldwide collection of institutions,
and omit the names of the contributing institutions.
3. Identify patient location only as the first digit of the postal zip code
(in the case of U.S. autopsies), or as the multiple-digit
international telephone exchange in the case of contributions
from foreign countries.
4. Use a large database (in excess of 40,000 cases).
5. Omit the exact date of autopsy and the exact age of each patient
(permitting only the age in years and year of autopsy).
6. Omit all free text, restricting pathologic findings to a listing
of SNOMED-compatible terms derived from the original autopsy facesheet.
REFERENCES.
1. Carter JR, Nash NP, Cechner RL, Platt RD.
Proposal for a national autopsy data bank.
A potential major contribution of pathologists to
the health care of the nation.
Am J Clin Pathol. 1981;76(Suppl):597-617.
2. Kircher T, Carter JR, Sinton E.
The national autopsy databank.
Pathologist. 1985;39:22-26.
3. Peery TM.
The autopsy data bank. A proposal for pathologists
to contribute to the health care of the nation.
Am J Clin Pathol. 1978;69(Suppl):258-259.
4. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype internet autopsy database: 1625 consecutive fetal
and neonatal autopsy facesheets spanning twenty years.
Arch Pathol Lab Med. 1996;120:782-785.
5. Court C.
GMC finds doctors not guilty in consent case.
British Medical Journal. 1995;311:1245-1246.
6. Moore GW, Berman JJ.
Object-oriented English-to-SNOMED translator
using Transoft + Hyperpad.
Symposium on Computer Applications
in Medical Care, 15:973-975.
7. Berman JJ, Moore GW.
SNOMED-Encoded surgical pathology databases:
a tool for epidemiologic investigation.
Modern Pathol. 1996;9:944-950.
8. Kleiner DE, Emmert-Buck MR, Liotta LA.
Necropsy as a research method
in the age of molecular pathology.
Lancet. 1995;346:945-948.
Last updated: 9/15/2005, by G. William Moore, MD, PhD.