Appendix 6. DEMOGRAPHIC AND LINGUISTIC INVENTORY
OF THE JOHNS HOPKINS HOSPITAL
SURGICAL PATHOLOGY DATABASE.

G. William Moore, MD, PhD
Robert E. Miller, MD
http://www.netautopsy.org/vhpsapsx.htm


Send comments and correspondence to: George.Moore4@va.gov
See also:
http://www.netautopsy.org/gwmcv.htm
http://www.netautopsy.org/protoiad.htm
http://www.netautopsy.org/snomedsp.htm
http://www.netautopsy.org/natlngpr.htm
http://www.netautopsy.org/apdmchap.htm

1. DISCLAIMER.



United States Government Work, uncopyrighted, public-domain. This document does not necessarily represent the views or policies of any United States Government agency. This document is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the document or the use or other dealings made with the document. Published in The Johns Hopkins Autopsy Resource.

2. TABLE OF CONTENTS.


1. Disclaimer.
2. Table of Contents.
3. Introduction.
4. Levels of Counting.
5. Patient Demographics.       Table 5.
6. Race/Ethnicity.       Table 6.
7. Annual Distribution of Cases.       Table 7.
8. Estimate of Future Specimens.
9. Distribution of Organ Systems.       Table 9.
10. Distribution of Queries.       Table 10.
11. Sensitivity/Specificity Analysis.
12. Raw Word Counts.
13. UMLS Part-of-Speech List.       Table 13.
14. History of Computational Linguistics.
15. Description of VHP Encoder.
16. Canonical Form.
17. Discovery Methods.
18. Reverse Backus-Naur Form Parsing.
19. Frequency Distribution of Barrier Words.       Table 19.
20. Frequency of Multiple-Word Terms.       Table 20.
21. Zipf Grammar.       Table 21a.       Table 21b.
22. Potential Pitfalls.
23. References.
24. Additional Suggested Readings.

3. INTRODUCTION.



On June 1, 2000, the demographic and linguistic content of the entire Johns Hopkins surgical pathology (JH-SP) database was inventoried according to methods of machine translation (MT) and quantitative natural language processing (QNLP) (Nagao, 1992; Moore and Berman, 2000; Manning and Schütze, 2000). We expect to perform similar evaluations of the Vanderbilt and Pittsburgh databases for the VHP-project, as part of the development and performance-evaluation of the free-text surgical pathology report parser. In addition, these methods will be applied to all free-text medical reports that will be made available to the project.

The JH-SP database spans sixteen years, from March, 1984, to the present, with patient identifiers, accession and release dates, a free-text brief clinical history (usually a short sentence), and the official free-text surgical pathology diagnosis included for all cases. As a software-requirement for electronically signing out a case, there must be an accession-date and a release-date for each case. Age-or-birthdate and sex are collected on nearly all cases (see below). The JH-SP database contained 159,071 patients with surgical pathology cases; 361,957 surgical pathology cases; and 694,443 surgical pathology specimens. Thus, the average patient in the JH-SP database had 2.3 = 361,957 / 159,071 cases apiece; and the average case in the JH-SP database contained 1.9 = 694,443 / 361,957 specimens.

4. LEVELS OF COUNTING.



It is important to note that there are at least five levels of counting in surgical pathology: patients, cases, specimens, concepts, and words. Counting at different levels is appropriate for different quantitative evaluations (Moore and Berman, 2000). This is sometimes confusing, and can result in underestimation or overestimation of pathology laboratory workloads. For example, some clinical pathology laboratories count by specimen-received; and some by how many tests are performed on that specimen (Taylor et al, 2000). This different counting can amount to an order-of-magnitude difference. Incorrect counting can also lead to false conclusions in research evaluations, since statistical tests of significance depend upon counting the number of occurrences of various events. For the discussion that follows, unless otherwise stated, all counts are by case. In some instances, it actually makes more sense to tabulate by another counter. However, for the present discussion, the confusion created by evaluating different counters would exceed the slight improvement in accuracy.

5. PATIENT DEMOGRAPHICS.



Age/sex demographics are complete for 99.3% of patients, as follows. There are 919 patients with missing age/birthdates, representing 0.6% = 919/159,071 of all patients in the JH-SP database. There are 142 patients, representing 0.09% = 142/159,071 of all patients, with a recorded age/birthdate but no recorded sex. Overall, 60.1% of patients are female; 39.2% are male; and 0.7% = 1,061/159,071 of all patients have some age/sex demographic information missing.

For patients with a known age/birthdate, the breakdown by age-in-decades and sex for patients with surgical pathology cases is as follows. In this tabulation, the age given is the age at which the first case was received for that patient. In other tabulations, it makes more sense to tabulate a different age for each case, and either count by cases or else count by patients, where patients with multiple accessions are prorated as fractions. That is, in the latter instance, in a patient with four accessions, each age-at-biopsy is reckoned as one-fourth patient (Sawyer et al, 1996). The correct way to perform various counts and evaluations may be discussed in Consortium meetings.
AGE	FEMALE	MALE	UNKNOWN SEX	TOTAL	% PATIENTS
 0-9 years	3,763	5,906	13	9,682	6.1%
10-19 years	6,409	2,841	7	9,257	5.8%
20-29 years	17,318	3,341	16	20,675	13.0%
30-39 years	18,743	5,618	13	24,374	15.3%
40-49 years	15,149	7,405	26	22,580	14.2%
50-59 years	12,269	11,057	18	23,344	14.7%
60-69 years	10,873	14,501	26	25,400	16.0%
70-79 years	8,198	9,377	14	17,589	11.1%
80-89 years	2,650	2,239	9	4,898	3.1%
90-99 years	225	121	0	346	0.2%
100-109 years	5	2	0	7	0.0%
Unknown age			919	919	0.6%
Total	95,602	62,408	1,061	159,071	100.1%
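
The prorated patient count described before the table can be sketched as follows. This is a minimal illustration, not the method used to build the table above (which counts each patient once, at the age of the first case). The function name and the input layout, a list of (patient identifier, age-at-biopsy) pairs, are assumptions made for the example.

```python
from collections import defaultdict
from fractions import Fraction


def prorated_decade_counts(accessions):
    """Count patients by age decade, prorating patients with several accessions.

    accessions: iterable of (patient_id, age_at_biopsy) pairs (hypothetical layout).
    A patient with n accessions contributes 1/n patient to the decade of each one.
    """
    ages_by_patient = defaultdict(list)
    for patient_id, age in accessions:
        ages_by_patient[patient_id].append(age)

    counts = defaultdict(Fraction)  # Fraction() starts at 0
    for ages in ages_by_patient.values():
        weight = Fraction(1, len(ages))
        for age in ages:
            decade = age // 10 * 10  # 38 -> 30, 45 -> 40
            counts[decade] += weight
    return dict(counts)


# Patient A has four accessions (one-fourth patient each); patient B has one.
accessions = [("A", 38), ("A", 39), ("A", 40), ("A", 41), ("B", 45)]
counts = prorated_decade_counts(accessions)
```

Because each patient's weights sum to one, the prorated counts always add up to the number of patients, which is the property that makes this a count "by patient" rather than "by case".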


6. RACE/ETHNICITY.



Race/ethnicity data are available on 77.3% = 122,946/159,071 patients with surgical pathology accessions, as follows:
 Asian or Pacific Islander             662          0.5%
 Black, NOS                         38,341         31.2%
 Hispanic, NOS                         340          0.3%
 Native American or Alaskan Native      88          0.1%
 White, NOS                         80,524         65.5%
 Other                               2,991          2.4%
 Total                             122,946        100.0%


7. ANNUAL DISTRIBUTION OF CASES.



The annual distribution of cases and specimens at The Johns Hopkins Hospital Department of Pathology is as follows:
 YEAR     CASES  SPECIMENS
 1984    14,942     22,112
 1985    18,969     29,846
 1986    19,046     31,835
 1987    19,051     33,133
 1988    19,705     35,437
 1989    20,253     37,166
 1990    21,052     39,274
 1991    21,645     40,766
 1992    22,003     43,660
 1993    21,006     43,180
 1994    21,351     43,967
 1995    22,139     44,974
 1996    23,174     47,576
 1997    26,502     54,824
 1998    27,528     57,197
 1999    29,596     61,531
 2000    13,995     27,965
Total   361,957    694,443


8. ESTIMATE OF FUTURE SPECIMENS.



In the last three complete years of data from the JH-SP database (1997, 1998, 1999), there were 173,552 = 54,824 + 57,197 + 61,531 specimens obtained over a period of 36 months. If the VHP-SPIN project is active for the full five years, then upon the completion date (March 31, 2006), an additional 70 months of data will be included in the JH-SP database, corresponding to an estimated additional 337,462 = (173,552 x 70) / 36 specimens, or an estimated total of 337,462 + 694,443 = 1,031,905 specimens at the end of the project period.
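
The projection above is a simple linear extrapolation, which can be checked with a few lines of code. The sketch below only restates the arithmetic in the text; the variable names are illustrative, and the assumption of a constant monthly rate is the one already made in the estimate.

```python
# Specimens per year for the last three complete years (from the annual table).
recent_specimens = {1997: 54_824, 1998: 57_197, 1999: 61_531}

specimens_to_date = 694_443   # total at the June 1, 2000 inventory
months_remaining = 70         # June 2000 through March 31, 2006

recent_total = sum(recent_specimens.values())
recent_months = 12 * len(recent_specimens)

# Constant-rate assumption: future monthly volume equals the recent average.
additional = round(recent_total * months_remaining / recent_months)
projected_total = specimens_to_date + additional

print(recent_total, additional, projected_total)
```

Since annual volume rose in each of the three years used, a constant-rate extrapolation is more likely to understate than overstate the final total.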

9. DISTRIBUTION OF ORGAN SYSTEMS.



Organ-systems represented among the 361,957 cases are shown below, using keywords that are associated with each organ-system. Since some accessions involve more than a single organ-system, the accessions add up to 106%.
     ORGAN SYSTEM        CASES  PERCENT
 Gastrointestinal      103,819    28.7%
  Lymphoreticular       54,597    15.1%
      Gynecologic       50,579    14.0%
             Bone       25,578     7.1%
           Breast       20,939     5.8%
     Dermatologic       20,747     5.7%
        Obstetric       19,167     5.3%
    Genitourinary       18,916     5.2%
            Blood       17,787     4.9%
           Marrow       16,576     4.6%
            Heart       14,490     4.0%
             Lung        8,015     2.2%
              CNS        6,320     1.7%
    Neuromuscular        4,789     1.3%
        Endocrine        3,288     0.9%


10. DISTRIBUTION OF QUERIES.



There is an existing query-system at JH-SP that allows staff members to obtain lists of cases, with different clinicopathologic diagnoses. There are 2,945 persons on staff at Johns Hopkins who are allowed to query the JH-SP system. At the time of inventory, the number of active queriers was 242. A snapshot of the query software taken on that date contained the last query from each of the 242 active queriers, as well as any saved queries. There were 4,288 queries, 2,307 distinct query-words, and 419 query-words requested at least twice. The most commonly requested word was "prostate" (153 requests), followed by "carcinoma" (130 requests), "adenocarcinoma" (117 requests), and "gleason" (114 requests). All other words were requested fewer than one hundred times. There were 1,888 words requested only once. A descending-order frequency distribution (Zipf Distribution) of the 26 words that were requested at least thirty times is shown as follows:
   RANK   FREQUENCY    WORD
      1         153    prostate
      2         130    carcinoma
      3         117    adenocarcinoma
      4         114    Gleason
      5          92    liver
      6          90    breast
      7          83    kidney
      8          71    margins
      9          68    cervix
     10          58    colon
     11          58    squamous
     12          51    lung
     13          47    thyroid
     14          46    esophagus
     15          42    bladder
     16          42    stomach
     17          41    ovary
     18          39    biopsy
     19          38    pancreas
     20          37    brain
     21          37    metastatic
     22          37    radical
     23          33    lymphoma
     24          31    Barrett
     25          30    lymph
     26          30    small


11. SENSITIVITY / SPECIFICITY ANALYSIS.



Frequent-query data will be used in the VHP-SPIN project as a tool to guarantee that there will be a high success rate in encoding the surgical pathology text-corpus, and in interpreting free-text queries to the system. In particular, for sensitivity/specificity analysis, there will be an independent measure of truth, or "gold standard," that is obtained manually by a skilled coder. The skilled coder will use tools of natural language processing, including: exhaustive natural language searches by various synonyms, hypernyms, and hyponyms; and Key Word in Context (KWIC) indexes to find near-misses (Manning and Schütze, 2000).

Various "training sets" will be specified by consortial agreement (e.g., at random, by age, by complexity of diagnosis, etc.). The Consortium will then specify a set of "binary classifiers," based upon scientific interest and quantitative natural language processing data. These binary classifiers might include such things as: prostatic adenocarcinoma (yes or no), infiltrating ductal carcinoma of breast with axillary node metastases (yes or no), Dukes' B colon adenocarcinoma with ten-year survival (yes or no), etc. Lexicon and grammar tables for the surgical pathology encoder and the query-interpreter will be rebuilt using only the data within each training set, and the encoding or query software will be optimized while constrained to that training set. The training-set-optimized system will then be run on the complementary "test set," which contains none of the cases in the training set, and its performance will be evaluated by counting false positives and false negatives against the gold standard.
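
For a single binary classifier, the evaluation against the gold standard can be sketched as follows. This is a minimal illustration under stated assumptions: the case identifiers are invented, the function names are hypothetical, and cases are represented simply as sets of identifiers flagged "yes" by the skilled coder and by the encoder.

```python
def confusion_counts(test_cases, gold_positive, predicted_positive):
    """Return (true pos, false pos, false neg, true neg) over the test set."""
    tp = fp = fn = tn = 0
    for case in test_cases:
        in_gold = case in gold_positive
        in_pred = case in predicted_positive
        if in_gold and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP).

    Raises ZeroDivisionError if the test set has no gold positives or no
    gold negatives, in which case the measure is undefined.
    """
    return tp / (tp + fn), tn / (tn + fp)


# Invented identifiers; the test set must share no cases with the training set.
training_cases = {101, 102, 103}
test_cases = set(range(1, 11))
assert not (training_cases & test_cases)

gold_positive = {1, 2, 3, 4}        # "yes" according to the skilled coder
predicted_positive = {2, 3, 4, 5}   # "yes" according to the encoder

counts = confusion_counts(test_cases, gold_positive, predicted_positive)
sensitivity, specificity = sensitivity_specificity(*counts)
```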

12. RAW WORD COUNTS.



In the JH-SP database, there are 9,004,337 words, 27,139 distinct words, 11,550 singly occurring words, and 15,589 multiply occurring words. The words ranged in frequency from 222,175 occurrences of the word "and" down to the 11,550 singly occurring words. Singly occurring words are known in QNLP as hapax legomena (Greek: said only once) (Manning and Schütze, 2000). As a first approximation, all singly occurring words are misspellings, and all multiply occurring words are correctly spelled (Gaynon and Wong, 1971; Joseph and Wong, 1979; Wong et al, 1980). This yields a misspelling rate of 0.1% = 11,550/9,004,337. Correctly spelled words and multiple-word terms (in QNLP: collocations) were assigned to standard codes, as described below. Frequent-word data will be used in the VHP-SPIN project as a tool to guarantee that there will be a high success rate for encoding of surgical pathology free-text reports in the system, by repeatedly targeting and correcting high-frequency errors. Large query logs will be investigated with similar methods.
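
The raw counts described above (total words, distinct words, and singly occurring words) can be obtained with a short script. The sketch below is illustrative only: the three-sentence text is invented, the tokenizer is a simplifying assumption (lower-cased runs of letters and numerals), and it is not the software used for the inventory.

```python
import re
from collections import Counter

# Invented toy text; "adenocarcimona" is a deliberate misspelling.
text = (
    "Adenocarcinoma of colon. Adenocarcinoma of lung. "
    "Adenocarcimona of colon and lung."
)

# A word is taken to be a run of letters and numerals (assumed tokenizer).
words = re.findall(r"[a-z0-9]+", text.lower())
frequency = Counter(words)

total_words = len(words)
distinct_words = len(frequency)
hapax_legomena = sorted(w for w, n in frequency.items() if n == 1)
first_pass_misspelling_rate = len(hapax_legomena) / total_words

print(total_words, distinct_words, hapax_legomena)
```

In this tiny text the singly occurring words are the misspelling and the correctly spelled word "and", which shows why the rule is only a first approximation; it becomes more useful in a corpus of millions of words, where common correct words recur.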



13. UMLS PART-OF-SPEECH LIST.



In English, each word may be assigned to a "syntactic category," or part-of-speech (Manning and Schütze, 2000). The public-domain UMLS part-of-speech list is adapted from the UMLS Specialist Lexicon [U. S. National Library of Medicine, 1998], where we have assigned the following one-letter-codes:
A=Adjective;
B=adverB;
C=Conjunction (and, or, ...);
D=Determiner (the, this, ...);
H=Helpingverb;
I=Interrogative (who, which, why, how,..., including complementizers);
N=Noun;
P=Preposition (at, by, to, for, from,...);
R=pRonoun (he, she, it, we, they,...);
V=Verb.


In the UMLS Specialist Lexicon, each unambiguous part-of-speech has a decimal number and a binary number, each a unique power of 2. For example, the decimal number for adjective is 1 (2-to-the-power-0), and the binary number for adjective is 00000000001; the decimal number for adverb is 2 (2-to-the-power-1), and the binary number for adverb is 00000000010; and so on. In the case of words that point to ambiguous parts-of-speech, such as WELL=adjective-or-adverb, the single-letter-notation is A|B; the decimal notation is the decimal sum, namely, 1+2=3; and the binary notation is the binary sum, namely, 00000000001 + 00000000010 = 00000000011. Many of the words in JH-SP have been provisionally assigned to parts-of-speech, as shown in the following table:
 POS NAME         POS LTR   POS DCML   NO. OCCURRENCES
 Adjective        A                1         2,187,808
 adverB           B                2           204,278
 Conjunction      C               16           262,589
 Determiner       D               32           164,435
 Helpingverb      H                4           174,048
 Interrogative    I                8            18,338
 Noun             N              128         4,458,102
 Preposition      P              256           709,617
 pRonoun          R              512            11,824
 mainVerb         V             1024            21,205
 Adj or Adv       A|B              3             4,586
 Adj or Noun      A|N            129           115,075
 Adj, Noun, Vb    A|N|V         1153           114,981
 Adj or Verb      A|V           1025           259,917
 Dtrmr or Int     D|I             40            10,263
 Noun or Verb     N|V           1152           275,683
 Unassigned                                     11,588
 TOTAL                                       9,004,337
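
The power-of-2 coding described before the table behaves like a bit mask, so ambiguous parts-of-speech can be encoded and decoded mechanically. The sketch below uses the letter and decimal assignments from the table; the function names are hypothetical, and the 11-digit binary width follows the examples in the text.

```python
# Letter-to-decimal assignments copied from the table above.
POS_BITS = {
    "A": 1, "B": 2, "H": 4, "I": 8, "C": 16,
    "D": 32, "N": 128, "P": 256, "R": 512, "V": 1024,
}


def encode_pos(notation):
    """Convert single-letter notation such as 'A|B' to its decimal sum."""
    return sum(POS_BITS[letter] for letter in notation.split("|"))


def decode_pos(value):
    """Convert a decimal code back to alphabetical single-letter notation."""
    letters = sorted(l for l, bit in POS_BITS.items() if value & bit)
    if sum(POS_BITS[l] for l in letters) != value:
        raise ValueError(f"{value} contains a bit with no assigned letter")
    return "|".join(letters)


def binary_notation(value):
    """Render a decimal code as the 11-digit binary string used in the text."""
    return format(value, "011b")


well = encode_pos("A|B")   # WELL = adjective-or-adverb
print(well, binary_notation(well), decode_pos(1153))
```

Because each part-of-speech owns one bit, the decimal sum of an ambiguous assignment is unique, and a test such as `code & POS_BITS["N"]` answers "can this word be a noun?" without string handling.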




14. HISTORY OF COMPUTATIONAL LINGUISTICS.



Computational Linguistics was invented as a modern intellectual discipline with the appearance of Prof. Noam Chomsky's thesis on the linguistics of Hebrew [Chomsky, 1949]. The central paradigm of computational linguistics is that all human languages may be represented as an interconnected collection of production rules [Bundy, 1997], in so-called Backus Naur Form (BNF), originally used by computer scientists to describe computer languages, such as ALGOL [Naur, 1960]. The development of this field over the past half-century has been dominated by Chomsky's intellectual leadership and academic trends at large universities with computational linguistics departments. The mainstream computational linguistics literature devotes an inordinate amount of attention to interpretation of peculiar natural-language sentences, that would appear strange even to a native speaker and incomprehensible to a non-native speaker (Newmeyer, 1996). By contrast, scant attention has been paid to the quantitative and statistical behavior of technically precise sentences, say, in surgical pathology free-text, which employ a restricted vocabulary, and are intended to be unambiguous in written form, even when read by an adversary (i.e., the proverbial plaintiff's pathologist in a malpractice action).

      The negative impact of these general trends in computational linguistics has been summarized by Nagao [1992], who has devoted his career to machine translation of narratives in electrical engineering from Japanese-to-English:
"Linguistic theories ... do not cover varieties of exceptional expressions which practical machine translation systems have to handle. A machine translation system, which is still imperfect and will never be completed, is exposed to very crude tests when the system construction reaches a certain stage. At that stage of development, the system is given a comparatively simple sentence for translation, with structures that can be analyzed by a grammar given to the system. After completion, people other than those who developed the system are asked to translate a variety of texts such as newspaper articles, science magazines, patent documents, contract documents, and commercial letters. Because the documents have not been adequately tested at the development stage, users are disappointed by the poor translation results produced by the system. Many of the failures of the system come from the fact that the dictionary and the grammar are not sufficient to accept such unexpected input sentences."
In our view, it would have been more fruitful for emerging technical fields such as pathology informatics if the computational linguists had devoted more attention to quantitative studies, and had attempted to delineate the quantitative behavior of domain-specific translations. It should not suffice to show that an unusual counterexample spoils a translation system; but rather that a frequent class of counterexamples spoils a translation system.

15. DESCRIPTION OF THE VHP-ENCODER.



The VHP-encoder is designed to embed standardized coding language, such as SNOMED or UMLS, into coherent free-text sections of surgical pathology reports extracted from production information systems. The extracted files will be presented to the encoder with XML tags in place denoting free-text areas for processing. The encoder will add additional XML tags containing standardized identifiers for the concepts contained in the text. An XML-tagged file contains data in specific, tagged locations, so that the data can readily be recovered for indexing and for quantitative studies, including the sensitivity/specificity studies planned for the VHP-SPIN project. For example, an XML-tagged record containing "adenocarcinoma" would tie the UMLS concept unique identifier for adenocarcinoma (C0001418) to the word "adenocarcinoma" in the record. Such XML-tagged records, when compared to a gold standard, can readily be used to calculate the false negative/false positive rates required for sensitivity/specificity analysis. This approach also has the advantage of maintaining the integrity of the original pathology report text: all metadata added to the report to create the tagged record is internal to XML tags. The original report text can be regenerated by stripping the tags from the record.

The examples given here employ UMLS as the target language, but the principles should be the same for any coding language that is rich enough in pathology concepts and grammatical relationships to contain the essential ideas of surgical pathology. The XML tagging format shown below is compatible with multiple coding schemes because it allows specification of a coding scheme by name (scheme) and the value of the code for the appropriate concept (value). Although the VHP-encoder has been tested initially on the Johns Hopkins surgical pathology (JH-SP) free-text corpus, it is expected that the general principles will be applicable across institutions, to other medical free-text available to the VHP-project, and to query free-text as well. The query engine is then expected to match coded queries with coded surgical pathology diagnoses.

The VHP-encoder employs methods of machine translation (MT) and quantitative natural language processing (QNLP) (Nagao, 1992; Manning and Schütze, 2000; Moore and Berman, 2000). In QNLP, it is acceptable to mistranslate a few sentences, but not too many sentences. The VHP-encoder has detailed bookkeeping mechanisms, which are required by the underlying QNLP methods, and which will eventually assist in quantitative performance studies to be conducted by the VHP-project. First, the VHP-encoder attempts to obtain a grammatical parse for each sentence in the surgical pathology text-corpus, or in a given free-text query. If the sentence passes the grammatical parser, then the VHP-encoder matches each word or term in the sentence to its corresponding UMLS code.

      A preliminary list of words, multiple-word terms (so-called collocations), their corresponding parts-of-speech and UMLS codes has been prepared for the high-frequency words and collocations in JH-SP text-corpus. However, this list will be reviewed and enriched by participants from each institution, as part of the proposed VHP-project.

      In QNLP, a 'word' is operationally defined as a string of letters and numerals, bounded on either side by blanks or punctuation (Kucera and Francis, 1967). A list of words in descending order of word-frequency is called a "Zipf distribution." The most-frequent word has rank one; the second-most-frequent word has rank two, etc. According to Zipf's Law, rank, r, is inversely proportional to frequency, f (Zipf, 1949). Although Zipf's Law is inexact in detail (Mandelbrot, 1954; Moore et al, 1988), it encapsulates the general truth that a few hundred high-frequency words account for over half the word-occurrences in a large text-corpus. These high-frequency words are typically articles, conjunctions, complementizers, interrogatives, prepositions, pronouns, auxiliary verbs, etc., and serve as boundaries, or "barriers," on either side of medically-significant words, or "keywords" (Tersmette et al, 1988; Moore et al, 1989; Nelson et al, 1995; Wilbur, 2000). Punctuation marks are operationally defined as barrier words. For example, the collocation 'squamous cell carcinoma' is likely to occur in a free-text corpus, bounded on either side by barrier words. Thus, a list of barrier words can be used to discover nearly all the collocations in a surgical pathology or query-log free-text corpus.

      For the VHP-encoder, each word and each collocation has been assigned to a part-of-speech (possibly ambiguous), and to a UMLS code. In many instances, UMLS code-assignments have an exact character-for-character match in the official UMLS database (U. S. National Library of Medicine, 2000) or correspond to obvious synonyms. High-frequency terms are assigned manually by an experienced coder; mid-frequency and low-frequency terms are assigned either from previously published lists, or algorithmically. Each sentence-to-be-translated is pointed to its corresponding part-of-speech sequence. Then a series of formulas (production rules, expressed in so-called Backus-Naur-Form (BNF)) attempt to reduce stepwise the initial part-of-speech sequence down to a null-sentence. This is the same process by which a computer compiler is designed to interpret a syntactically correct computer program. (The essential difference between computer grammars and human grammars is that human grammars 'leak'.) Failure to reduce the part-of-speech sequence to a null-sentence suggests that the sentence is ungrammatical. Success supplies a formula that points the source sentence to positions for UMLS codes in an XML-tagged file.

      For example:
   [adenocarcinoma   of   colon   metastatic   to   lung]
   [      N          P      N         A        P     N  ]
yields a sentence-pattern of [NPNAPN], where A=adjective, N=noun, P=preposition. The VHP-encoder determines, either stepwise or in a single step, that [NPNAPN] is a grammatical sentence-pattern. For each grammatical sentence, the encoder assigns UMLS codes, for example:
  [ ADENOCARCINOMA     OF       COLON   METASTATIC     TO          LUNG   ]
  [    C0001418     C0332285  C0009368   C0027627    C0332286    C0024109 ]
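
The tagging and stepwise reduction described above can be sketched in a few lines. This is only an illustration of the idea: the four production rules below are hypothetical, chosen so that the example pattern reduces, and they are not the VHP-encoder's actual grammar tables. The part-of-speech assignments are those of the example sentence.

```python
# Parts-of-speech for the example sentence (A=adjective, N=noun, P=preposition).
POS = {
    "adenocarcinoma": "N", "of": "P", "colon": "N",
    "metastatic": "A", "to": "P", "lung": "N",
}

# Hypothetical production rules, applied right-to-left as reductions.
RULES = [
    ("APN", "A"),  # adjective + prepositional phrase -> adjective phrase
    ("NPN", "N"),  # noun + prepositional phrase      -> noun phrase
    ("NA", "N"),   # noun + trailing adjective phrase -> noun phrase
    ("AN", "N"),   # adjective + noun                 -> noun phrase
]


def sentence_pattern(sentence):
    """Map each word of the sentence to its part-of-speech letter."""
    return "".join(POS[word] for word in sentence.lower().split())


def reduce_pattern(pattern):
    """Apply the first matching rule repeatedly; return every intermediate step."""
    steps = [pattern]
    while True:
        for lhs, rhs in RULES:
            if lhs in pattern:
                pattern = pattern.replace(lhs, rhs, 1)
                steps.append(pattern)
                break
        else:
            break  # no rule applies any more
    if pattern == "N":  # a lone noun phrase is a complete sentence here
        steps.append("")  # the null-sentence
    return steps


def is_grammatical(pattern):
    return pattern != "" and reduce_pattern(pattern)[-1] == ""


pattern = sentence_pattern("adenocarcinoma of colon metastatic to lung")
print(pattern, reduce_pattern(pattern))
```

Every rule shortens the pattern, so the loop always terminates. Recording the list of steps, rather than only the final result, is one simple way to track which rules are used and how often.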


      Finally, the VHP-encoder creates an XML tag sequence containing the codes:
  <code-section scheme="UMLS">
    <c type="morph" value="C0001418">adenocarcinoma
      <c type="topo" value="C0009368">colon
        <c type="morph" value="C0027627">metastatic
          <c type="topo" value="C0024109">lung
          </c>
        </c>
      </c>
    </c>
  </code-section>
where the <code-section> tag denotes a section of coding data that is added by the encoder after each parsed text section of the document. The scheme attribute specifies the particular coding scheme to be used in the section, allowing the XML tagging format to be used with multiple coding schemes. A coding section contains <c> tags representing individual concepts; their type (morphology or topography, here) and value attributes represent the category of the code (or coding axis in SNOMED) and the code value. Concept tags also contain the text word or phrase from which the concept was derived and optionally contain other concept tags. The pattern of containment expresses hierarchical relationships. The coding sections do not alter the free text of the parsed sections, and therefore the text and structure of the original document are always available if needed. The details of this final encoding step will be enriched and modified by the consortium members during Phase I of the project.
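
A nested coding section of this kind can be generated and read back with a standard XML library. The sketch below uses Python's xml.etree.ElementTree; the function names are hypothetical, and the simple rule of nesting each concept inside the previous one is an assumption that merely mirrors the example above, not a statement of how the encoder decides containment.

```python
import xml.etree.ElementTree as ET

# (type, value, text) triples taken from the example coding section.
CONCEPTS = [
    ("morph", "C0001418", "adenocarcinoma"),
    ("topo", "C0009368", "colon"),
    ("morph", "C0027627", "metastatic"),
    ("topo", "C0024109", "lung"),
]


def build_code_section(concepts, scheme="UMLS"):
    """Build a <code-section> element with each concept nested in the previous."""
    root = ET.Element("code-section", {"scheme": scheme})
    parent = root
    for concept_type, value, text in concepts:
        child = ET.SubElement(parent, "c", {"type": concept_type, "value": value})
        child.text = text
        parent = child  # assumed containment rule: chain the concepts
    return root


def read_concepts(root):
    """Recover (type, value, text) triples in document order."""
    return [
        (c.get("type"), c.get("value"), (c.text or "").strip())
        for c in root.iter("c")
    ]


section = build_code_section(CONCEPTS)
xml_text = ET.tostring(section, encoding="unicode")
round_trip = read_concepts(ET.fromstring(xml_text))
print(xml_text)
```

Reading the codes back out of the serialized section is the operation needed for indexing and for comparison against a gold standard.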



16. CANONICAL FORM.



In mathematics, a "canonical form" is a preferred notation that encapsulates all equivalent forms of the same concept. For example, the canonical form for one-half is 1/2, and there is an algorithm for reducing the infinity of equivalent expressions, or "aliases," namely 2/4, 3/6, 4/8, 5/10, ... , down to 1/2. Agreement upon a canonical form is one of the features of any mature intellectual discipline. For example, the importance of a canonical form became apparent to the English-language-dictionary writers of the eighteenth century, who realized that one couldn't write a dictionary without consistent orthography. The variable orthography, say, of fourteenth-century English poet, Geoffrey Chaucer, could not be supported by the need to have each individual English word appear and be defined at only one place in the dictionary. Investigators working with raw medical text have reached the same conclusion.

      Unfortunately, in anatomic pathology, even simple concepts, consisting entirely of correctly spelled words, have no canonical form. For example, even such a simple idea as "adenocarcinoma of colon metastatic to lung" has no consistent form of expression. The distinct medical terms, of course, all have a unique spelling and meaning; but the following expressions (and many others, easy to imagine) are all equivalent in a surgical pathology report:
 Colon adenocarcinoma metastatic to lung 
 Colonic adenocarcinoma metastatic to lung 
 Large bowel adenocarcinoma metastatic to lung  
 Large intestine adenocarcinoma metastatic to lung 
 Large intestinal adenocarcinoma metastatic to lung  
 Colon's adenocarcinoma metastatic to lung
 Adenocarcinoma of colon with metastasis to lung 
 Adenocarcinoma of colon with lung metastasis
 Adenocarcinoma of colon with pulmonary metastasis 
      Of course, some of these forms would likely never be encountered in either a surgical pathology report or a query (e.g., colon's adenocarcinoma metastatic to lung), but both the surgical pathology encoder and the query engine should nonetheless be able to cope with oddball sentences consisting of all correctly-spelled words, at least to the point of suggesting to the user what he/she is really asking for. If there is no canonical form for equivalent ideas in biomedicine, then how are indexes and statistical tables constructed, when these query methods depend upon equivalent concepts being assembled together? (Masarie et al, 1991).
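
One simple way to bring the equivalent expressions listed above together is to map each word or collocation to its code and compare the resulting sets of codes. The sketch below is an illustration under stated assumptions: the synonym lexicon and barrier list are hand-made for this example, and the codes are the UMLS identifiers used elsewhere in this appendix.

```python
# Hand-made synonym lexicon (assumption); codes as used in this appendix.
LEXICON = {
    "adenocarcinoma": "C0001418",
    "colon": "C0009368", "colonic": "C0009368", "colon's": "C0009368",
    "large bowel": "C0009368", "large intestine": "C0009368",
    "large intestinal": "C0009368",
    "metastatic": "C0027627", "metastasis": "C0027627",
    "lung": "C0024109", "pulmonary": "C0024109",
}
BARRIERS = {"of", "to", "with"}


def canonical_codes(phrase):
    """Return the set of codes in a phrase, preferring two-word collocations."""
    words = phrase.lower().split()
    codes = set()
    i = 0
    while i < len(words):
        two_words = " ".join(words[i:i + 2])
        if two_words in LEXICON:
            codes.add(LEXICON[two_words])
            i += 2
        elif words[i] in LEXICON:
            codes.add(LEXICON[words[i]])
            i += 1
        elif words[i] in BARRIERS:
            i += 1
        else:
            raise KeyError(f"word not in lexicon: {words[i]}")
    return frozenset(codes)


VARIANTS = [
    "Colon adenocarcinoma metastatic to lung",
    "Colonic adenocarcinoma metastatic to lung",
    "Large bowel adenocarcinoma metastatic to lung",
    "Large intestine adenocarcinoma metastatic to lung",
    "Large intestinal adenocarcinoma metastatic to lung",
    "Colon's adenocarcinoma metastatic to lung",
    "Adenocarcinoma of colon with metastasis to lung",
    "Adenocarcinoma of colon with lung metastasis",
    "Adenocarcinoma of colon with pulmonary metastasis",
]
distinct_forms = {canonical_codes(v) for v in VARIANTS}
```

All nine variants collapse to one set of four codes. An unordered set discards the grammatical relationships, however: "lung adenocarcinoma metastatic to colon" would yield the same set, so a usable canonical form must also record which site is primary and which is metastatic.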



17. DISCOVERY METHODS.



There are a number of practical methods that one may use to discover the quantitative properties of, and build dictionaries and grammars for, long free-text medical documents (text-corpora). The methods were not widely known in the computational linguistics community until a decade ago, because of the neglect of quantitative methods and statistics among computational linguists [Chomsky, 1965; Chomsky, 1968; Manning and Schütze, 2000].

      'Zipf's Law' formalizes the observation that the high-frequency words in a large free-text document, or text-corpus, are extremely common (Zipf, 1949; Mandelbrot, 1954; Fedorowicz, 1982; Giere, 1981; Zhang, 1981; Moore et al, 1988; Moore et al, 1989). The frequency of words in a free-text document may be ranked, with rank=1 for the most frequent word, rank=2 for the second-most frequent word, etc. According to Zipf's Law, word-rank is inversely proportional to word-frequency. That is, for some constant, k, r = k/f, for r=rank and f=frequency. Thus in a large text-corpus (over a million words), approximately one hundred barrier words (ranks 1 through 100) account for over half the word-occurrences in the document.
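
The rank-frequency relationship can be examined directly from a table of word counts. The sketch below ranks words, computes the product of rank and frequency (the constant k in r = k/f, if Zipf's Law held exactly), and measures how much of the total the top-ranked words cover. The five counts are the top five query-words from the table in Section 10; the function names are hypothetical.

```python
# Top five query-words and their request counts (from the Section 10 table).
COUNTS = {
    "prostate": 153, "carcinoma": 130, "adenocarcinoma": 117,
    "gleason": 114, "liver": 92,
}


def zipf_rows(counts):
    """Return (rank, word, frequency, rank * frequency) in descending frequency."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]


def coverage(counts, top_n):
    """Fraction of all occurrences accounted for by the top_n ranked words."""
    freqs = sorted(counts.values(), reverse=True)
    return sum(freqs[:top_n]) / sum(freqs)


rows = zipf_rows(COUNTS)
for row in rows:
    print(row)
print(coverage(COUNTS, 2))
```

In this short slice the product of rank and frequency rises from 153 to 460 rather than staying constant, which shows how far from exact the law can be in detail. The coverage figure here is relative to these five words only, not to the whole query log.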

      "Barrier words," or "stop words," are commonly-occurring, typically short (up to five letters) words in English, that have low information content in medicine. They include articles, conjunctions, complementizers, interrogatives, prepositions, pronouns, auxiliary verbs, etc. Punctuation marks are operationally defined as barrier words. Most high-frequency Zipf words are barrier words. Barrier words typically express grammatical relationships, as in "adenocarcinoma of..." or "metastatic to...."

      The "Barrier Word Method" exploits the fact that barrier words (low-information words in pathology) serve as separators, or "barriers," between multiple-word medical terms. Multiple-word terms, or "collocations," in pathology free-text (e.g., tubular adenoma, prostatic adenocarcinoma), are typically bounded on either side by barrier words. Therefore, a pathology collocation may be discovered at first encounter by obtaining all word-sequences bounded on either side by barrier words. In the following example, the barrier words are shown in lower case, and the pathology collocations are shown in upper case:
TERMINAL ILEUM , CECUM , APPENDIX and COLON ( RIGHT HEMICOLECTOMY ) ; MODERATELY DIFFERENTIATED COLONIC ADENOCARCINOMA , with extension through MUSCULARIS PROPRIA into PERICOLIC SOFT TISSUE , and with involvement of PERINEURAL SPACES . TUBULOVILLOUS ADENOMA and associated VASCULAR MALFORMATION in the TRANSVERSE COLON ; TUBULAR ADENOMA in the DESCENDING COLON . recent COLOSTOMY SITE with SUBMUCOSAL FIBROSIS and INFLAMED GRANULATION TISSUE in the SEROSA . multiple ADHESIONS and SEROSAL ABSCESSES with GRANULATION TISSUE , FOREIGN BODY GIANT CELLS , SCARRING , focal OSSIFICATION , and FAT NECROSIS . ISCHEMIC BOWEL DISEASE diffusely involving ILEAL MUCOSA , with focal TRANSMURAL NECROSIS and ACUTE INFLAMMATION .
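
The barrier word method illustrated above amounts to splitting the text at barrier words and punctuation and keeping the runs of words in between. The sketch below is a minimal version: the barrier list contains only the few words needed for one sentence of the example, and the tokenizer (runs of letters and numerals, with each punctuation mark as its own token) is a simplifying assumption.

```python
import re

# Tiny barrier list for this example only; a working list would be far longer.
BARRIER_WORDS = {"and", "associated", "in", "the"}


def is_barrier(token):
    """Punctuation marks are operationally defined as barrier words."""
    return token in BARRIER_WORDS or not token[0].isalnum()


def barrier_word_terms(text):
    """Return every run of non-barrier words bounded by barriers."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())
    terms, current = [], []
    for token in tokens:
        if is_barrier(token):
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:  # text ended without a closing barrier
        terms.append(" ".join(current))
    return terms


sentence = (
    "Tubulovillous adenoma and associated vascular malformation in the "
    "transverse colon; tubular adenoma in the descending colon."
)
terms = barrier_word_terms(sentence)
print(terms)
```

Each term is found the first time it appears, with no frequency threshold and no prior part-of-speech assignment. Single words bounded by barriers are returned as one-word terms.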
      Salton pioneered various statistical methods for discovering collocations (Salton and McGill, 1983; Salton and Buckley, 1991). In such methods, the frequency of each multiple-word sequence is compared against the expected frequency of independently-occurring single words in the sequence. A multiple-word sequence that has substantially higher than expected frequency is harvested as a collocation. The principal deficiency with this method for discovering new collocations is that a given collocation must occur multiple times before it is discovered. The barrier word method, by contrast, discovers a new collocation at first encounter [Moore et al, 1989].
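For contrast, the following Python sketch shows one simple frequency-based variant: the observed count of each adjacent word pair is divided by the count expected if the two words occurred independently. The thresholds (min_count, min_ratio), the toy text, and the function name are assumptions for illustration; this is not Salton's exact formulation.

```python
from collections import Counter

def frequent_pairs(tokens, min_count=2, min_ratio=2.0):
    """Return {pair: observed/expected} for adjacent word pairs that occur
    at least min_count times and at least min_ratio times more often than
    expected if the two words occurred independently."""
    n = len(tokens)
    singles = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    kept = {}
    for (first, second), observed in pairs.items():
        expected = singles[first] * singles[second] / n
        ratio = observed / expected
        if observed >= min_count and ratio >= min_ratio:
            kept[(first, second)] = ratio
    return kept

# Toy text (hypothetical); a realistic corpus would be far larger.
tokens = "tubular adenoma of colon and tubular adenoma of rectum".split()
result = frequent_pairs(tokens)
print(result)
```

The min_count threshold makes the limitation concrete: a pair seen only once, however genuine, is never harvested. On a text this small, pairs containing barrier words can also pass the test, so a practical version would probably need to exclude them.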

      "Collocative Filtering" discovers new collocations by selecting collocations with a desirable part-of-speech sequence, such as: AN (e.g., actinic keratosis); NN (e.g., ear lobe); NPN (e.g., adenocarcinoma of colon), etc. [Justeson and Katz, 1995]. The principal deficiency with this method for discovering new collocations is that it presupposes a prior list of parts-of-speech for each word in the analyzed text-corpus. Thus, collocations are not discovered until after the parts-of-speech are assigned, which is sometimes cumbersome. Again, the barrier word method is not subject to this limitation.
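A minimal Python sketch of part-of-speech filtering is given below. The one-letter tags follow the examples in the text (A=adjective, N=noun, P=preposition); the tiny lexicon and the function name are hypothetical. The sketch is not the algorithm of Justeson and Katz, only an illustration of the dependence on a prior part-of-speech list.

```python
ALLOWED_PATTERNS = {"AN", "NN", "NPN"}

# Hypothetical lexicon mapping each word to a one-letter part-of-speech.
LEXICON = {"actinic": "A", "keratosis": "N", "ear": "N", "lobe": "N",
           "adenocarcinoma": "N", "of": "P", "colon": "N"}

def filter_collocations(tokens, lexicon=LEXICON, allowed=ALLOWED_PATTERNS):
    """Return every word window whose part-of-speech sequence is allowed."""
    found = []
    for size in sorted({len(pattern) for pattern in allowed}):
        for start in range(len(tokens) - size + 1):
            window = tokens[start:start + size]
            if any(word not in lexicon for word in window):
                continue                     # no tag available, so no decision
            tags = "".join(lexicon[word] for word in window)
            if tags in allowed:
                found.append(" ".join(window))
    return found

print(filter_collocations("adenocarcinoma of colon".split()))
print(filter_collocations("bowenoid keratosis".split()))
```

The second call returns an empty list because "bowenoid" has no entry in the lexicon, which is the limitation noted above: nothing is discovered until parts-of-speech have been assigned.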

      "Zipf's Law for Grammar" is the assertion that the production rules (Backus-Naur Form) necessary for parsing a long, free-text document satisfy quantitative relationships similar to those observed for words in classical Zipf's Law. In particular, there are a few parsing formulas that are highly repetitive (left-Zipf-grammar). If these left-Zipf parsing formulas are well-constructed, then the quantitative performance of the parser should be satisfactory. In the VHP-parser, one can track the usage of each BNF parsing formula.

      "Canonical form" is a preferred notation that encapsulates all equivalent forms of the same concept. For the VHP project, the canonical form will be an XML specification, and a standard translation from XML into generic English. A simplified illustration of the suggested canonical form is:
  <code-section>
    <c> ... <c> ... <c> ... </c></c></c>
  </code-section>
with arbitrary levels of recursion of XML tags, where <code-section> contains a span of free text, and <c> contains a concept word or phrase with a concept code. The coding scheme in each tag sequence is identified by an attribute in the <code-section> tag. The concept codes within <c> tags could represent a variety of different types of concepts, including quantifiers and relationships. These concept types are identified by a "type" attribute within the tag and the concept code value is specified by a second attribute. Hierarchical relationships between concepts are expressed in the containment pattern of the concept tags. The order and content of the tags will be developed further during Phase 1 of the project. The general pattern described above may be embedded into the XML-hierarchy suggested by Taylor et al (2000), for the design of an integrated clinical data warehouse, as follows:
  <patient>
    <case>
      <specimen>
        <report-section>
          <code-section>
            <c> ... <c> ... <c> ... </c></c></c>
          </code-section>
        </report-section>
      </specimen>
    </case>
  </patient>
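The nesting of concept tags can be sketched with Python's standard xml.etree.ElementTree module. The "type" attribute is named in the text; the attribute names "scheme" and "code", the type values, and every code value shown are placeholders assumed for illustration, since the order and content of the tags are left to Phase 1 of the project.

```python
import xml.etree.ElementTree as ET

# Coding scheme identified by an attribute of <code-section> (name assumed).
section = ET.Element("code-section", {"scheme": "EXAMPLE-SCHEME"})

# Each <c> carries a "type" attribute and a second attribute with the code.
outer = ET.SubElement(section, "c", {"type": "diagnosis", "code": "PLACEHOLDER-1"})
outer.text = "adenocarcinoma "
middle = ET.SubElement(outer, "c", {"type": "relationship", "code": "PLACEHOLDER-2"})
middle.text = "of "
inner = ET.SubElement(middle, "c", {"type": "site", "code": "PLACEHOLDER-3"})
inner.text = "colon"

xml_text = ET.tostring(section, encoding="unicode")
print(xml_text)

# The free text is recovered by reading the element text in document order.
free_text = "".join(section.itertext())
print(free_text)
```

Hierarchical relationships are carried by containment alone, so the same free text can be recovered from the tagged form without reference to the codes.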


      Pathology reports extracted from production systems will be tagged as XML to the level of the report section tags. The VHP-encoder will add the <code-section>, <c> , and any other concept-related tags at the end of each of the parsed free-text sections.



18. REVERSE BACKUS NAUR FORM PARSING.



The VHP-encoder is based upon the principles of computer translation (also known as machine translation) [Nagao, 1992; Hutchins, 1986]. Computer translation may be regarded as a branch of artificial intelligence [Bundy, 1997]. The fundamental operation of artificial intelligence is the production rule, X ==> Y, read, "X produces Y." Artificial intelligence production rules are akin to if-then statements in symbolic logic (X ==> Y, read, "if X then Y") [Lewis and Langford, 1959; Suppes, 1957; Tymoczko, 1998], and to set-inclusion statements in mathematical set theory (Y ⊂ X, read, "X includes Y") [Suppes, 1972; Tymoczko, 1998]. Backus-Naur Form (BNF) is a common notation in computational linguistics [Aitchison, 2000; Naur, 1960]. It was initially developed for designing computer languages, such as ALGOL, but was eventually employed for parsing well-controlled free-text English, and for translating this parsed text into structures with common data elements. For example:
  1. [] ==> [Nx]  
  2. [Nx] ==> [N]  
  3. [Nx] ==> [AN] 
  4. [Nx] ==> [NPN]  
where []=null-sentence, Nx=noun-phrase, N=noun, P=preposition, A=adjective, etc. In this simple BNF model, the null-sentence, [], points to a noun-phrase (Nx) (expression 1); and the noun-phrase, in turn, points to one of three choices: noun-only (N) (expression 2); adjective-noun (AN) (expression 3); or noun-preposition-noun (NPN) (expression 4). The term to the left of ==> is called the "argument"; the term to the right of ==> is called the "value." This simplified BNF supports surgical pathology diagnoses such as "hemangioma" (=N, expression 2); "actinic keratosis" (=AN, expression 3); or "adenocarcinoma of colon" (=NPN, expression 4). Perhaps a majority of surgical pathology diagnoses, particularly in a pathology practice with predominantly biopsy specimens, can be parsed with such a simple BNF model. However, many surgical pathology reports, especially in academic institutions with large resection specimens, are much more complex than would be suggested by this simple model. The VHP-encoder Perl script supports a BNF grammar of arbitrary size and complexity.

The VHP-encoder employs "Reverse Backus-Naur-Form Parsing," where the parser begins with the more complex expression (the value, right of the arrow ==> ), and works backward to the simpler expression (the argument, left of the arrow ==> ), until it reaches the null-sentence. If the backward-parse fails to reach back to the null-sentence, then the sentence is presumed to be ungrammatical.

The VHP-encoder was initially written on the basis of an examination of autopsy reports (Moore et al, 1988; Moore et al, 1989; Moore et al, 1996; Hutchins et al, 1999). The prototype design is publicly available, but individual lexicons and RBNF tables must be supplied by the user, as these tend to be highly domain-specific. The lexicon and UMLS maps for common words in JH-SP reports are shown below. Additional lexicons will be developed as part of the proposed project.

The basic design for the VHP-encoder is very simple, and has been implemented with a short Perl script. The null-sentence is denoted []. Each RBNF formula is represented as a sequence of upper-case and lower-case letters, corresponding to one-letter parts-of-speech, as shown in the part-of-speech table. The upper-case letter represents context, and the lower case letter represents reduction. By convention, the square-bracket sentence boundaries are regarded as upper-case. For example:
    adenocarcinoma of  colon metastatic to lung 
   [      N        P     N       A       P   N  ]
can be parsed with two RBNF steps, namely:
 [] ==> [NPN] 
 N] ==> NAPN] 
The BNF that reduces down the adjectival clause, namely, "metastatic to lung," is:
 N] ==> NAPN] 
That is, a N followed by ] may be expanded into NAPN] , by inserting the adjectival clause, APN , between N and ] . Conversely, the first RBNF production rule is denoted as "Napn]" in the Perl parsing table, to indicate that ...NAPN] may be reduced to ...N] . In the second reduction step, [NPN] is reduced to [], namely, the null-sentence, by the RBNF , [npn] . Thus, in two reduction steps, the initial sentence, [NPNAPN] is reduced to [] , the null-sentence, so that the initial sentence is deemed to be parsable. This method will be applied to all surgical pathology free-text reports, and likewise to all other free-text reports available to this project, such as clinic-visit notes and radiology reports.
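The two reduction steps can be sketched in Python as follows. This is an illustrative reading of the notation, not the authors' Perl script: upper-case letters and square brackets are treated as context that is kept, lower-case letters as material that is matched and removed, and curly braces are assumed to stand for "lower-case" square brackets. The function names are hypothetical.

```python
def compile_rbnf(rule):
    """Split an RBNF formula into (text to find, text to leave behind)."""
    lower_brackets = {"{": "[", "}": "]"}    # assumption: braces are removable brackets
    pattern, replacement = [], []
    for ch in rule:
        if ch in lower_brackets:             # removable bracket: match, then drop
            pattern.append(lower_brackets[ch])
        elif ch.islower():                   # removable part-of-speech: match, then drop
            pattern.append(ch.upper())
        else:                                # context: match and keep
            pattern.append(ch)
            replacement.append(ch)
    return "".join(pattern), "".join(replacement)

def reduce_sentence(sentence, rules, max_steps=100):
    """Apply the first matching formula repeatedly; return every intermediate form."""
    compiled = [compile_rbnf(rule) for rule in rules]
    steps = [sentence]
    for _ in range(max_steps):
        for pattern, replacement in compiled:
            if pattern in steps[-1]:
                steps.append(steps[-1].replace(pattern, replacement, 1))
                break
        else:
            break                            # no formula applies: stop
    return steps

steps = reduce_sentence("[NPNAPN]", ["Napn]", "[npn]"])
print(steps)
```

A sentence is deemed parsable when the last element of the returned list is the null-sentence, []. With the formula "Napn]" left out of the table, the same sentence cannot be reduced at all.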



19. FREQUENCY DISTRIBUTION OF BARRIER WORDS.



Barrier words that occur in more than 10,000 JH-SP accessions are listed as follows. In the JH-SP, 'and' appears in 222,175 cases; 'of' appears in 196,153 cases; 'with' appears in 189,799 cases; 'for' appears in 107,039 cases; 'the' appears in 104,067 cases, etc.
RANK	FREQUENCY   BARRIER WORD
   1      222,175   and
   2      196,153   of
   3      189,799   with
   4      107,039   for
   5      104,067   the
   6       82,104   note
   7       80,740   in
   8       78,549   right
   9       77,885   left
  10       70,923   is
  11       70,261   see
  12       67,917   are
  13       53,071   mild
  14       49,987   identified
  15       47,804   to
  16       41,467   consistent
  17       39,792   this
  18       30,352   present
  19       27,189   seen
  20       25,371   at
  21       25,097   there
  22       24,657   on
  23       24,284   or
  24       23,021   be
  25       21,243   associated
  26       19,515   was
  27       18,376   one
  28       16,122   but
  29       16,057   case
  30       16,057   from
  31       16,036   these
  32       15,672   show
  33       15,396   separate
  34       15,135   by
  35       13,776   as
  36       13,730   an
  37       13,542   has
  38       13,074   only
  39       12,615   shows
  40       11,735   portion
  41       11,487   involving
  42       10,803   two
  43       10,718   which
  44       10,448   features
  45       10,263   that
  46       10,119   low
  47       10,097   three




20. FREQUENCY DISTRIBUTION OF
MULTIPLE-WORD TERMS (COLLOCATIONS).



Multiple-word terms (collocations) discovered in the JH-SP file by the barrier word method are as follows:
RANK	FREQUENCY   COLLOCATION
   1       38,401   chronic inflammation
   2       20,328   lymph nodes
   3       18,428   diff quik
   4       16,104   soft tissue
   5       14,456   bone marrow
   6       13,104   non diagnostic
   7       13,021   diagnostic findings
   8       13,004   non diagnostic findings
   9       12,868   helicobacter pylori
  10       12,328   crypt distortion
  11       12,316   lymph node
  12       12,292   quik stain
  13       12,284   diff quik stain
  14       11,080   mild chronic
  15       10,229   epithelial changes
  16       10,004   fibroadipose tissue
  17        9,967   non specific
  18        9,052   left breast
  19        8,893   inflammatory disease
  20        8,741   gastroesophageal reflux
  21        8,234   gleason grade
  22        7,994   squamous metaplasia
  23        7,797   tubular adenoma
  24        7,237   reactive epithelial
  25        6,944   reactive epithelial changes
  26        6,793   active chronic
  27        6,714   granulation tissue
  28        6,634   seminal vesicles
  29        6,312   surgical margins
  30        6,199   lamina propria
  31        6,086   acute rejection
  32        6,038   fallopian tube
  33        6,019   cell metaplasia
  34        5,990   pelvic lymph
  35        5,932   chronic gastritis
  36        5,928   secretory endometrium
  37        5,790   hematopoietic elements
  38        5,648   bile reflux
  39        5,633   chronic cervicitis
  40        5,595   pelvic lymph nodes
  41        5,538   no helicobacter
  42        5,521   no helicobacter pylori
  43        5,422   anti inflammatory
  44        5,418   small bowel
  45        5,379   type indeterminate
  46        5,247   inflammatory drugs
  47        5,243   anti inflammatory drugs
  48        5,183   chronic inflammatory
  49        5,173   hernia sac
  50        5,003   antral mucosa




21. ZIPF GRAMMAR.



The 'Zipf grammar' paradigm is the idea that the sequence of parts-of-speech for a sentence, or "sentence-pattern," behaves like a word in the Zipf distribution. That is, the frequency of sentence-patterns may be ranked, and the high-frequency (low-rank) sentence-patterns are extremely frequent. The Perl script for calculating the Zipf Grammar Distribution works as follows:
  1. Obtain the microscopic diagnosis for an individual case.
  2. Drop the text within each sentence to all lower case.
  3. Translate each of the following into blank-space (ASCII 32): - _ @ # $ * " < >
  4. Translate each of the following to semicolon (ASCII 59): . : ! ?
  5. Translate each of the following to left-square bracket (ASCII 91): { (
  6. Translate each of the following to right-square bracket (ASCII 93): } )
  7. Translate all numerals into 0. Translate ;0 into 00.
  8. Translate 's into blank-space (ASCII 32).
  9. Translate ` ' into blank-spaces (ASCII 32).
  10. Translate ; into ][ .
  11. Intercalate a blank-space on either side of each punctuation mark.
  12. Collapse all redundant consecutive blanks into a single blank.
  13. Point each word or collocation to the corresponding parts-of-speech. A longer collocation always supersedes a shorter collocation.
  14. Arbitrarily balance the left and right parentheses.
  15. Break up the sequence at each ][ .
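The text-normalization part of this procedure, from lower-casing through the collapsing of blanks, can be sketched in Python. The sketch is not the authors' Perl script. It assumes that "numerals" means individual digit characters and that a "punctuation mark" is any character that is neither a letter, a digit, nor a blank; the part-of-speech lookup, the balancing, and the final break-up are omitted because they depend on the lexicon.

```python
import re

def normalize(diagnosis):
    """Apply the character-level translation steps to one microscopic diagnosis."""
    text = diagnosis.lower()
    text = text.translate(str.maketrans({c: " " for c in '-_@#$*"<>'}))
    text = text.translate(str.maketrans({c: ";" for c in ".:!?"}))
    text = text.translate(str.maketrans({"{": "[", "(": "[", "}": "]", ")": "]"}))
    text = re.sub(r"[0-9]", "0", text)           # every digit becomes 0 (assumption)
    text = text.replace(";0", "00")              # e.g. 9:00 -> 0;00 -> 0000
    text = text.replace("'s", " ")
    text = text.replace("`", " ").replace("'", " ")
    text = text.replace(";", "][")               # sentence boundary
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # blank on either side of punctuation
    return re.sub(r"\s+", " ", text).strip()     # collapse redundant blanks

print(normalize("Cervix (biopsy at 9:00): Negative for dysplasia."))
print(normalize("Diff-Quik stain"))
```

In the first example the unmatched brackets at the sentence boundaries are what the later balancing step would repair, giving [cervix [biopsy at 0000]] followed by [negative for dysplasia].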


The Zipf grammar for JH-SP is shown in the following table. There are 161 sentence-patterns that occur in at least 1,000 accessions of JH-SP. This table shows that the most-frequent sentence-pattern occurring in JH-SP is the sentence consisting of a single noun ( [N] , 423,177 occurrences ); the second-most-frequent sentence-pattern is the sentence consisting of noun[noun] ( [N[N]] , 106,034 occurrences ); the third-most-frequent sentence-pattern is the sentence consisting of adjective-noun ( [AN] , 98,958 occurrences ), etc. In the inventory database, there were 398,378 total distinct sentence-patterns: 56,664 distinct sentence-patterns occurring more than once, and 341,714 distinct sentence-patterns occurring exactly once. If one regards each singly-appearing sentence-pattern (a so-called hapax legomenon) as the grammatical equivalent of a misspelling, then there are 2,302,366 grammatical sentences in the JH-SP database (Wong and Gaynon, 1971; Joseph and Wong, 1979; Wong et al, 1980; Manning and Schuetze, 2000). In principle, every grammatical sentence-pattern should reduce to [], by repeated application of the VHP-encoder. A sentence that does not reduce to [] either lacks an appropriate RBNF formula available in the parsing table, has incomplete word-entries in the lexicon, or is ungrammatical. The commonly-occurring long sentences are particularly interesting. For example, there are 12,596 occurrences of the sentence-pattern [ANPAAN], as in "fibrous plaque from left carotid artery"; 12,136 occurrences of the sentence-pattern [BAANHA|VPAAN], as in "no helicobacter pylori organisms are identified on diff quik stain"; and 5,422 occurrences of the sentence-pattern [DNHA|VPDAAN], as in "this case was shown at the quality assurance conference."

The sentence-patterns occurring at least 5,000 times in the JH-SP database are as follows:
  RANK  FREQUENCY       SENTENCE-PATTERN   EXAMPLE 
     1    423,177                    [N]   hemangioma
     2    106,034                 [N[N]]   liver [needle]
     3     98,958                   [AN]   left foot
     4     85,908                  [N|V]   scar
     5     79,741                 [NN|V]   skin scar
     6     62,042                  [AAN]   epidermal inclusion cyst
     7     50,461                [AN[N]]   laryngeal mass [biopsy]
     8     41,958                  [NCN]   decidua and villi
     9     38,689                [A|NPN]   negative for actinomyces
    10     26,745               [N[NPN]]   cervix [biopsy at 9:00] 
    11     22,097                [N[NN]]   cervix [biopsy 9:00]
    12     21,704                 [NPAN]   skin of left ear
    13     21,102                   [NN]   ear lobe
    14     20,638                  [BAN]   non diagnostic findings
    15     16,864               [AAN[N]]   left chest wall [biopsy]
    16     13,674                 [AAAN]   left axillary soft tissue
    17     12,798              [NCAN[N]]   skin , left flank [biopsy]
    18     12,692                [ANCAN]   soft tissue , inguinal region
    19     12,596               [ANPAAN]   fibrous plaque from left carotid artery
    20     12,507   [N[N]ANCA|VANCA|NPN]   leg [ bka ] old thrombus and calcified atherosclerotic plaque , negative for osteomyelitis 
    21     12,136         [BAANHA|VPAAN]   no helicobacter pylori organisms are identified on diff quik stain
    22     12,097              [AAAN[N]]   left true vocal cord [biopsy]
    23     10,555                [ANPAN]   soft tissue of right wrist 
    24     10,257              [A|NPNCN]   negative for fungi or afb
    25      9,952                  [ANN]   left ear lobe
    26      9,650                [N[AN]]   colon [biopsy left]
    27      9,533               [N[NCN]]   cervix [biopsy , 9:00]
    28      8,732               [ANN[N]]   left ear lobe [biopsy]
    29      8,239                  [NAN]   skin right ear
    30      7,937                    [A]   void
    31      7,550                  [NPN]   biopsy at 9:00
    32      6,862               [NCN[N]]   skin, face [biopsy]
    33      6,675                 [ANPN]   left head of femur
    34      6,121                [NCNCN]   placenta, membranes and cord 
    35      6,111               [AN[NN]]   left breast [core biopsy]
    36      6,096               [N[NPA]]   colon [biopsy of left]
    37      5,728                [N[NA]]   colon [biopsy right] 
    38      5,685              [N[NPAN]]   colon [biopsy of right colon] 
    39      5,650             [ANCAN[N]]   soft tissue , left chest [excision]
    40      5,422          [DNHA|VPDAAN]   this case was shown at the quality assurance conference
    41      5,161                [NN[N]]   ear lobe [biopsy]
    42      5,072         [N[N]ANCA|NPN]


The objective of BNF processing of a Zipf grammar is to transform low-frequency sentence-patterns into high-frequency sentence-patterns. For example, the most-frequent sentence-pattern in JH-SP is [N] (a single noun), with 423,177 occurrences (example: "hemangioma"). The second-most-frequent sentence-pattern in JH-SP is [N[N]] (namely, noun[noun]), with 106,034 occurrences (example: "liver [ needle ]"). The second sentence-pattern is converted into the first sentence-pattern by the BNF:
N] ==> N[N]] ;
In RBNF notation, this formula is: N{n}] . The effect of this RBNF formula is to convert 106,034 occurrences of [N[N]] into the more-frequent sentence-pattern, [N], for a new total of: 423,177 + 106,034 = 529,211 occurrences of [N]. Then this BNF formula:
[] ==> [N];
[n] in RBNF notation, transforms the single-noun-sentence, [N], into the null-sentence, [], which is the stopping-point. Ultimately, the objective of the RBNF parser is to convert all sentences into the null-sentence.
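The bookkeeping in this example can be sketched in Python. The two counts are taken from the table above; the function name apply_reduction is an assumption, and each formula is written out as the text it finds and the text it leaves behind.

```python
from collections import Counter

# Sentence-pattern counts for the two most frequent JH-SP patterns.
counts = Counter({"[N]": 423177, "[N[N]]": 106034})

def apply_reduction(pattern_counts, find, leave):
    """Rewrite every sentence-pattern with one formula and merge the counts."""
    merged = Counter()
    for sentence_pattern, n in pattern_counts.items():
        merged[sentence_pattern.replace(find, leave, 1)] += n
    return merged

after_first = apply_reduction(counts, "N[N]]", "N]")      # formula N] ==> N[N]]
after_second = apply_reduction(after_first, "[N]", "[]")  # formula [] ==> [N]
print(after_first)
print(after_second)
```

After the first step all 529,211 sentences share the pattern [N]; after the second they all share the null-sentence.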

Application of a preliminary quantitative BNF table to the 2,302,366 presumed-parsable sentences yielded 1,907,372 complete parses down to the null-sentence (i.e., 82.8% = 1,907,372 / 2,302,366 of sentences are parsable from a simple BNF table). It is expected that this simple BNF table will be greatly enriched during the project period. The most commonly used formula, called 689,478 times by the parser, was [N] ==> [] , which reduces a noun-only sentence to a null-sentence. For example, the sentence: [liver] . The second-most-commonly used formula, called 313,234 times by the parser, was [AN] ==> [] , which reduces an adjective-noun sentence to a null-sentence. For example, the sentence: [actinic keratosis] . The third-most-commonly used formula, called 117,039 times by the parser, was [AAN] ==> [] , which reduces an adjective-adjective-noun sentence to a null-sentence. For example, the sentence: [hypertrophic actinic keratosis] . In training/test set exercises in Phase 3 of VHP-SPIN, certain parsing formulas will be withheld from the test set, in case they are never called by the training set.
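The tallies described here (how often each formula is called, and what fraction of sentences reach the null-sentence) can be sketched in Python. The three formulas and four sentence-pattern counts are taken from the tables in this section, but the subset is far too small to reproduce the 82.8% figure; the function name and the order in which formulas are tried are assumptions.

```python
from collections import Counter

# A few formulas, written in the reduction direction (complex form first).
FORMULAS = ["[N] ==> []", "[AN] ==> []", "[AAN ==> [N"]

# A few sentence-pattern counts from the Zipf grammar table.
PATTERN_COUNTS = {"[N]": 423177, "[AN]": 98958, "[AAN]": 62042, "[NCN]": 41958}

def parse_all(pattern_counts, formulas):
    """Return (calls per formula, sentences parsed, sentences seen)."""
    compiled = [(f, *(part.strip() for part in f.split("==>"))) for f in formulas]
    usage, parsed, total = Counter(), 0, 0
    for pattern, n in pattern_counts.items():
        total += n
        current, progressed = pattern, True
        while current != "[]" and progressed:
            progressed = False
            for name, find, leave in compiled:
                if find in current:
                    current = current.replace(find, leave, 1)
                    usage[name] += n         # one call per sentence with this pattern
                    progressed = True
                    break
        if current == "[]":
            parsed += n
    return usage, parsed, total

usage, parsed, total = parse_all(PATTERN_COUNTS, FORMULAS)
print(usage.most_common())
print(f"{parsed} of {total} sentences parsed ({100 * parsed / total:.1f}%)")
```

In this toy table the pattern [NCN] has no applicable formula and is left unparsed, which is how a gap in the parsing table shows up in the totals.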

The table of BNFs called at least 10,000 times by the parser is as follows:
  RANK   FREQUENCY      BNF FORMULA   EXAMPLE
     1     689,478       [N] ==> []   [prostate]
     2     313,234      [AN] ==> []   [actinic keratosis]
     3     117,039     [AAN] ==> []   [hypertrophic actinic keratosis]
     4      86,762     [N|V] ==> []   [scar]
     5      80,127    [NN|V] ==> []   [skin scar]
     6      66,816     [NAN] ==> []   [skin soft tissue]
     7      60,129     [NCN] ==> []   [decidua and villi]
     8      55,728       [AN ==> [N   [actinic KERATOSIS
     9      52,777     [A|N] ==> []   [negative]
    10      47,375      [NN] ==> []   [granulation tissue]
    11      47,139       [A] ==> []   [void]
    12      42,661     [NPN] ==> []   [adenocarcinoma of colon]
    13      36,076    [AAAN] ==> []   [focal bowenoid actinic keratosis]
    14      31,946    [NPAN] ==> []   [skin with actinic keratosis]
    15      25,168     [BAN] ==> []   [focally invasive tumor]
    16      22,761    [NCAN] ==> []   [ulcer and acute inflammation]
    17      22,276     [ANN] ==> []   [exuberant granulation tissue]
    18      16,791       [NN ==> [N   [lung CARCINOMA
    19      15,577    [NAPN] ==> []   [carcinoma metastatic to lung]
    20      13,764     [NNN] ==> []   [liver gallbladder pancreas]
    21      13,212      [AAN ==> [N   [hypertrophic actinic KERATOSIS
    22      12,417     [BAN ==> [BN   [FOCALLY active GASTRITIS
    23      12,053      [NCN ==> [N   [decidua and VILLI




22. POTENTIAL PITFALLS.



There are well-established pitfalls of quantitative natural language processing (QNLP), described in the literature (Hutchins, 1986; Manning and Schuetze, 2000). These potential pitfalls tend to be counterbalanced by focused efforts on the part of pathologists to prepare clear reports that could not be misinterpreted by their intended audience, namely, other physicians caring for the patient, as well as by potentially adversarial readers, such as malpractice attorneys.

      The most significant problem mentioned in the quantitative natural language processing literature is the difficulty in obtaining a reasonably error-free source-text, or corpus. As shown by the initial linguistic inventory, this difficulty is not expected to be a major one for this project. That is, the frequency of misspellings is estimated to be less than 1%, and in an attempt to encode the entire database with a small, preliminary syntactic model, over 80% of all likely-grammatical sentences had a successful parse. It is expected that this parsing performance can be greatly improved over the duration of the project.

      Another significant problem encountered in quantitative parsing is that, although over half the word-occurrences in a large text-corpus are covered by defining a few hundred distinct words and collocations (the high-frequency end of the Zipf distribution), the list of rare words is seemingly endless [Nagao, 1992]. As long as the project is active, we can probably keep up with adding new words and collocations, but the consortium must eventually develop a mechanism for adding new words and terms. Since the UMLS is updated annually by the U. S. National Library of Medicine, additional synonyms could possibly be forwarded to this agency. See DISCOVERY METHODS, above.

Yet another problem described in the QNLP literature is conjunctions that connect components of a sentence with unclear boundaries, as for example, run-on sentences connected by 'and'. Conjunction-connectors could possibly represent a significant problem in the JH-SP corpus, because conjunctions are quite common ('and' appears in 222,175 of 361,957 cases, 61.4%). However, it is our perception that these conjunctions are largely written in an orderly manner, since over 80% of sentences are parsable. High-frequency collocations involving potential conjunction-error, including terms such as 'acute and chronic inflammation' or 'infiltrating and intraductal carcinoma', can be discovered and managed with suitable entries in the encoding lexicon.

A theoretical problem that has received considerable attention in the machine translation literature is unclear pronoun-references [Hutchins, 1986]. This is particularly burdensome for managing languages such as French, where pronouns must have a gender that matches the reference noun. We expect pronoun-reference problems to be relatively minor, since pronouns make up less than 1% of the JH-SP text-corpus. We attribute this fact to careful pathologists, who tend to avoid pronouns in order to limit the possible ambiguity in their reports.

Another theoretical problem that has received attention in the machine translation literature is the resolution of ambiguities. There are numerous ambiguous single-words in a surgical pathology report, but often these ambiguities may be resolved by the immediately surrounding words, as for example, "uterine adnexa," "gastric fundus," "cervical vertebra." Thus, many ambiguities can be resolved with robust collocation-discovery-algorithms, as discussed in the DISCOVERY METHODS section, above. In other contexts, ambiguous pathology words, such as "adnexa" (skin, uterus, eye), "fundus" (stomach, uterus, eye), "cervical" (uterus, spine, neck), etc., may not be resolvable from the immediately surrounding words, but may be resolvable by other words in the same specimen, such as a skin-word or an eye-word in the same specimen as the word "adnexa." This approach is called thesaurus-based disambiguation (Manning and Schuetze, 2000). The Consortium will select a group of potentially ambiguous words that might confuse the surgical pathology encoder, and appropriate sensitivity/specificity tests will be conducted.
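Resolution by other words in the same specimen can be sketched in Python as a count of cue words per candidate sense. The cue-word lists and the function name are hypothetical; a working version would presumably draw its cue words from a thesaurus such as the UMLS rather than from a hand-written table.

```python
# Hypothetical cue words for each sense of an ambiguous word.
SENSE_CUES = {
    "adnexa": {
        "skin": {"skin", "epidermis", "dermis"},
        "uterus": {"uterus", "ovary", "fallopian"},
        "eye": {"eye", "eyelid", "conjunctiva"},
    },
}

def disambiguate(word, specimen_text, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words occur most often in the specimen text.

    Returns None when no cue word is present or when two senses tie."""
    tokens = set(specimen_text.lower().split())
    scores = {sense: len(cues & tokens) for sense, cues in sense_cues[word].items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

print(disambiguate("adnexa", "skin , left cheek ; adnexa with chronic inflammation"))
print(disambiguate("adnexa", "adnexa , biopsy"))
```

Returning None for unresolved cases keeps them visible, so they can be counted in the sensitivity/specificity tests rather than silently assigned a default sense.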

In summary, the VHP-encoder translates coherent surgical-pathology free-text into a standardized coding language, formatted as XML. The VHP-encoder has been tested initially on the Johns Hopkins surgical pathology free-text corpus. It is expected that the general principles will be applicable across institutions, to other medical free-text available to the VHP-project, and to query free-text.



23. REFERENCES.



1. Aitchison J.
Teach Yourself Linguistics. Fifth Edition.
Chicago: NTC/Contemporary Publishing Co. 2000.
ISBN: 0844226688.

2. Bundy A, ed.
Artificial Intelligence Techniques: A Comprehensive Catalogue. Fourth, Revised Edition.
Heidelberg: Springer Verlag. 1997.
ISBN: 3540593233.

3. Chomsky N.
Morphophonemics of Modern Hebrew.
Undergraduate Honors Essay. University of Pennsylvania. 1949. Cited in: Newmeyer FJ. Generative Linguistics. A historical Perspective. London: Routledge. 1996.

4. Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: The MIT Press. 1965.

5. Chomsky N.
Language and Mind.
San Diego: Harcourt Brace Jovanovich. 1968.

6. Fedorowicz J.
A Zipfian model of an automatic bibliographic system: An application to MEDLINE.
J Am Soc Info Sci 1982;33:223-232.

7. Giere W.
Foundations of clinical data automation in cooperative programs.
Proc 5th Ann Symp Comp Applic Med Care. 1981;5:1142-1148.

8. Hutchins WJ.
Machine Translation: Past, Present, Future.
Ellis Horwood/Wiley, Chichester/New York. 1986. Ellis Horwood Series in Computers and Their Applications. ASIN: 0135435218.

9. Hutchins GM, Berman JJ, Moore GW, Hanzlick R, the Autopsy Committee of the College of American Pathologists.
Practice Guidelines for Autopsy Pathology.
Arch Pathol Lab Med. 1999; 123:1085-1092.

10. Joseph DM, Wong RL.
Correction of misspellings and typographical errors in a free-text medical English information storage and retrieval system.
Methods Inf Med. 1979 Oct;18(4):228-234.

11. Justeson JS, Katz SM.
Technical terminology: some linguistic properties and an algorithm for identification in text.
Natural Language Engineering. 1995;1:9-27.

12. Kucera H, Francis WN.
Computational Analysis of Present-Day American English.
Providence, RI: Brown University Press. 1967.

13. Lewis CI, Langford CH.
Symbolic Logic. Second Edition.
New York: Dover Publications, Inc. 1932.

14. Mandelbrot B.
Structure formelle des textes et communication.
Word 1954;10:1-27.

15. Manning CD, Schütze H.
Foundations of Statistical Natural Language Processing.
Cambridge, MA: The MIT Press. 2000.
ISBN: 0262133601, 680 pages.
http://www-nlp.stanford.edu/fsnlp/intro/

16. Masarie FE jr, Miller RA, Bouhaddou O, Guise NB, Warner HR.
An Interlingua for Electronic Interchange of Medical Information: Using Frames to Map Between Clinical Vocabularies.
Comp Biomed Res 1991; 24(4):379-400.

17. Moore GW, Boitnott JK, Miller RE, Eggleston JC, Hutchins GM.
Integrated anatomic pathology reporting system using natural language diagnoses.
Modern Pathol 1988;1:44-50.

18. Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the Barrier Word Method.
In: Scherrer JR, Cote RA, Mandil SH, eds. Computerized Natural Medical Language Processing for Knowledge Representation. North-Holland. 1989:29-39.

19. Moore GW, Berman JJ, Hanzlick RL, Buchino JJ, Hutchins GM.
A prototype internet autopsy database: 1625 consecutive fetal and neonatal autopsy facesheets spanning twenty years.
Arch Pathol Lab Med. 1996;120:782-785.
http://www.netautopsy.org/protoiad.htm

20. Moore GW, Berman JJ.
Anatomic Pathology Data Mining.
In: Cios KJ, ed. Medical Data Mining and Knowledge Discovery. Heidelberg: Springer Verlag. 2000
http://www.netautopsy.org/apdmchap.htm

21. Nagao M.
Machine Translation.
In: Shapiro SC, ed. Encyclopedia of Artificial Intelligence. Volume 2. M-Z. New York: Wiley-Interscience. 1992;2:898-902.

22. Naur P.
Revised Report on the Algorithmic Language ALGOL 60.
Comm ACM, 1960 May; 3(5):299-314.

23. Nelson SJ, Olson NE, Fuller L, Tuttle MS, Cole WG, Sherertz DD.
Identifying concepts in medical knowledge.
Medinfo. 1995;8:33-36.

24. Newmeyer FJ.
Generative Linguistics. A historical Perspective.
London: Routledge. 1996.

25. Salton G, McGill MJ.
Introduction to modern information retrieval.
New York: McGraw-Hill. 1983.

26. Salton G, Buckley C.
Global text matching for information retrieval.
Science. 1991;253:1012-1015.

27. Sawyer R, Berman JJ, Borkowski A, Moore GW.
Elevated prostate-specific antigen levels in black men and white men.
Mod Pathol. 1996 Nov;9(11):1029-1032.
http://www.netautopsy.org/elevpsal.htm

28. Suppes P.
Introduction to Logic.
New York: Van Nostrand. 1957.

29. Suppes P.
Axiomatic Set Theory.
New York: Dover Publications. 1972.
ISBN: 0486616304.

30. Taylor M, Saltz J, Nichols JH.
Design of an Integrated Clinical Data Warehouse.
J Assn Lab Automation. 2000. in press.

31. Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc 12th Annu Symp Comput Appl Med Care. 1988;12.

32. Tymoczko T, ed.
New Directions in the Philosophy of Mathematics.
Princeton, NJ: Princeton University Press. 1998.

33. U.S. National Library of Medicine.
Unified Medical Language System.
http://www.nlm.nih.gov/research/umls/

34. U. S. National Library of Medicine.
UMLS Knowledge Sources. Eleventh Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 2000.

35. U. S. National Library of Medicine.
UMLS Knowledge Sources. Tenth Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 1999.

36. U. S. National Library of Medicine.
UMLS Knowledge Sources. Ninth Edition. Unified Medical Language System.
U. S. Department of Health and Human Services. National Institutes of Health. National Library of Medicine. 1998.

37. Wilbur WJ.
Overview of Books at NCBI.
http://www.ncbi.nlm.nih.gov:80/books/mboc/bookshelp/bookover.html#link

38. Wong RL, Gaynon P.
An automated parsing routine for diagnostic statements of surgical pathology reports.
Methods Inf Med. 1971 Jul;10(3):168-175.

39. Wong RL, Reno JD, Hain TC, Platt RC, Gaynon PS, Joseph DM.
Profile of a dictionary compiled from scanning over one million words of surgical pathology narrative text.
Comput Biomed Res. 1980 Aug;13(4):382-398.

40. Zhang Q.
Easy entry of Chinese character set symbols.
Proc 5th Annu Symp Comput Appl Med Care. 1981;5:143-149.

41. Zipf GK.
Human Behavior and The Principle of Least Effort. An Introduction to Human Ecology.
Reading, MA: Addison-Wesley Press. 1949;:19-55.



24. ADDITIONAL SUGGESTED READINGS.



Li W.
Zipf's Law Bibliography.
http://www.nslij-genetics.org/wli/zipf/

Moore GW, Miller RE, Hutchins GM, Riede UN, Polacsek RA.
Multilingual translation techniques in the analysis of narrative medical text.
Proc Annu Symp Comput Appl Med Care. 1985;9:. November 10-13, 1985, Baltimore, MD.

Moore GW, Miller RE, Hutchins GM.
Microcomputer translator for medical text: Theorem verification for Chapter Two of Zeman's Modal Logic.
Adv Math Comput Med. 7:1621-1633, 1986.

Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Automated translation of German to English medical text.
Am J Med. 1986 Jul;81(1):103-111.
PMID: 3755289.

Moore GW, Riede UN, Polacsek RA, Miller RE, Hutchins GM.
Group theory approach to computer translation of medical German.
Methods Inf Med. 1986 Jul;25(3):176-182.
PMID: 3755498.

Moore GW, Polacsek RA, Erozan YS, de la Monte SM, Miller RE, Hutchins GM, Riede UN.
Multilingual translation techniques in the analysis of narrative medical text.
Comput Methods Programs Biomed. 1986 Mar;22(1):35-42.
PMID: 3634670.

Yu CC-Y, Moore GW, Unschuld PU.
Romanized Chinese respelling rules for an English medical word list.
Proc Annu Symp Comput Appl Med Care. 1987;11:. Washington DC, November 1-4, 1987.

Moore GW, Hutchins GM, Boitnott JK, Miller RE, Polacsek RA.
Word root translation of 45,564 autopsy reports into MeSH titles.
Proc Annu Symp Comput Appl Med Care. 1987;11:. Washington DC, November 1-4, 1987.

Tersmette KWF, Scott AF, Moore GW, Matheson NW, Miller RE.
Barrier word method for detecting molecular biology multiple word terms.
Proc Annu Symp Comput Appl Med Care. 1988;12:207-211. Washington DC, November 6-9, 1988.

Moore GW, Miller RE, Hutchins GM.
Indexing by MeSH titles of natural language pathology phrases identified on first encounter using the barrier word method.
In: Scherrer JR, Côté RA, and Mandil SH, eds., Computerized Natural Medical Language Processing for Knowledge Representation. North-Holland, Amsterdam, 1989.

Moore GW, Wakai I, Satomura Y, Giere W.
TRANSOFT: Medical translation expert system.
Artif Intell Med 1:149-157, 1989.

Moore GW.
TRANSOFT: Public-domain English-to-SNOMED computer translation shell, using the DVA File Manager. Abstract.
Mod Pathol. 4:123A, 1991.

Moore GW.
Medical Expert System User Interface. Editorial.
Artif Intell Med. 1991:15;.

Sorace JM, Berman JJ, Carnahan GE, Moore GW.
PRELOG: precedence logic inference software for blood donor deferral.
Proc Annu Symp Comput Appl Med Care. 1991;:976-977.
PMID: 1807774.

Berman JJ, Moore GW.
Object-oriented controlled-vocabulary translator using TRANSOFT + HyperPAD.
Proc Annu Symp Comput Appl Med Care. 1991;15:973-975.
PMID: 1807773.

Moore GW, Berman JJ.
Anatomic Pathology Data Mining.
Chapter 4. In: Cios KJ. Medical Data Mining and Knowledge Discovery. Berlin: Springer Verlag. 2000;4:61-107.
ISBN: 3-7908-1340-0, 502 pages.
Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company.

Cios KJ, Moore GW.
Medical Data Mining and Knowledge Discovery: Overview.
Chapter 1. In: Cios KJ. Medical Data Mining and Knowledge Discovery. Berlin: Springer Verlag. 2000;1:1-16.
ISBN: 3-7908-1340-0, 502 pages.
Published within the series: "Studies in Fuzziness and Soft Computing", Physica-Verlag Heidelberg, a Springer-Verlag Company.

Berman JJ.
Tumor classification: molecular analysis meets Aristotle.
BMC Cancer. 2004 Mar 17;4:10.
PMID: 15113444.

Pacak MG, Pratt AW.
Identification and transformation of terminal morphemes in medical English part II.
Methods Inf Med. 1978 Apr;17(2):95-100.
PMID: 661609.

Dunham GS, Pacak MG, Pratt AW.
Automatic indexing of pathology data.
J Am Soc Inf Sci. 1978 Mar;29(2):81-90.
PMID: 10318395.

Pratt AW.
Interactive data processing in the medical research institution.
Methods Inf Med Suppl. 1976;10:65-76.
PMID: 1078477.

Bengtsson S, Schneider W, Spencer WA, Pratt AW, Kastner VV, Reichertz P, Lamson BG, Anderson J.
The application of computer techniques in health care.
World Hosp. 1976;12(1):47-51.
PMID: 1024332.

Graepel PH, Henson DE, Pratt AW.
Comments on the use of the Systematized Nomenclature of Pathology.
Methods Inf Med. 1975 Apr;14(2):72-75.
PMID: 1207468.

Pratt AW, Pacak M.
Identification and transformation of terminal morphemes in medical English.
Methods Inf Med. 1969 Apr;8(2):84-90.
PMID: 5819388.

Sager N, Lyman M, Nhan NT, Tick LJ.
Medical language processing: applications to patient data representation and automatic encoding.
Methods Inf Med. 1995 Mar;34(1-2):140-146.
PMID: 9082123.

Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ.
Natural language processing and the representation of clinical data.
J Am Med Inform Assoc. 1994 Mar-Apr;1(2):142-160. Review.
PMID: 7719796.

Sager N, Lyman M, Nhan NT, Tick LJ.
Automatic encoding into SNOMED III: a preliminary investigation.
Proc Annu Symp Comput Appl Med Care. 1994;:230-234.
PMID: 7949925.

Sager N, Lyman M, Tick LJ, Nhan NT, Bucknall CE.
Natural language processing of asthma discharge summaries for the monitoring of patient care.
Proc Annu Symp Comput Appl Med Care. 1993;:265-268.
PMID: 8130474.

Lyman M, Sager N, Tick L, Nhan N, Borst F, Scherrer JR.
The application of natural-language processing to healthcare quality assessment.
Med Decis Making. 1991 Oct-Dec;11(4 Suppl):S65-S68.
PMID: 1770852.

Borst F, Lyman M, Nhan NT, Tick LJ, Sager N, Scherrer JR.
TEXTINFO: a tool for automatic determination of patient clinical profiles using text analysis.
Proc Annu Symp Comput Appl Med Care. 1991;:63-67.
PMID: 1807679.

Chi EC, Sager N, Tick LJ, Lyman MS.
Relational data base modelling of free-text medical narrative.
Med Inform (Lond). 1983 Jul-Sep;8(3):209-223.
PMID: 6600043.

Sager N, Wong R.
Developing a database from free-text clinical data.
J Clin Comput. 1983;11(5-6):184-194.
PMID: 10278191.

Sager N, Bross ID, Story G, Bastedo P, Marsh E, Shedd D.
Automatic encoding of clinical narrative.
Comput Biol Med. 1982;12(1):43-56.
PMID: 7075165.

Hirschman L, Story G, Marsh E, Lyman M, Sager N.
An experiment in automated health care evaluation from narrative medical records.
Comput Biomed Res. 1981 Oct;14(5):447-463.
PMID: 7273723.

Wingert F.
Medical linguistics: automated indexing into SNOMED.
Crit Rev Med Inform. 1988;1(4):333-403.
PMID: 3288353.

Wingert F.
Automated indexing of SNOMED statements into ICD.
Methods Inf Med. 1987 Jul;26(3):93-98.
PMID: 3670105.

Wingert F.
An indexing system for SNOMED.
Methods Inf Med. 1986 Jan;25(1):22-30.
PMID: 3753739.

Wingert F.
Morphologic analysis of compound words.
Methods Inf Med. 1985 Jul;24(3):155-162.
PMID: 4033445.

Wingert F.
Automated indexing based on SNOMED.
Methods Inf Med. 1985 Jan;24(1):27-34.
PMID: 3982279.

Wingert F.
[Morphosyntactical analysis of compound word forms in medical language]
Methods Inf Med. 1977 Oct;16(4):248-255. German.
PMID: 337050.

Wingert F, Ries P.
[Pathology findings system]
Methods Inf Med. 1973 Jul;12(3):150-155. German.
PMID: 4729117.

Wingert F.
[PAULA: program for evaluation of logical expressions. Plausibility-control and evaluation of optical mark reader forms]
Methods Inf Med. 1972 Apr;11(2):96-103.
PMID: 5026579.

Salton G.
Experiments in automatic thesaurus construction for information retrieval.
In: Proceedings IFIP Congress, 1971;:43-49.

Salton G, ed.
The Smart Retrieval System - Experiments in Automatic Document Processing.
Englewood Cliffs, NJ: Prentice-Hall. 1971;:.

Salton G.
Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer.
Reading, MA: Addison Wesley. 1989;:.

Salton G, Allan J, Buckley C, Singhal A.
Automatic analysis, theme generation and summarization of machine-readable texts.
Science 1994;264:1421-1426.

Salton G, Allan J.
Selective text utilization and text traversal.
In: Proceedings of ACM Hypertext 93, New York.
New York: Association for Computing Machinery. 1993;:.

Salton G, Buckley C.
Global text matching for information retrieval.
Science 1991;253:1012-1015.

Salton G, Fox EA, Wu H.
Extended boolean information retrieval.
Communications of the ACM 1983;26:1022-1036.

Salton G, McGill MJ.
Introduction to modern information retrieval.
New York: McGraw-Hill. 1983;:.

Salton G, Buckley C, Fox EA.
Automatic query formulations in information retrieval.
J Am Soc Inf Sci. 1983 Jul;34(4):262-280.
PMID: 10299297.

Salton G.
Automatic text analysis.
Science. 1970 Apr 17;168(929):335-343.
PMID: 5435890.

Minsky M, Hillis D, Rudisch G.
Artificial intelligence.
N Engl J Med. 1980 Jun 26;302(26):1482.
PMID: 7374720.

Zipf GK.
Relative frequency as a determinant of phonetic change.
Harvard Studies in Classical Philology 1929;40:1-95.

Zipf GK.
The Psycho-Biology of Language.
Boston, MA: Houghton Mifflin. 1935;:.

Zipf GK.
Human Behavior and the Principle of Least Effort.
Cambridge, MA: Addison-Wesley. 1949;:.

Fitch WT, Hauser MD, Chomsky N.
The evolution of the language faculty: Clarifications and implications.
Cognition. 2005 Sep;97(2):179-210.
PMID: 16112662.

Chomsky N.
Universals of human nature.
Psychother Psychosom. 2005;74(5):263-268.
PMID: 16088263.

Hauser MD, Chomsky N, Fitch WT.
The faculty of language: what is it, who has it, and how did it evolve?
Science. 2002 Nov 22;298(5598):1569-1579. Review.
PMID: 12446899.

Chomsky N.
The development of grammar in child language: Formal discussion.
Monogr Soc Res Child Dev. 1964;29:35-39.
PMID: 14125365.

Chomsky N.
Syntactic Structures.
The Hague: Mouton. 1957;:.

Chomsky N.
Aspects of the Theory of Syntax.
Cambridge, MA: MIT Press. 1965;:.

Chomsky N.
Rules and Representations.
New York: Columbia University Press. 1980;:.

Chomsky N.
Knowledge of Language: Its Nature, Origin, and Use.
New York: Praeger. 1986;:.

Chomsky N.
The Minimalist Program.
Cambridge, MA: MIT Press. 1995;:.

Suppes P.
Probabilistic grammars for natural languages.
Synthese 1970;22:95-116.

Suppes P.
Probabilistic Metaphysics.
Oxford: Blackwell. 1984;:.

Suppes P, Bottner M, Liang L.
Machine learning comprehension grammars for ten languages.
Computational Linguistics. 1996;22:329-350.

Wittgenstein L.
Philosophical Investigations [Philosophische Untersuchungen]. Third edition.
Oxford: Basil Blackwell. 1968;:.

Last updated: 2/22/2008, by G. William Moore, MD, PhD.