
Detecting Privacy-Sensitive Events
Department of Computer Science, UIUC
{jindal2, danr, cgunter}
September 20, 2013

Recent US government initiatives have led to wide adoption of Electronic Health Records (EHRs). More and more health care institutions are storing patients' data in an electronic format. This emerging practice poses several security-related risks because electronic data can easily be shared within and across institutions. It is therefore important to design robust frameworks that protect patients' privacy. In this report, we present a method to detect security-related (particularly drug abuse) events in medical text. Several applications can use this information to make hospital systems more secure. For example, portions of clinical reports that describe critical events can be encrypted so that they can be viewed only by selected individuals.
While dealing with clinical narratives, there are several privacy concerns. Clinical narratives often contain sensitive information about the patients. In a hospital system, clinical narratives need to be visible to many people so that they can perform their respective functions. Sometimes, it is also necessary to share the clinical narratives among hospital systems. It is important that the privacy of patients be respected while sharing such information across hospital systems.
There are several types of sensitive data that are found in the clinical narratives. We categorize the sensitive data into 5 major types below:
1. Mental health and abuse in the family
4. Genomic data; indication of genetic information in EHRs
5. Sexually transmitted diseases
However, in the data that we have, we found a significant number of cases only for drug abuse. We did not find a sufficient number of cases for the other 4 types. So, in this study, we restrict ourselves to the cases of drug abuse.
Drug Abuse
Wikipedia gives the following definition of drug abuse, which is consistent with the definitions of drug abuse found in medical sources like MedlinePlus etc.

Substance abuse, also known as drug abuse, is a patterned use of a substance (drug) in which the user consumes the substance in amounts or with methods neither approved nor supervised by medical professionals. Substance abuse/drug abuse is not limited to mood-altering or psycho-active drugs. If an activity is performed using the objects against the rules and policies of the matter (as in steroids for performance enhancement in sports), it is also called substance abuse.
In this chapter, the following 3 things will be addressed:
1. To identify the concepts related to drug abuse.
2. To identify the assertion status (positive or negative) of concepts.
3. To identify whether the concept belonged to the patient.
Datasets for Experiments
For our experiments, we used the clinical narratives made available by the i2b2 team as part of the 2011 i2b2/VA coreference challenge. These clinical narratives came from 2 institutions: (a) Partners HealthCare, Boston and (b) Beth Israel Deaconess Medical Center.
The data was annotated by 2 annotators, one of whom was a medical expert. We now report the inter-annotator agreement (IAA) on 10 documents. There were a total of 57 concepts related to drug abuse in the data that we selected.
For concept extraction, there was disagreement over 4 cases. So, IAA for concept extraction = 92.9%. For determining assertions, there was disagreement over 3 cases. All the cases of disagreement were related to mild alcohol usage. So, IAA for assertion detection = 94.7%. Finally, we decided that all cases of drug abuse (whether mild or strong) should be annotated as positive. For determining the experiencer of the drug abuse event, there was total agreement. So, IAA = 100.0%.
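The agreement figures above are consistent with simple percent agreement over the 57 annotated concepts. The following is a minimal sketch under that assumption; the function name `percent_agreement` is illustrative and does not come from the report.

```python
def percent_agreement(total: int, disagreements: int) -> float:
    """Share of items, in percent, on which two annotators agree."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= disagreements <= total:
        raise ValueError("disagreements must lie between 0 and total")
    return 100.0 * (total - disagreements) / total


# Counts reported in the text: 57 concepts with 4, 3 and 0 disagreements.
for task, missed in [("concept extraction", 4),
                     ("assertion detection", 3),
                     ("experiencer detection", 0)]:
    print(f"{task}: {percent_agreement(57, missed):.2f}%")
```

Under this reading, 53 of 57 agreements is about 92.98%, which the report gives as 92.9%, and 54 of 57 is about 94.74%.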
Since we had very limited data, we decided to use semi-supervised methods for finding drug-abuse events. We reserved all the annotated data for testing.
In the next few subsections, we describe the methodology that we used.
Concept identification was done using dictionary lookup. We compiled a list of substances commonly used for drug abuse from web sources. Next, we obtained all the phrases appearing in the clinical narratives using a shallow parser. All phrases that contained any term located in the drug-abuse dictionary were considered to be drug-abuse events.
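The dictionary lookup described above can be sketched as follows. This is an illustration only: the dictionary is a toy subset, the phrases are written by hand rather than produced by a shallow parser, and whole-word, case-insensitive matching is an assumption about how "contained any term" is decided.

```python
import re

# Toy dictionary (assumption); the list compiled for the report is longer.
DRUG_ABUSE_TERMS = {"cocaine", "marijuana", "alcohol", "iv drug"}


def find_drug_abuse_phrases(phrases, dictionary=DRUG_ABUSE_TERMS):
    """Return the phrases that contain any dictionary term as whole words."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in dictionary
    ]
    return [p for p in phrases if any(rx.search(p) for rx in patterns)]


# Phrases as a shallow parser might emit them (hand-written here).
phrases = ["The patient", "any IV drug use", "cocaine use", "last 2 months"]
print(find_drug_abuse_phrases(phrases))
```

A lookup of this kind can only be as complete as its dictionary, which matches the recall problem discussed in the error analysis.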
We adapted 3 state-of-the-art expert systems to find the assertion status of the concepts. Below, we describe these three systems in more detail.
The first system is an implementation of the ConText algorithm [1, 2] by Imre Solti. It first identifies the trigger words for negation. Consider the following sentence as an example:
The patient denies any IV drug use but did describe cocaine use for last 2 months.
In the above sentence, ‘denies’ is the trigger word for negation. It is important to note that the algorithm differentiates between pseudo-triggers (like ‘no increase’, ‘not cause’ etc.) and the actual trigger words.
After determining the triggers, the algorithm determines the scope of the trigger words. The scope of a trigger word generally starts from the word to the right of the trigger and extends to the end of the sentence. But certain termination words (like ‘but’ in the above example) can cause the scope of a trigger to end early. Also, for certain triggers, the scope lies to the left of the trigger instead of the right. For example, consider the following sentence:
Lung injury was ruled out by the MRI exam.
In the above sentence, the scope of ‘was ruled out’ is ‘Lung injury’.
Then, if a concept falls within the scope of some trigger word for negation, its assertion status is changed to negative.
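The trigger, scope and termination rules described above can be illustrated with a small sketch. It is a simplified illustration and not the adapted implementation: the trigger lists contain only the examples mentioned in the text plus ‘no’, and all function names are hypothetical.

```python
import re

# Tiny illustrative lists (assumptions); real ConText lists are much longer.
FORWARD_TRIGGERS = ["denies", "no"]             # scope runs to the right
BACKWARD_TRIGGERS = ["was ruled out"]           # scope runs to the left
PSEUDO_TRIGGERS = ["no increase", "not cause"]  # resemble triggers, are not
TERMINATORS = {"but"}                           # end a forward scope early


def tokenize(text):
    """Lowercase word tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_spans(tokens, words):
    """All (start, end) index pairs where `words` occurs in `tokens`."""
    n = len(words)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]


def negated_positions(tokens):
    """Indices of tokens that fall inside some negation scope."""
    pseudo = set()
    for phrase in PSEUDO_TRIGGERS:
        for start, end in find_spans(tokens, tokenize(phrase)):
            pseudo.update(range(start, end))
    negated = set()
    for trigger in FORWARD_TRIGGERS:
        for start, end in find_spans(tokens, tokenize(trigger)):
            if start in pseudo:
                continue  # part of a pseudo-trigger such as "no increase"
            for i in range(end, len(tokens)):
                if tokens[i] in TERMINATORS:
                    break
                negated.add(i)
    for trigger in BACKWARD_TRIGGERS:
        for start, _ in find_spans(tokens, tokenize(trigger)):
            negated.update(range(start))
    return negated


def is_negated(sentence, concept):
    """True if some mention of the concept lies wholly inside a scope."""
    tokens, words = tokenize(sentence), tokenize(concept)
    if not words:
        return False
    scope = negated_positions(tokens)
    return any(all(i in scope for i in range(s, e))
               for s, e in find_spans(tokens, words))


s = "The patient denies any IV drug use but did describe cocaine use for last 2 months."
print(is_negated(s, "IV drug use"), is_negated(s, "cocaine use"))
```

On the example sentence, the scope of ‘denies’ stops at ‘but’, so only the first concept is marked as negated.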
The second system has an implementation similar to that of Callkit. However, it uses slightly different lists of trigger words.
The third system, just like the ConText algorithm, keeps a list of trigger words and identifies the scope of the trigger words. However, it addresses the issue that there may be multiple trigger words whose scope spans the concept. To resolve such cases, it maintains a score for all possible categories. Whenever the concept falls under the scope of some trigger word, it updates the score of the corresponding category. Finally, the category with the maximum score wins. The scoring formula used in our implementation depends on the distance between the concept and the trigger word, because intuitively a trigger word that is close to a concept is more likely to be associated with it. The window size in the formula was chosen to be 3.

Table 1: This table shows the performance of concept extraction for drug-abuse concepts.
Table 2: This table compares the performance of three systems for negation and experiencer detection for drug-abuse concepts.
Patient or not
All 3 systems described above give information about the experiencer of the event as well. The mechanism used to identify the experiencer is exactly the same as described for determining the assertion.
Table 1 shows the results for concept identification in terms of Precision, Recall and F1 scores. We see from this table that although we achieved very high precision, recall is somewhat low.
Table 2 gives the results for assertion and experiencer determination for correctly identified concepts. We find that all systems perform quite well in detecting negation and experiencer. Utah and MSRA performed the best.
We can note from the above section on results that our system has a somewhat lower recall for concept identification. This is because the list of substances used for drug abuse that we generated was not comprehensive enough. In our list, we included the commonly used drug-abuse substances. However, the error analysis showed that several other substances are also used for drug abuse. Some of the concepts that we missed include the following: codeine, morphine sulphate, etoh, IVDU, drug use, drunk heavily, illicit substances and pack-year history.
For negation and experiencer detection, we made mistakes on cases which are particularly difficult. For example, consider the following sentence:
Patient's primary care provider was called to discuss outpatient plans to help the patient stop smoking.
In the above sentence, the phrase ‘patient stop smoking’ can mislead the system to predict a negated event. However, when we see the overall context, we can see that the patient is still continuing with his/her smoking habit. Next, consider the following sentence:
He works as a counselor at an alcohol and drug treatment facility for teenagers.
In the above sentence, the word ‘alcohol’ can mislead the system to predict a positive drug-abuse event. However, there is no drug abuse (either positive or negative) being reported here at all.
Medical Set Expansion
In Section 7, we saw that our system has somewhat low recall for concept identification. For concept identification, we have very limited annotated data. This prevents us from developing a supervised learning approach for concept identification.
Semi-Supervised Methods for Concept Identification
In the literature, several semi-supervised methods have been proposed for concept identification. The essential underlying principle behind these semi-supervised methods is that of bootstrapping. In bootstrapping, the input consists of a few examples (also called seeds) of the concept type which we are interested in. Then the system tries to grow the seed set by finding concepts which are similar to the seeds. The distributional context of the concepts generally provides a good way to test the similarity of any two concepts. The bootstrapping approach terminates when the system is unable to grow the seed set further.
For the bootstrapping approach to be successful, there should be a lot of instances of the concepts which we are interested in. If this is not the case, then the distributional context of the concepts would be very sparse and thus insufficient for computing the similarity between two mentions. This is exactly the problem that we face in the datasets that we are experimenting with. These datasets have very few instances of "drug abuse" events, which limits the usefulness of the bootstrapping approach.
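As a rough illustration of the bootstrapping loop described above, the sketch below grows a seed set using the overlap of context words. Everything in it is an assumption made for illustration: the Jaccard measure, the threshold of 0.5, the function names and the toy contexts are not taken from any of the cited systems.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap(seeds, contexts, threshold=0.5):
    """Grow the seed set until no remaining candidate is similar enough.

    `contexts` maps each term to the set of words observed around it.
    """
    accepted = set(seeds)
    while True:
        new = {
            term
            for term in contexts
            if term not in accepted
            and any(jaccard(contexts[term], contexts[s]) >= threshold
                    for s in accepted if s in contexts)
        }
        if not new:  # the seed set cannot grow further
            return accepted
        accepted |= new


# Toy distributional contexts (hand-written, hypothetical).
contexts = {
    "cocaine": {"abuse", "use", "denies", "history"},
    "marijuana": {"abuse", "use", "history"},
    "aspirin": {"daily", "mg", "takes"},
}
print(sorted(bootstrap({"cocaine"}, contexts)))
```

With only a handful of mentions per concept, the context sets become too small for such overlaps to be reliable, which is the sparsity problem noted above.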
Active Learning Solution for Concept Identification
Since our datasets have only a few instances of relevant concepts, we need to provide some extra level of supervision to our concept identification system. We rely on active learning methods to provide this extra level of supervision. In an active learning based solution, the system asks the user some questions. The answers provided by the user are used by the system to learn a model for identifying relevant concepts. A good active learning system should ask the user a minimal number of questions.
Moreover, since we lack a good distributional context of the relevant concepts, we use the tree positions of the concepts in a medical encyclopedia named SNOMED CT to find the similarity between mentions.
Using SNOMED CT for Medical Set Expansion
Using SNOMED CT, we build a detailed descriptor of every concept. Every concept can appear at multiple places in SNOMED CT. We define the descriptor of a concept to be simply the parents of the concept up to 5 levels higher. We explain it below with the help of an example. Let us consider the concept "cocaine". The descriptor of this concept is shown in Table 3. At level 0, two SNOMED CT concepts corresponding to "cocaine" are shown.
Cocaine measurement
Drug measurement, Tropane alkaloid, Ester type local anesthetic
Azabicyclo compound, Local anesthetic, Measurement of substance, Heterocyclic compound, Tropane alkaloid, Psychotherapeutic agent
Heterocyclic compound, Azabicyclo compound, Organic compound, Drug pseudoallergen by function, Tropane alkaloid, Psychotherapeutic agent
Evaluation procedure, Chemical categorized structurally, Heterocyclic compound, Azabicyclo compound, Organic compound, Drug pseudoallergen, Tropane alkaloid, Psychotherapeutic agent, Substance categorized functionally, General drug type
Table 3: This table shows the descriptor for concept "cocaine".
Concepts at any level i + 1 are basically the parents of concepts at level i. It is normal for some of the concepts to repeat at later levels. These descriptors were made by a simple breadth-first search on the SNOMED CT graph starting from the concept under consideration.
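A level-by-level breadth-first construction of such a descriptor might look like the sketch below. The `parents` dictionary is a hand-made toy graph: the concept names are borrowed from Table 3, but the edges are illustrative and are not actual SNOMED CT relationships. The function name `descriptor` is also an assumption.

```python
def descriptor(start_concepts, parents, max_level=5):
    """Return [L_0, L_1, ..., L_max_level] for the given concept(s).

    L_0 holds the starting concepts; L_{i+1} holds the direct parents of
    everything in L_i. Concepts may repeat at later levels, as in the text.
    """
    levels = [set(start_concepts)]
    for _ in range(max_level):
        levels.append({p for c in levels[-1] for p in parents.get(c, ())})
    return levels


# Toy parent graph (illustrative edges only).
parents = {
    "Cocaine": ["Tropane alkaloid", "Ester type local anesthetic"],
    "Cocaine measurement": ["Drug measurement"],
    "Tropane alkaloid": ["Azabicyclo compound"],
    "Ester type local anesthetic": ["Local anesthetic"],
    "Drug measurement": ["Measurement of substance"],
}

for i, level in enumerate(descriptor(["Cocaine", "Cocaine measurement"],
                                     parents, max_level=2)):
    print(i, sorted(level))
```

Starting from both level-0 senses at once is one way to handle a term that appears at multiple places in the graph.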
In this subsection, we describe how the user contributes to the learning of a model for concept identification. To begin with, the input to the system consists of a few seeds. Let us represent this seed set by $S$, and let $s_i$ denote the $i$-th element of the seed set. For finding the substances which are potentially used for drug abuse, the input can be the following: "cocaine", "marijuana", "alcohol". The system then computes the descriptors of each of the concepts and merges those descriptors into a single descriptor. Let us assume that for concept $x$, parents at level $i$ are denoted by the set $L_i(x)$. Then the levels of the overall descriptor are defined by the following equation:

$$L_i(S) = \bigcup_{s \in S} L_i(s) \qquad (2)$$

After some preprocessing (like removing overly general concepts), the descriptor is shown to the user. The user is then supposed to identify one or more of the most appropriate SNOMED CT concepts from the descriptor. The user response is recorded into a list. Let us call this list MedRep(S). No further input from the user is required.
Computing the Score of a Concept
In this subsection, we describe how to compute the similarity of any given SNOMED CT concept to the seed set, $S$, provided by the user. Let us denote the given SNOMED CT concept by the variable $x$. Also, assume that for concept $x$, parents at level $i$ are denoted by the set $L_i(x)$. Then the similarity, sim(x, S), of the concept $x$ to the seed set $S$ is defined by the following equation:

$$\mathrm{sim}(x, S) = \left| \left( \bigcup_{i=0}^{5} L_i(x) \right) \cap \mathrm{MedRep}(S) \right| \qquad (3)$$

In other words, the similarity of a concept to the seed set is the number of unique SNOMED CT concepts in the descriptor of the concept that also appear in the representative model of the seed set given to the system.
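The similarity just described is a set-overlap count, so it can be sketched in a few lines. In this sketch a descriptor is assumed to be a list of sets, one per level; the candidate descriptors are abbreviated and partly invented for illustration, while the MedRep(S) entries are taken from the list given later in this chapter.

```python
def sim(levels, med_rep):
    """Number of unique descriptor concepts that also appear in MedRep(S)."""
    in_descriptor = set().union(*levels)
    return len(in_descriptor & set(med_rep))


# A few representatives from the MedRep(S) list reported in this chapter.
med_rep = {"Psychoactive substance", "Psychotherapeutic agent",
           "Substance of abuse"}

# Abbreviated, illustrative descriptors for two candidate noun phrases.
candidates = {
    "cocaine": [{"Cocaine"},
                {"Tropane alkaloid", "Ester type local anesthetic"},
                {"Local anesthetic", "Psychotherapeutic agent"}],
    "chest pain": [{"Chest pain"}, {"Pain of truncal structure"}],
}

# Rank candidates by decreasing similarity to the seed set.
ranked = sorted(candidates, key=lambda np: sim(candidates[np], med_rep),
                reverse=True)
print(ranked)
```

Because the score counts unique concepts, a concept that repeats at several levels of the descriptor contributes only once.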
Performing Concept Identification
After receiving the user input, the system proceeds to find the relevant concepts from the provided dataset. Relevant concepts are found using the following steps:
1. First of all, we use a chunker to find all the NPs (noun phrases) in the given document.
2. Each of the noun phrases found in Step 1 is mapped to SNOMED CT concepts using a biomedical engine (MetaMap).
3. Then we compute the score of each NP as described in the previous subsection (§8.5).
4. Finally, the noun phrases are displayed to the user in decreasing order of score.
Algorithm 1 explains the overall algorithm for medical set expansion.

Algorithm 1: MedicalSetExpansion
Input : S (Seed Set), D (Document Set)
Output: R_D(S) (Ranked List of concepts)
for every seed s ∈ S do
    Compute the descriptor of s using Breadth First Search on the SNOMED CT graph
Compute the overall descriptor of S by merging the individual descriptors according to Equation (2)
Display the overall descriptor to the user after some pre-processing
Record the user response in MedRep(S)
for each noun phrase x in D do
    Compute sim(x, S) according to Equation (3)
R_D(S) ← List of NPs sorted by similarity (descending order)
Focussing on Drug Abuse Events
Using the concept recognition technique described in §8.6, it is possible to build a recognizer for any concept type that we may be interested in. For example, one may build a recognizer for finding the mentions of heart problems. Other examples of recognizers include lung problems, kidney problems, pain-killers, closed surgeries, drug abuse events, sex-related matters, genomic data etc.
In this section, we will focus on the recognizer for drug abuse events. In §5.1, we described a recognizer for drug abuse events based on dictionary lookup. §A gives a list of popular drugs that are often used for abuse. This list was compiled from these websites: Wikipedia, SAMHSA and MedlinePlus.
In §8.6, we described yet another technique of concept recognition, using medical set expansion. In that technique, the model for concept identification consists of a list (called MedRep(S)) which contains the representatives of the desired concept type in a medical encyclopedia (namely SNOMED CT). For the concept type "drugs used for substance abuse", MedRep(S) contains the following elements:
1. Psychoactive substance
2. Alcoholic Beverage
3. Central Depressant
5. Alcohol products
6. Substance of abuse
12. Tobacco smoking behavior
13. Tobacco use and exposure
14. Psychotherapeutic agent
15. Psychostimulant
17. Morphine Derivative
20. Drugs used to treat addiction
21. Carboxylic acid and/or salt
23. Centrally acting muscle relaxant
24. Centrally acting hypotensive agent
25. Cardiovascular agent
26. Sympathomimetic agent
28. Inhaled Drug Administration
30. Anxiolytic, sedative AND/OR hypnotic
The above list was obtained using just a few seed words like cocaine, hashish, beer, wine, cannabis, smoking etc.

Table 4: This table shows the performance of concept extraction for drug-abuse concepts.
Table 4 shows the results for concept identification for the dataset described in §4. We see from this table that the recall improved from 82.5 to 89.3 whereas the precision dropped a little. Overall, the F1 score increased from 89.5 to 92.1.
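The relation between the reported recall and F1 values can be checked with a short calculation. The sketch below assumes that F1 is the usual harmonic mean of precision and recall; the helper `implied_precision` is hypothetical and simply solves that formula for precision. Since the inputs are rounded, the resulting precision values are approximate and are derived here, not quoted from the tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    if 2 * recall <= f1:
        raise ValueError("no valid precision for these values")
    return f1 * recall / (2 * recall - f1)


# Recall and F1 as reported for dictionary lookup and for set expansion.
for name, recall, f1 in [("dictionary lookup", 82.5, 89.5),
                         ("set expansion", 89.3, 92.1)]:
    print(f"{name}: implied precision {implied_precision(f1, recall):.1f}")
```

Under this assumption the implied precision is roughly 97.8 before and 95.1 after set expansion, which agrees with the statement that precision dropped a little while recall rose.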
To further test the effectiveness of our system in identifying the substances used for drug abuse, we prepared a dataset using medical forums where people discuss issues related to drug addiction. The dataset contained a total of 135 distinct substances that can be used for drug abuse. Out of these 135 substances, our system correctly identified 55. Thus, we achieved a recall of 40.7.
The above results indicate that our system still misses many drugs that are used for abuse. §B gives a list of drugs that were missed by our system. Below we identify the main reasons for missing such drugs:
1. One primary reason for the low recall was that SNOMED CT does not always have the trademark names for the drugs. For example, Lorazepam is a drug that can potentially be abused. Its tradename is Ativan. Although SNOMED CT has an entry for Lorazepam, it does not have an entry for Ativan. A similar thing happened with the concepts Percocet, Vicodin, Darvocet, Ritalin and Lorcet, which are tradenames for oxycodone, hydrocodone, propoxyphene, methylphenidate and hydrocodone bitartrate respectively.
2. Another reason for the low recall is that sometimes the drugs are mentioned by their street names, which are not present in SNOMED CT. For example, street names for the drug marijuana are ganja, grass, green, Mary Jane etc. Similarly, street names for the drug cocaine are candy, Charlie, toot, crack etc.
3. A third reason for the low recall is that SNOMED CT sometimes doesn't have the abbreviations for the drug names. For example, it does not have the abbreviations LAAM (levacetylmethadol), PCP (phencyclidine) etc.
Future Work
The following are good directions for future work:
1. Wikipedia has a lot of medical knowledge. As discussed above in §10, a good amount of knowledge in Wikipedia is not even covered in medical encyclopedias like SNOMED CT. So, it will be a very good project to extract the medical knowledge in Wikipedia and put it in a structured database. For example, Wikipedia can tell the tradenames and common abbreviations for a lot of drugs. The following are good sources of information in Wikipedia:
(a) Hyperlinks in free text
(b) Redirect Pages
(c) Disambiguation Pages
2. Like Wikipedia, there are several other sources of medical information on the web. One very good source for medical information is MedlinePlus. It will be good to extract medical information from it. There is another website, MediLexicon, which gives a lot of useful medical abbreviations.
3. Another good way to get useful medical knowledge is to send automated queries to web search engines. The top pages from the search results can then be used to glean useful medical information. It will be good to design the protocol such that the queries to the search engine are minimized, because some search engines block the IP addresses which send too many queries.
Related Work
The task of set-expansion has been addressed in several works. We report here the most significant efforts towards this task.
Web-based Set-Expansion systems
Google™ has a proprietary system, Google Sets, for set-expansion. Google Sets makes use of the lists identified by the Google search engine as it crawls the web. Items given as input to Google Sets are matched against these lists, and probabilities are calculated to determine which items might be a good match for the desired concept. The lists produced by Google Sets are quite small (≤ 50). Google Sets has been used for a number of purposes in the research community, including deriving features for named-entity recognition [3] and evaluation of question answering systems [4].
Another system for set-expansion is Boowa [5, 6, 7, 8, 9]. Like Google Sets, Boowa works by finding semi-structured web pages that contain "lists" of items, and then aggregating these "lists" so that the most promising items are ranked higher. Unlike Google Sets, Boowa produces extensive lists. Boowa accepts a maximum of 3 seeds. Since Boowa tries to find all the seeds on the same web page, its performance may go down with an increasing number of seeds, as is also the case with Google Sets.
The KnowItAll system of Etzioni et al. [10] depends on the output of existing search engines to extract collections of facts from the Web. Etzioni et al. [10] use Pattern Learning, Subclass Extraction and List Extraction to improve KnowItAll's recall.
Set-Expansion systems for free text
For set-expansion on free text, pattern recognition and distributional similarity have primarily been used.
Riloff and Jones [11] used a two-level bootstrapping mechanism based on pattern recognition for set-expansion. Their algorithm starts with a few seeds from the category of interest. Then they run multiple iterations where, in each iteration, they add 5 new members to the list. Since they need to make one pass over the entire corpus for every iteration, their method is quite inefficient. Moreover, their algorithm is very sensitive to the erroneous members which may get added to the list during the expansion.
Talukdar et al. [12] present a context pattern induction method for named-entity extraction. Their method automatically selects trigger words to mark the beginning of a pattern, which is then used for bootstrapping from free text. However, they focussed on very broad entity types like Location, Person and Organization, whereas we are interested in finer concepts like Athletes, Actors etc. Moreover, they used hundreds of seeds for constructing the semantic lexicons. On the other hand, we give a much smaller number of seeds.
Sarmento et al. [13] present a corpus-based approach to set-expansion. For a given set of seed entities, they use co-occurrence statistics taken from a text collection to define a membership function that is used to rank candidate entities for inclusion in the set. They represent entities as vectors and essentially construct a centroid of the seed set.
Pantel et al. [14] developed a parallel implementation for computing the pairwise semantic similarity between the entities. They applied the learned similarity matrix to the task of set-expansion using the centroid-based algorithm developed by Sarmento et al. [13]. They present a large empirical study to quantify the effect of corpus size, corpus quality, seed composition and seed size on set-expansion performance.
Set-Expansion systems using Integrated approaches
Talukdar et al. [15] present a graph-based semi-supervised label propagation algorithm for acquiring open-domain labeled classes and their instances from a combination of unstructured and structured text sources. Pennacchiotti and Pantel [16] present a framework called Ensemble Semantics for modeling information extraction algorithms that combine multiple sources of information and multiple extractors. Pasca and Van Durme [17] present an approach to information extraction that exploits both Web documents and query logs to acquire open-domain classes of instances, along with relevant sets of open-domain class attributes.
Ghahramani and Heller [18] illustrate a Bayesian Sets algorithm that solves a particular sub-problem of set-expansion, in which candidate sets are given rather than a corpus of documents.
Use of Negative Examples in Set-Expansion
Thelen and Riloff [19] and Lin et al. [20] present a framework to learn several semantic classes simultaneously. In this framework, the instances which have been accepted by any one semantic class serve as negative examples for all other semantic classes. This approach is limited because it necessitates the learning of several semantic classes simultaneously. Moreover, negative examples are not useful if the different semantic classes are not related to one another. Winston et al. note that it is not easy to acquire good negative examples. The approach presented by us allows the use of negative examples even when there is only one semantic class. Also, we present a strategy to easily acquire good negative examples.
In this chapter, we focus on set-expansion from free text. So, we don't compare our system with the systems which use textual sources other than free text (e.g. semi-structured web pages or query logs). The works of Sarmento et al. [13] and Pantel et al. [14] are the state-of-the-art works that come closest to our approach. In our experiments, we compare the centroid-based approach employed by them with the approach developed by us.
Recently, there has been a lot of work centered around Wikipedia. Ratinov et al. [21] analyze local and global approaches for disambiguation to Wikipedia. Yan et al. [22] present an unsupervised relation extraction method for discovering and enhancing relations in which a specified concept in Wikipedia participates. Using respective characteristics of Wikipedia articles and Web corpus, they develop a clustering approach based on combinations of patterns: dependency patterns from dependency analysis of texts in Wikipedia, and surface patterns generated from highly redundant information related to the Web. Nguyen and Moschitti [23] extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. They show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Wu and Weld [24] present WOE, an open IE system which improves dramatically on TextRunner's [25, 26] precision and recall. The key to WOE's performance is a novel form of self-supervised learning for open extractors, using heuristic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data.
In this report, we presented a study on the detection of drug abuse events in medical text. We explored different state-of-the-art techniques for determining the negation status and experiencer of drug abuse events. For finding the drug abuse concepts, we used an active learning based approach to set expansion. The medical knowledge needed in the set-expansion process was obtained from SNOMED CT. We showed that our concept identification technique is able to successfully find even uncommon drugs which are used for abuse. However, since SNOMED CT does not have tradenames and street names for many concepts, a good direction for future research is to augment the current system with knowledge from the web.
We thankfully acknowledge the contribution of Tony Michalos in helping us to prepare the datasets used in the experimental work. We are also thankful to Prof. Ozlem Uzuner for making available the clinical narratives used in i2b2 NLP challenges. This research was supported by Grant HHS 90TR0003/01. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the HHS or the US government.
Popular Drug Abuse Substances
Following is a list of the most popular substances which are used for drug abuse:
3. anabolic steroids
6. benzodiazepines (particularly alprazolam, temazepam, diazepam and
12. depressants (sedatives)
15. hallucinogens
22. methamphetamine
27. pain relievers
29. psychotherapeutics
34. tranquilizers
Concepts that We Missed
Following is a list of drug abuse substances that were not detected by our software:
References
[1] H. Harkema, J. N. Dowling, T. Thornblade, and W. W. Chapman, "Context: An algorithm for determining negation, experiencer, and temporal status from clinical reports," Journal of biomedical informatics, vol. 42, no. 5, pp. 839–851, 2009.
[2] W. Chapman, W. Bridewell, P. Hanbury, G. Cooper, and B. Buchanan, "A simple algorithm for identifying negated findings and diseases in discharge summaries," Journal of biomedical informatics, vol. 34, no. 5, pp. 301–310, 2001.
[3] B. Settles, "Biomedical named entity recognition using conditional random fields and rich feature sets," in Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics, 2004, pp. 104–107.
[4] J. Prager, "Question answering using constraint satisfaction," in Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04). Citeseer, 2004.
[5] R. Wang and W. Cohen, "Language-independent set expansion of named entities using the web," in ICDM. IEEE Computer Society, 2007, pp. 342–350.
[6] R. Wang and W. Cohen, "Iterative set expansion of named entities using the web," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 1091–1096.
[7] R. Wang and W. Cohen, "Character-level analysis of semi-structured documents for set expansion," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3. Association for Computational Linguistics, 2009, pp. 1503–1512.
[8] R. Wang, N. Schlaefer, W. Cohen, and E. Nyberg, "Automatic set expansion for list question answering," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008, pp. 947–954.
[9] R. Wang and W. Cohen, "Automatic set instance extraction using the web," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, 2009, pp. 441–449.
[10] O. Etzioni, M. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: An experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91–134, 2005.
[11] E. Riloff and R. Jones, "Learning dictionaries for information extraction by multi-level bootstrapping," in Proceedings of the National Conference on Artificial Intelligence. John Wiley & Sons Ltd, 1999, pp. 474–479.
[12] P. Talukdar, T. Brants, M. Liberman, and F. Pereira, "A context pattern induction method for named entity extraction," in Proceedings of the Tenth Conference on CoNLL. ACL, 2006, pp. 141–148.
[13] L. Sarmento, V. Jijkoun, M. de Rijke, and E. Oliveira, "More like these: growing entity classes from seeds," in Proceedings of the sixteenth ACM conference on CIKM. ACM, 2007, pp. 959–962.
[14] P. Pantel, E. Crestan, A. Borkovsky, A. Popescu, and V. Vyas, "Web-scale distributional similarity and entity set expansion," in Proceedings of the 2009 Conference on EMNLP. ACL, 2009, pp. 938–947.
[15] P. Talukdar, J. Reisinger, M. Paşca, D. Ravichandran, R. Bhagat, and F. Pereira, "Weakly-supervised acquisition of labeled class instances using graph random walks," in Proceedings of the Conference on EMNLP. ACL, 2008, pp. 582–590.
[16] M. Pennacchiotti and P. Pantel, "Entity extraction via ensemble semantics," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 2009, pp. 238–247.
[17] M. Pasca and B. Van Durme, "Weakly-supervised acquisition of open-domain classes and class attributes from web documents and query logs," in Proceedings of the 46th Annual Meeting of the ACL (ACL-08). Citeseer, 2008, pp. 19–27.
[18] Z. Ghahramani and K. Heller, "Bayesian sets," Advances in Neural Information Processing Systems, vol. 18, p. 435, 2006.
[19] M. Thelen and E. Riloff, "A bootstrapping method for learning semantic lexicons using extraction pattern contexts," in Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Volume 10. Association for Computational Linguistics, 2002, pp. 214–221.
[20] W. Lin, R. Yangarber, and R. Grishman, "Bootstrapped learning of semantic classes from positive and negative examples," in Proceedings of ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data, vol. 1, no. 4, 2003, p. 21.
[21] L.-A. Ratinov, D. Roth, D. Downey, and M. Anderson, "Local and global algorithms for disambiguation to wikipedia," in ACL, vol. 11, 2011, pp.
[22] Y. Yan, N. Okazaki, Y. Matsuo, Z. Yang, and M. Ishizuka, "Unsupervised relation extraction by mining wikipedia texts using information from the web," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2. Association for Computational Linguistics, 2009, pp. 1021–1029.
[23] T.-V. T. Nguyen and A. Moschitti, "End-to-end relation extraction using distant supervision from external semantic repositories," in ACL (Short Papers), 2011, pp. 277–282.
[24] F. Wu and D. S. Weld, "Open information extraction using wikipedia," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp.
[25] A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland, "Textrunner: open information extraction on the web," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2007, pp.
[26] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the web," in IJCAI, vol. 7, 2007, pp.


