DeTistics | Publications

Text Mining Publications

Harvesting Ontologies | text mining | nlp | automatic ontology generation

Harvesting Domain Specific Ontologies from Text

Hamid Mousavi, Deirdre Kerr, Markus R. Iseli, and Carlo Zaniolo - June 2014

Ontologies contain the concepts of a given domain and the relationships between those concepts. Ontologies are a vital component of most knowledge-based applications, including semantic web search, intelligent information integration, and natural language processing. While ontologies have traditionally been generated manually or using highly supervised approaches, such approaches do not scale well. In this paper, we propose a new approach that automatically generates domain-specific ontologies from a corpus using deep NLP-based text mining. Starting from a small initial seed of concepts, OntoHarvester iteratively extracts ontological relationships connecting existing concepts to other terms in the text and adds strongly connected terms to the ontology. The resulting ontologies are comprehensive, focused, and resistant to noise, and they outperform both manually generated ontologies and ontologies generated by current techniques, even those that require very large, well-focused data sets.
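The iterative expansion described in this abstract can be illustrated with a minimal sketch. It is not OntoHarvester's implementation: the function name `harvest`, the representation of relationships as plain term pairs, and the `min_links` threshold used as a stand-in for "strongly connected" are all assumptions made for illustration.

```python
from collections import defaultdict


def harvest(seed, relations, min_links=2, max_rounds=10):
    """Grow a concept set from `seed` using (term_a, term_b) relation pairs.

    Hypothetical rule: a candidate term is added once it is linked to at
    least `min_links` distinct concepts already in the ontology.
    """
    ontology = set(seed)
    for _ in range(max_rounds):
        # Map each outside term to the ontology concepts it is linked to.
        links = defaultdict(set)
        for a, b in relations:
            if a in ontology and b not in ontology:
                links[b].add(a)
            elif b in ontology and a not in ontology:
                links[a].add(b)
        new_terms = {t for t, concepts in links.items()
                     if len(concepts) >= min_links}
        if not new_terms:
            break  # nothing strongly connected is left to add
        ontology |= new_terms
    return ontology


# Toy relation pairs, invented for this example.
relations = [
    ("cell", "nucleus"), ("cell", "membrane"), ("nucleus", "dna"),
    ("membrane", "dna"), ("cell", "phone"), ("nucleus", "membrane"),
]
result = harvest({"cell", "nucleus"}, relations)
```

In this toy run, "membrane" is added in the first round and "dna" in the second, while "phone" stays out because it is linked to only one existing concept, which is how a connectivity threshold can keep weakly related noise out of the ontology.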

Mining Semantic Structures | text mining | nlp | semantic information from syntactic structures

Mining Semantic Structures from Syntactic Structures in Free Text

Hamid Mousavi, Deirdre Kerr, Markus R. Iseli, and Carlo Zaniolo - March 2014

Advances in the Web have given rise to many ambitious text-mining applications such as review or news summarization, essay grading, question answering, and semantic search. For many such applications, statistical text-mining techniques are ineffective and provide very low recall, since they do not utilize morphological structures of the text. Thus, many approaches are now using deeper NLP-based techniques, by parsing the text and employing patterns to mine and analyze it. However, in addition to being noisy, parse trees and other similar structures contain many of the syntactical structures in the text. Analyzing such structures requires many complex patterns, which are very costly to generate. To address this issue, we present a weighted graph-based representation of text, called a TextGraph, which provides the grammatical and semantic relations between words and terms in the text, as well as a SPARQL-like query language and an optimized engine for semantically querying and mining TextGraphs.

Automatic Essay Scoring | text mining | nlp | score content | proposition extraction

Automatically Scoring Short Essays for Content

Deirdre Kerr, Hamid Mousavi, and Markus R. Iseli - December 2013

New assessments emphasize short essay constructed response items over multiple choice items because they are more precise measures of understanding. However, such items are too costly and time consuming to be used in large-scale assessments unless they can be scored automatically. Current automatic essay scoring techniques are inappropriate for scoring essay content because they rely on either grammatical measures of quality or machine learning techniques, neither of which identifies statements of meaning (propositions) in the text. In this paper, we explain our process of (1) extracting meaning from essays in the form of propositions using our text mining framework called SemScape, (2) using the extracted propositions to score the essays, and (3) testing SemScape's performance on two separate sets of essays. Results demonstrate the potential of this purely semantic process and indicate that the system can accurately extract propositions from short essays, approaching or exceeding standard benchmarks for scoring performance.
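Step (2), scoring from extracted propositions, can be pictured with a small sketch. The abstract does not specify the scoring rule, so the coverage measure below is an assumption, as are the function name `score_essay` and the representation of propositions as (subject, relation, object) tuples; the extraction performed by SemScape is not modeled here.

```python
def score_essay(extracted, rubric):
    """Return the fraction of rubric propositions found in the essay.

    Both arguments are iterables of (subject, relation, object) tuples.
    This coverage rule is a hypothetical stand-in, not the paper's method.
    """
    rubric = set(rubric)
    if not rubric:
        return 0.0
    found = set(extracted) & rubric
    return len(found) / len(rubric)


# Toy propositions, invented for this example.
rubric = [
    ("plants", "absorb", "sunlight"),
    ("plants", "produce", "oxygen"),
    ("plants", "need", "water"),
    ("chlorophyll", "captures", "light"),
]
extracted = [
    ("plants", "absorb", "sunlight"),
    ("plants", "produce", "oxygen"),
    ("plants", "need", "water"),
    ("plants", "are", "green"),  # not in the rubric, so it earns nothing
]
score = score_essay(extracted, rubric)
```

Here the essay covers three of the four rubric propositions. A real scorer would also need to match paraphrased propositions, which exact tuple comparison does not handle.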

Deducing Infoboxes | text mining | nlp | infoboxes | unstructured text | wikipedia

Deducing InfoBoxes from Unstructured Text in Wikipedia Pages

Hamid Mousavi, Deirdre Kerr, and Markus R. Iseli - January 2013

InfoBoxes in Wikipedia pages were originally quick references for readers. However, knowledge bases built from InfoBoxes now play a crucial role in a variety of important applications, including review summarization, document categorization, question answering, and semantic search. Current InfoBoxes suffer from incompleteness, inconsistencies, and inaccuracies, because they are created manually. Previous attempts to correct these problems have relied on text mining approaches that exploit structured information such as internal links, redirects, or disambiguation pages. In this paper, we present a novel system, IBminer, to derive InfoBox information from the free text in Wikipedia using NLP. IBminer generates subject-attribute-value triples from TextGraphs using predefined SPARQL-like queries. After resolving pronouns and co-references, the attribute names are matched to those in currently existing InfoBox triples. Additionally, IBminer can use existing knowledge bases to suggest new or incorrect InfoBox triples, and propose attribute synonyms.

Framework for Text Mining | text mining | nlp | parse trees | linguistics

A New Framework for Textual Information Mining over Parse Trees

Hamid Mousavi, Deirdre Kerr, and Markus R. Iseli - September 2011

Mining information from text is a challenging problem, not easily solved through either statistical or rule-based natural language processing techniques. This paper introduces a domain-independent rule-based text mining framework that uses a tree-based Linguistic Query Language, called LQL, to extract grammatical relationships between words. The framework generates parse trees for each sentence using a probabilistic parser, and annotates each node of these parse trees with main-parts information from the node's branch based on the linguistic structure of the branch. Using main-parts-annotated parse trees for a given textual dataset, the system can efficiently answer individual queries as well as mine the text for a given set of queries. The framework also has the ability to support grammatical ambiguity through probabilistic rules and linguistic exceptions in order to increase the quality of the extracted information.