`processors-agiga`

What is it?

Just a way to convert agiga (Annotated English Gigaword) data into a `processors` `Document`.

Why spend the time and resources parsing and annotating over 183 million sentences when it has already been done?

import org.clulab.agiga

// build a processors.Document
val doc = agiga.toDocument("path/to/agiga/xml/ltw_eng_200705.xml.gz")

Everything is configured in the application.conf file.

Change the view property to "lemmas".
Change the inputDir property to wherever your copy of agiga is nestled on your disk.
Change the outputDir property to wherever you want your compressed copy of the lemmatized English Gigaword to be written.
(Optional) Change the nthreads property to the maximum number of threads you prefer to use for parallelization.
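
As a rough sketch, those settings might look like this in application.conf. The key names come from the steps above; the flat top-level layout, the paths, and the thread count are assumptions made for illustration, so check them against the application.conf that ships with the project.

```hocon
# Sketch only: key names are taken from the steps above.
# The flat layout and every value below are placeholders, not the project's defaults.
view = "lemmas"                   # which representation to write out
inputDir = "/path/to/agiga/xml"   # where your copy of agiga lives
outputDir = "/path/to/output"     # where the compressed output is written
nthreads = 4                      # optional: upper bound on threads used
```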

All that's left is to run AgigaReader:

sbt "runMain sem.AgigaReader"

| Value | Description |
| --- | --- |
| "words" | word form of each token |
| "lemmas" | lemma form of each token |
| "tags" | PoS tag of each token |
| "entities" | NE labels of each token |
| "deps" | `<word form of head>_<relation>_<word form of dependent>` |
| "lemma-deps" | `<lemmatized head>_<relation>_<lemmatized dependent>` |
| "tag-deps" | `<pos tag of head>_<relation>_<pos tag of dependent>` |
| "entity-deps" | `<NE label of head>_<relation>_<NE label of dependent>` |

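To make the formats in the table concrete, here is a minimal, self-contained Scala sketch of how each value could be rendered for a single sentence. `Token`, `Dep`, `ViewSketch`, and the example sentence are hypothetical names invented for this illustration; they are not part of processors or agiga, and the sketch does not claim to mirror how AgigaReader is implemented or how it separates items in its output.

```scala
// Hypothetical types for illustration only; not the processors or agiga API.
final case class Token(word: String, lemma: String, tag: String, entity: String)
// head and dependent are indices into the sentence's token sequence.
final case class Dep(head: Int, relation: String, dependent: Int)

object ViewSketch {
  // Map a view name to the token attribute it is built from.
  private def attribute(view: String): Token => String = view match {
    case "words" | "deps"           => (t: Token) => t.word
    case "lemmas" | "lemma-deps"    => (t: Token) => t.lemma
    case "tags" | "tag-deps"        => (t: Token) => t.tag
    case "entities" | "entity-deps" => (t: Token) => t.entity
    case other => throw new IllegalArgumentException(s"unknown view: $other")
  }

  // Token-level views yield one string per token;
  // dependency views yield one <head>_<relation>_<dependent> string per dependency.
  def render(view: String, tokens: IndexedSeq[Token], deps: Seq[Dep]): Seq[String] = {
    val attr = attribute(view)
    if (view.endsWith("deps"))
      deps.map(d => s"${attr(tokens(d.head))}_${d.relation}_${attr(tokens(d.dependent))}")
    else
      tokens.map(attr)
  }

  // Made-up example sentence: "Dogs chased Felix"
  val tokens: IndexedSeq[Token] = Vector(
    Token("Dogs", "dog", "NNS", "O"),
    Token("chased", "chase", "VBD", "O"),
    Token("Felix", "Felix", "NNP", "PERSON")
  )
  val deps: Seq[Dep] = Seq(Dep(1, "nsubj", 0), Dep(1, "dobj", 2))

  def main(args: Array[String]): Unit = {
    val views = Seq("words", "lemmas", "tags", "entities",
                    "deps", "lemma-deps", "tag-deps", "entity-deps")
    views.foreach(v => println(s"$v: ${render(v, tokens, deps).mkString(" ")}"))
  }
}
```

For the example sentence, `render("lemma-deps", tokens, deps)` gives `chase_nsubj_dog` and `chase_dobj_Felix`, while `render("entity-deps", tokens, deps)` gives `O_nsubj_O` and `O_dobj_PERSON`.
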
Add output options for dependencies using the DFS ordering described in "Higher-order Lexical Semantic Models for Non-factoid Answer Reranking"