processors-agiga

What is it?

A way to convert Annotated English Gigaword (agiga) data into a processors Document.

Why spend the time and resources parsing and annotating over 183 million sentences when it has already been done?

Dependencies

  1. Java 8
  2. sbt
  3. A copy of the Annotated English Gigaword

Reading an annotated English Gigaword xml file

import org.clulab.agiga

// build a processors.Document
val doc = agiga.toDocument("path/to/agiga/xml/ltw_eng_200705.xml.gz")

Running AgigaReader

Example 1: dump a lemmatized form of the English Gigaword

Everything is configured in the application.conf file.

  1. Change the view property to "lemmas"

  2. Change the inputDir property to wherever your copy of agiga is nestled on your disk

  3. Change the outputDir property to wherever you want the compressed, lemmatized copy of the English Gigaword to be written

  4. (Optional) Change the nthreads property to the maximum number of threads you prefer to use for parallelization.

All that's left is to run AgigaReader:

sbt "runMain sem.AgigaReader"

Options for "view"

Value          Description
"words"        word form of each token
"lemmas"       lemma form of each token
"tags"         PoS tag of each token
"entities"     NE labels of each token
"deps"         <word form of head>_<relation>_<word form of dependent>
"lemma-deps"   <lemmatized head>_<relation>_<lemmatized dependent>
"tag-deps"     <pos tag of head>_<relation>_<pos tag of dependent>
"entity-deps"  <NE label of head>_<relation>_<NE label of dependent>
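
To make the view formats above concrete, here is a minimal, self-contained sketch of how each value could map onto a sentence. It is not the AgigaReader implementation: Token, Dep, ViewSketch, tokenView and depView are hypothetical names invented for illustration, and the real output layout (separators, one sentence per line, and so on) may differ.

```scala
// Hypothetical stand-ins for illustration only; these are NOT the processors or agiga classes.
final case class Token(word: String, lemma: String, tag: String, entity: String)
// head and dependent are indices into the sentence's token sequence.
final case class Dep(head: Int, relation: String, dependent: Int)

object ViewSketch {

  // Token-level views: one string per token.
  def tokenView(view: String, tokens: Seq[Token]): Seq[String] = view match {
    case "words"    => tokens.map(_.word)
    case "lemmas"   => tokens.map(_.lemma)
    case "tags"     => tokens.map(_.tag)
    case "entities" => tokens.map(_.entity)
    case other      => throw new IllegalArgumentException(s"not a token view: $other")
  }

  // Dependency views: one <head>_<relation>_<dependent> string per dependency edge.
  def depView(view: String, tokens: Seq[Token], deps: Seq[Dep]): Seq[String] = {
    // Pick which attribute of the head and dependent tokens to print.
    val form: Token => String = view match {
      case "deps"        => (t: Token) => t.word
      case "lemma-deps"  => (t: Token) => t.lemma
      case "tag-deps"    => (t: Token) => t.tag
      case "entity-deps" => (t: Token) => t.entity
      case other         => throw new IllegalArgumentException(s"not a dependency view: $other")
    }
    deps.map(d => s"${form(tokens(d.head))}_${d.relation}_${form(tokens(d.dependent))}")
  }
}
```

For a two-token sentence "Dogs bark" with a single nsubj edge from "bark" to "Dogs", this sketch would give dog and bark for "lemmas", bark_nsubj_Dogs for "deps", and VBP_nsubj_NNS for "tag-deps".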

TODO

References

  1. Annotated Gigaword