lib-text
A little text processing library for Scala.
Overview
lib-text supports language identification, tokenization and stopword filtering, and provides some useful helper functions. The tokenizer is tuned for text conventions common on social media such as Twitter, and handles URLs, emoji, hashtags, email addresses and @-mentions cleanly. Stopword filtering is currently supported for:
- German
- English
- Spanish
- French
- Indonesian
- Japanese
- Malay
- Dutch
- Portuguese
- Swedish
- Turkish
- Arabic
More to come.
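To make the idea of per-language stopword filtering concrete, here is a minimal sketch in plain Scala. It does not use lib-text and is not how the library is implemented: the `StopwordSketch` object, its `filter` method and the tiny word lists are all hypothetical and chosen only for illustration.

```scala
object StopwordSketch {
  // Hypothetical, deliberately tiny stopword lists keyed by language code.
  // These are NOT the lists shipped with lib-text.
  val stopwords: Map[String, Set[String]] = Map(
    "en" -> Set("the", "a", "of", "and", "is", "on"),
    "de" -> Set("der", "die", "das", "und", "ist")
  )

  // Keep the tokens whose lowercase form is not a stopword for the language.
  // Unknown language codes fall back to an empty list, so nothing is removed.
  def filter(tokens: Seq[String], lang: String): Seq[String] = {
    val stops = stopwords.getOrElse(lang, Set.empty[String])
    tokens.filterNot(t => stops.contains(t.toLowerCase))
  }
}
```

With these assumed lists, `StopwordSketch.filter(Seq("The", "copy", "of", "the", "album"), "en")` keeps only `copy` and `album`.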
Usage
Add to your project dependencies:
resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"
libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3"
Example
import com.peoplepattern.text.Implicits._
val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"
txt.lang
// Some(en)
txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)
txt.terms
// Set(print, personalised, black, copy, friday, time)
txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)
txt.termBigrams
// Set(black friday, personalised print)
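The bigram output above is consistent with pairing adjacent tokens that both survive stopword filtering. The sketch below shows that reading in plain Scala. It is an assumption based only on the example output, not the library's actual implementation, and `BigramSketch` and its `isTerm` parameter are hypothetical names.

```scala
object BigramSketch {
  // Hypothetical helper: lowercase the tokens, slide a window of two over
  // them, and keep a pair only when both tokens count as terms.
  def termBigrams(tokens: Seq[String], isTerm: String => Boolean): Set[String] =
    tokens
      .map(_.toLowerCase)
      .sliding(2)
      .collect { case Seq(a, b) if isTerm(a) && isTerm(b) => s"$a $b" }
      .toSet
}

object BigramSketchDemo {
  // The term set reported by txt.terms in the example above.
  val terms = Set("print", "personalised", "black", "copy", "friday", "time")

  // A prefix of the token vector reported by txt.tokens.
  val tokens = Seq("Did", "you", "get", "your", "personalised", "print",
    "with", "your", "copy", "of", "#MadeintheAM", "on", "Black", "Friday", "?")

  val bigrams: Set[String] = BigramSketch.termBigrams(tokens, terms.contains)
}
```

Under these assumptions, `BigramSketchDemo.bigrams` contains `personalised print` and `black friday`, and it excludes pairs such as `copy of` because `of` is not a term.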
License
lib-text is open source and licensed under the Apache License 2.0.
Acknowledgements
Developed with ❤️ at People Pattern Corporation