truerss / content-extractor   1.0.2

MIT License GitHub

Java library. Detects the top-level selector on an HTML page.

Scala versions: 2.12


A Java library that returns the selector of the element holding the largest amount of content on the page.
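
To illustrate what "the selector with the largest amount of content" can mean, here is a minimal sketch in plain Java. It is not the library's actual algorithm, which this README does not describe. The `Node` class, the `richestSelector` method, and the scoring rule (length of an element's own trimmed text) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LargestContentSketch {

    /** Hypothetical stand-in for an HTML element: a tag name, its own text, and child nodes. */
    static final class Node {
        final String tag;
        final String ownText;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String ownText) {
            this.tag = tag;
            this.ownText = ownText;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Holds the best candidate seen so far during the tree walk. */
    private static final class Best {
        String selector = "";
        int length = -1;
    }

    /** Returns a child-combinator path (e.g. "body > div > article") to the node with the most own text. */
    static String richestSelector(Node root) {
        Best best = new Best();
        walk(root, root.tag, best);
        return best.selector;
    }

    private static void walk(Node node, String path, Best best) {
        int length = node.ownText.trim().length();
        if (length > best.length) { // strict comparison: the first node visited wins ties
            best.length = length;
            best.selector = path;
        }
        for (Node child : node.children) {
            walk(child, path + " > " + child.tag, best);
        }
    }

    public static void main(String[] args) {
        Node body = new Node("body", "")
            .add(new Node("nav", "Home About"))
            .add(new Node("div", "")
                .add(new Node("article", "A long paragraph of article text that dominates the page.")));
        System.out.println(richestSelector(body)); // prints "body > div > article"
    }
}
```

A real extractor working on jsoup elements would likely weigh more signals than raw text length (for example link density or nesting depth), so treat this only as a picture of the general idea.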


// sbt: 

"io.github.truerss" % "content-extractor" % "1.0.5"

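// maven:

<dependency>
  <groupId>io.github.truerss</groupId>
  <artifactId>content-extractor</artifactId>
  <version>1.0.5</version>
</dependency>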

// gradle

implementation 'io.github.truerss:content-extractor:1.0.5'

jsoup must be present on the classpath.


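import com.github.truerss.ContentExtractor;
import com.github.truerss.ExtractResult; // assumption: same package as ContentExtractor
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Placeholder URL: replace it with the page you want to analyze.
String url = "https://example.com";
// Fetch and parse the page (throws IOException on network failure).
Document doc = Jsoup.connect(url).get();
Element body = doc.body();
// Find the selector of the element with the most content.
ExtractResult result = ContentExtractor.extract(body);
System.out.println("==========> " + result.selector);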

License: MIT