JSON API Analyzer Filter

This repository contains a specialized filter for use with Jackson's JsonParser, along with scaffolding for using it to extract textual content from JSON documents. The content to extract is specified with a simple "inclusion paths" notation.

The idea is that for content like:

{
  "_id": 124,
  "name": "Bob Burger",
  "phone": {
    "home": "555-123-4567",
    "work": "555-111-2222"
  }
}

we can extract contents by specifying inclusion paths like:

   name, phone.home

which results in an indexable text blob of:

Bob Burger 555-123-4567

and for this specific case, we could do that with:

JsonFieldExtractorFactory f = JsonFieldExtractorFactory.construct(new ObjectMapper());
JsonFieldExtractor extr = f.buildExtractor("name, phone.home");
String json = "..."; // get JSON from somewhere
String toIndex = extr.extractAsString(json).get(); // Optional.empty() if not JSON

assertThat(toIndex).isEqualTo("Bob Burger 555-123-4567 "); // note trailing space
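
To make the inclusion path semantics concrete, here is a minimal sketch that uses only the JDK. It is not the library's implementation (which filters tokens while parsing): it assumes the document has already been decoded into nested Maps, ignores arrays, and the names InclusionPathSketch, Node, parsePaths, collect and extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of inclusion paths; not the library's actual code.
class InclusionPathSketch {

    // One node per path segment; includeAll is set where a path ends.
    static final class Node {
        boolean includeAll;
        final Map<String, Node> children = new LinkedHashMap<>();
    }

    // Turn "name, phone.home" into a tree: root -> name*, root -> phone -> home*
    static Node parsePaths(String definition) {
        Node root = new Node();
        for (String path : definition.split(",")) {
            String trimmed = path.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Node node = root;
            for (String segment : trimmed.split("\\.")) {
                node = node.children.computeIfAbsent(segment, k -> new Node());
            }
            node.includeAll = true;
        }
        return root;
    }

    // Walk nested Maps (standing in for JSON objects), appending included scalars.
    static void collect(Object value, Node filter, boolean included, StringBuilder out) {
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                Node child = (filter == null) ? null : filter.children.get(String.valueOf(e.getKey()));
                boolean childIncluded = included || (child != null && child.includeAll);
                // Descend if the subtree is included, or if a longer path may match below
                if (childIncluded || child != null) {
                    collect(e.getValue(), child, childIncluded, out);
                }
            }
        } else if (included) {
            out.append(value).append(' '); // each value is followed by a space
        }
    }

    static String extract(Map<String, Object> document, String definition) {
        StringBuilder out = new StringBuilder();
        collect(document, parsePaths(definition), false, out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> phone = new LinkedHashMap<>();
        phone.put("home", "555-123-4567");
        phone.put("work", "555-111-2222");
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", 124);
        doc.put("name", "Bob Burger");
        doc.put("phone", phone);
        // Brackets make the trailing space visible
        System.out.println("[" + extract(doc, "name, phone.home") + "]");
    }
}
```

In this sketch a path that ends at an object (for example "phone") includes every scalar beneath it, and values are emitted in document order rather than path order.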

Caching

Instances of JsonFieldExtractorFactory and JsonFieldExtractor are thread-safe and can be shared between threads. They should be cached as much as possible: for the former a singleton is enough; for the latter, a size-bound cache (such as Caffeine) keyed by the field definition String is recommended. Caching avoids the work of rebuilding the token filter, which should not be particularly expensive but is not free either.
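
As a rough illustration of the size-bound cache idea, the sketch below uses a synchronized, access-ordered LinkedHashMap from the JDK as a stand-in for Caffeine. The class name ExtractorCacheSketch is hypothetical, and the value type is generic so the example runs without the library on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// JDK-only stand-in for a size-bound cache such as Caffeine (illustrative only).
class ExtractorCacheSketch<V> {
    private final Map<String, V> cache;
    private final Function<String, V> builder;

    ExtractorCacheSketch(final int maxEntries, Function<String, V> builder) {
        this.builder = builder;
        // accessOrder=true keeps the least-recently-used entry first
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    // synchronized because an access-ordered LinkedHashMap is modified even by get()
    synchronized V get(String fieldDefinition) {
        V value = cache.get(fieldDefinition);
        if (value == null) {
            value = builder.apply(fieldDefinition); // build once per definition
            cache.put(fieldDefinition, value);
        }
        return value;
    }

    synchronized int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        int[] builds = {0};
        ExtractorCacheSketch<String> cache = new ExtractorCacheSketch<>(2, def -> {
            builds[0]++;
            return "extractor(" + def + ")"; // placeholder for a real extractor
        });
        cache.get("name, phone.home");
        cache.get("name, phone.home"); // second call is served from the cache
        System.out.println("builds: " + builds[0]);
    }
}
```

With the library available, the builder would presumably be the factory's buildExtractor method, for example new ExtractorCacheSketch<JsonFieldExtractor>(1000, f::buildExtractor), where the bound of 1000 is an arbitrary choice. A dedicated cache library is likely a better fit under heavy contention, since this sketch serializes all lookups.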

Implementation

Internally, the implementation is based on Jackson's JsonParser configured with a TokenFilter constructed from the inclusion path definition. As such, read performance should be close to that of basic JSON decoding, with little extra overhead. Output is aggregated as plain text using a StringWriter; if output is needed as a ByteBuffer, additional UTF-8 encoding overhead is incurred.
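
The difference between the two output forms can be sketched with JDK classes alone. This is an assumption about the general shape of the work rather than the library's code; OutputAggregationSketch, aggregate and toUtf8 are hypothetical names.

```java
import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative only: text aggregation, then the extra UTF-8 step for ByteBuffer output.
class OutputAggregationSketch {

    // Append each extracted value followed by a space, as in the README example.
    static String aggregate(List<String> values) {
        StringWriter writer = new StringWriter();
        for (String value : values) {
            writer.write(value);
            writer.write(' ');
        }
        return writer.toString();
    }

    // The additional pass: encode the aggregated characters as UTF-8 bytes.
    static ByteBuffer toUtf8(String text) {
        return StandardCharsets.UTF_8.encode(text);
    }

    public static void main(String[] args) {
        String text = aggregate(List.of("Bob Burger", "555-123-4567"));
        ByteBuffer bytes = toUtf8(text);
        System.out.println(text.length() + " chars, " + bytes.remaining() + " bytes");
    }
}
```

For ASCII-only content the character and byte counts match; non-ASCII characters take two or more bytes each in UTF-8.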

Benchmarking

The project includes JMH-based micro-benchmarks that compare extraction performance to that of basic JSON decoding.

The sample results below were run on my dev laptop (MacBook Pro, 6-core 2.6 GHz) with JDK 17.

"Docs Api" (2 KB)

This benchmark uses an example JSON document of 2164 bytes (2.1 KB) and extracts contents as a String (for the extraction cases):

Benchmark                                 Mode  Cnt       Score       Error  Units
BenchmarkDocsApi.jsonReadAndExtractMost  thrpt    9  140044.517 ±  5051.514  ops/s
BenchmarkDocsApi.jsonReadAndExtractTiny  thrpt    9  182548.624 ± 12296.384  ops/s
BenchmarkDocsApi.jsonReadTree            thrpt    9  124324.194 ±  4962.277  ops/s
BenchmarkDocsApi.jsonScanOnly            thrpt    9  224189.990 ±  9225.558  ops/s

In this case we get approximate average throughput numbers as follows:

225,000 documents (450 MB) per second per core for basic JSON scanning (skipping through tokens, not accessing values)
182,500 documents (365 MB) per second per core when extracting small amounts (2 unrelated subtrees, 5 leaf values)
140,000 documents (280 MB) per second per core when extracting larger amounts (about half the document; dozens of leaf values)
125,000 documents (250 MB) per second per core when building (but not processing) in-memory Tree representation (access all leaf values)
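
The MB-per-second figures above are consistent with rounding the document size to 2,000 bytes and counting 1 MB as 1,000,000 bytes; that reading is an inference from the numbers, not something the benchmark output states. A small sketch of the arithmetic (ThroughputArithmeticSketch is a hypothetical name):

```java
// Illustrative arithmetic only: converts documents/second into MB/second.
class ThroughputArithmeticSketch {

    // 1 MB is taken as 1,000,000 bytes here (an assumption).
    static double megabytesPerSecond(double docsPerSecond, int bytesPerDoc) {
        return docsPerSecond * bytesPerDoc / 1_000_000.0;
    }

    public static void main(String[] args) {
        double[] rates = {225_000, 182_500, 140_000, 125_000};
        for (double rate : rates) {
            System.out.printf("%.0f docs/s: %.1f MB/s at 2000 bytes, %.1f MB/s at 2164 bytes%n",
                    rate,
                    megabytesPerSecond(rate, 2_000),
                    megabytesPerSecond(rate, 2_164));
        }
    }
}
```

Using the exact size of 2164 bytes gives somewhat higher figures, for example about 487 MB per second instead of 450 MB for the scan-only case.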
