tatu-at-datastax/json-api-analyzer-filter


JSON API Analyzer Filter

This repository contains a specialized filter for use with Jackson's JsonParser, along with scaffolding for using it to extract textual content from JSON documents, selected by a simple "inclusion paths" notation.

The idea is that for content like:

{
  "_id": 124,
  "name": "Bob Burger",
  "phone": {
    "home": "555-123-4567",
    "work": "555-111-2222"
  }
}

we can extract contents by specifying inclusion paths like:

   name, phone.home

which results in an indexable text blob of:

Bob Burger 555-123-4567

For this specific case, we could do that with:

JsonFieldExtractorFactory f = JsonFieldExtractorFactory.construct(new ObjectMapper());
JsonFieldExtractor extr = f.buildExtractor("name, phone.home");
String json = "..."; // get JSON from somewhere
String toIndex = extr.extractAsString(json).get(); // Optional.empty() if not JSON

assertThat(toIndex).isEqualTo("Bob Burger 555-123-4567 "); // note trailing space
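
To make the inclusion path notation more concrete, here is a minimal sketch of how a definition such as "name, phone.home" could be broken into per-path segment lists. This is not the parser used by this repo: the class name InclusionPathSketch and its parse method are hypothetical, and edge cases such as escaping or empty segments are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of this repo: splits an inclusion path
// definition such as "name, phone.home" into lists of path segments.
class InclusionPathSketch {
    static List<List<String>> parse(String definition) {
        List<List<String>> paths = new ArrayList<>();
        for (String rawPath : definition.split(",")) {
            String path = rawPath.trim();
            if (path.isEmpty()) {
                continue; // tolerate stray commas and extra whitespace
            }
            // "phone.home" becomes ["phone", "home"]
            paths.add(Arrays.asList(path.split("\\.")));
        }
        return paths;
    }
}
```

Under these assumptions, parse("name, phone.home") yields two paths: one with the single segment name, and one with the segments phone and home.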

Caching

Instances of JsonFieldExtractorFactory and JsonFieldExtractor are thread-safe and can be shared between threads. They should be cached as much as possible: for the former, a singleton is enough; for the latter, a size-bound cache (such as Caffeine) keyed by the field definition String is recommended. This avoids repeating the work of building the token filter, which should not be particularly expensive but is not free either.
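
As a rough illustration of a size-bound cache keyed by the definition String, the sketch below uses a synchronized, access-ordered LinkedHashMap from the JDK as a stand-in for Caffeine, so that it stays self-contained. The class name BoundedCache is hypothetical, and the builder function stands in for a call such as f::buildExtractor.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a size-bound cache such as Caffeine:
// a synchronized, access-ordered LinkedHashMap that evicts the least
// recently used entry once maxSize is exceeded.
class BoundedCache<V> {
    private final Map<String, V> entries;
    private final Function<String, V> builder;

    BoundedCache(int maxSize, Function<String, V> builder) {
        this.builder = builder;
        this.entries = Collections.synchronizedMap(
            new LinkedHashMap<String, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                    return size() > maxSize;
                }
            });
    }

    // Returns the cached value for the definition, building it on a miss
    V get(String definition) {
        return entries.computeIfAbsent(definition, builder);
    }

    int size() {
        return entries.size();
    }
}
```

With the classes from this repo, usage would presumably look like new BoundedCache<JsonFieldExtractor>(1000, f::buildExtractor), where 1000 is an arbitrary example bound. A dedicated cache library is likely the better choice under heavy concurrency, since this sketch serializes all access on one lock.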

Implementation

Internally, the implementation is based on Jackson's JsonParser configured with a token filter constructed from the inclusion path definition. As such, read performance should be close to that of basic JSON decoding, with little extra overhead. Output is aggregated as plain text using a StringWriter; if output is needed as a ByteBuffer, additional UTF-8 encoding overhead is incurred.
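
The output side of this can be sketched with JDK classes alone. The example below is an assumption about the general shape, not the repo's code: the class OutputAggregationSketch and its methods are hypothetical, and the single-space separator after each value is inferred from the trailing space in the usage example above.

```java
import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch, NOT the repo's implementation
class OutputAggregationSketch {
    // Appends each extracted value followed by a single space
    // (assumed separator; this leaves a trailing space in the result)
    static String aggregate(List<String> values) {
        StringWriter sw = new StringWriter();
        for (String value : values) {
            sw.write(value);
            sw.write(' ');
        }
        return sw.toString();
    }

    // The extra step needed for ByteBuffer output: encode the text as UTF-8
    static ByteBuffer toUtf8(String text) {
        return StandardCharsets.UTF_8.encode(text);
    }
}
```

The second method is where the additional cost comes from: the aggregated text has to be traversed once more to produce the encoded bytes.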

Benchmarking

The project includes JMH-based micro-benchmarks that compare the performance of extraction to that of basic JSON decoding.

The sample results below were run on my dev laptop (MacBook Pro, 6-core 2.6 GHz) with JDK 17.

"Docs Api" (2 KB)

This benchmark uses an example JSON document of 2164 bytes (2.1 KB) and extracts contents as a String (for the extraction cases):

Benchmark                                 Mode  Cnt       Score       Error  Units
BenchmarkDocsApi.jsonReadAndExtractMost  thrpt    9  140044.517 ±  5051.514  ops/s
BenchmarkDocsApi.jsonReadAndExtractTiny  thrpt    9  182548.624 ± 12296.384  ops/s
BenchmarkDocsApi.jsonReadTree            thrpt    9  124324.194 ±  4962.277  ops/s
BenchmarkDocsApi.jsonScanOnly            thrpt    9  224189.990 ±  9225.558  ops/s

In this case, the rounded average throughput numbers are as follows:

  • 225,000 documents (450 MB) per second per core for basic JSON scanning (skipping through tokens, not accessing values)
  • 182,500 documents (365 MB) per second per core when extracting small amounts (2 unrelated subtrees, 5 leaf values)
  • 140,000 documents (280 MB) per second per core when extracting larger amounts (about half the document; dozens of leaf values)
  • 125,000 documents (250 MB) per second per core when building (but not processing) in-memory Tree representation (access all leaf values)
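
The MB figures in the list above are consistent with multiplying the rounded documents-per-second value by a document size rounded to 2,000 bytes. A small sketch of that arithmetic follows; the class name ThroughputSketch is hypothetical, and 1 MB is taken as 1,000,000 bytes.

```java
// Hypothetical helper showing how the MB/s figures can be derived
class ThroughputSketch {
    // Converts documents per second to megabytes per second,
    // with 1 MB taken as 1,000,000 bytes
    static double megabytesPerSecond(double docsPerSecond, int bytesPerDoc) {
        return docsPerSecond * bytesPerDoc / 1_000_000.0;
    }
}
```

For example, megabytesPerSecond(225_000, 2_000) gives 450.0, matching the first bullet. Using the exact document size of 2,164 bytes instead would give figures about 8% higher.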
