Replies: 10 comments 19 replies
-
The previous examples assume a flat combination of all parameters, but there is likely a need for a layered view of parameters, as in multi-layer nesting. Example sketch:
This is different from saying "all combinations of (a,b,c) and (1,2,3)", since that simple form says nothing about the structure of traversal, ordering, expressions in the native workload, and so on.
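As a purely hypothetical illustration of such layering (these names and values are illustrative, not from the discussion above), a nested parameter space might look like:

```yaml
# hypothetical sketch: the outer layer selects a dataset, the inner layer
# sweeps k, so traversal structure and ordering are explicit rather than
# an implicit flat cross-product
dataset:
  glove-25-angular:
    k: [10, 50, 100]
  textembedding-gecko:
    k: [10, 100]
```

Here the inner values can differ per outer value, which a flat combination of (a,b,c) and (1,2,3) cannot express.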
-
Example workload template for parameter expansion:
With this set of template parameters, a user could generate 12 distinct scenarios, each with two commands to run an activity, where each scenario carries one of the 12 possible combinations of the supplied values for dataset and k. This demonstrates the basic case of applying parameter combinations over a single-layer rendering of the templates. The equivalent long-form rendering would look like this:
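For instance, assuming (hypothetically) four dataset values and three k values, the 12 combinations would expand along these lines:

```yaml
# hypothetical parameter values: 4 datasets x 3 k values = 12 combinations
dataset: [glove-25-angular, glove-50-angular, glove-100-angular, glove-200-angular]
k: [10, 50, 100]
# each rendered scenario then fixes one (dataset, k) pair, e.g.:
#   scenario_01: dataset=glove-25-angular  k=10
#   scenario_02: dataset=glove-25-angular  k=50
#   ...
#   scenario_12: dataset=glove-200-angular k=100
```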
-
I think that it is important to be able to present the proper layers of the tests.
-
I can see where this can be implemented.
-
Let's look at the following:
Would it make sense to use ITERATION instead of TEMPLATE?
-
I think we've settled on this format for the yaml:
-
Working on implementation in #1657
-
How about this, which reads similarly to the original form?
-
Each scenario step does the following, allowing key=value pairs set on the step to supersede those set from the command line:

```java
// consume each of the parameters from the steps to produce a composited command
// order is primarily based on the step template, then on user-provided parameters
for (CmdArg cmdarg : parsedStep.values()) {
    // allow user-provided parameter values to override those in the template,
    // if the assignment operator used in the template allows for it
    if (usersCopy.containsKey(cmdarg.getName())) {
        cmdarg = cmdarg.override(usersCopy.remove(cmdarg.getName()));
    }
    buildingCmd.put(cmdarg.getName(), cmdarg.toString());
}
```

Is this the proper order of precedence?
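A self-contained sketch of that precedence rule, using plain maps and hypothetical names rather than the actual CmdArg/NB5 types: iteration order comes from the step template, while user-provided values win on conflicts, and any leftover user parameters are appended afterward.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone model of the composition loop above; not NB5 code.
public class PrecedenceSketch {
    static Map<String, String> compose(Map<String, String> templateParams,
                                       Map<String, String> userParams) {
        Map<String, String> userCopy = new LinkedHashMap<>(userParams);
        Map<String, String> composed = new LinkedHashMap<>();
        // ordering comes from the template; values may come from the user
        for (Map.Entry<String, String> e : templateParams.entrySet()) {
            String value = userCopy.containsKey(e.getKey())
                ? userCopy.remove(e.getKey())   // user override wins
                : e.getValue();                 // otherwise keep template value
            composed.put(e.getKey(), value);
        }
        // user params not named in the template are appended at the end
        composed.putAll(userCopy);
        return composed;
    }

    public static void main(String[] args) {
        Map<String, String> template = new LinkedHashMap<>();
        template.put("threads", "100");
        template.put("driver", "pinecone");
        Map<String, String> user = new LinkedHashMap<>();
        user.put("threads", "32");
        System.out.println(compose(template, user));
        // prints {threads=32, driver=pinecone}
    }
}
```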
-
In looking at how we are actually using dataset in existing workloads, it turns out that the dataset choice should automatically set other key-value pairs. For example:

```yaml
textembedding-gecko:
  dimensions: 768
  trainsize: 100000
  testsize: 10000
  sfunction: cosine
glove-25-angular:
  dimensions: 25
  trainsize: 1183514
  testsize: 10000
  indexname: glove25
  sfunction: cosine
```

I would like to extend NB5 as follows. Take a look at:

```yaml
min_version: "5.17.3"

description: |
  A workload which reads ann-benchmarks vector data from HDF5 file format.

scenarios:
  pinecone_vectors:
    rampup: run tags==block:vector_writes cycles=TEMPLATE(trainsize) threads=TEMPLATE(rampup_threads,100) driver=pinecone apiKey=TEMPLATE(apikey) projectName=f88a480 environment=eu-west4-gcp
    search_and_index: run tags='block:vector_reads' threads=TEMPLATE(search_threads,100) driver=pinecone apiKey=TEMPLATE(apikey) projectName=f88a480 environment=eu-west4-gcp stride=100 striderate=13.5

bindings:
  rw_key: ToString()
  train_vector: HdfFileToFloatList("testdata/TEMPLATE(dataset).hdf5", "/train");
  test_vector: HdfFileToFloatList("testdata/TEMPLATE(dataset).hdf5", "/test");
  validation_set: HdfFileToIntArray("testdata/TEMPLATE(dataset).hdf5", "/neighbors");
  random_vector: HashedFloatVectors(TEMPLATE(dimensions));

blocks:
  vector_writes:
    ops:
      rampup-op:
        upsert: TEMPLATE(index_name)
        namespace: TEMPLATE(namespace)
        upsert_vectors:
          - id: "{rw_key}"
            values: "{train_vector}"
  vector_invalidate_writes:
    ops:
      rampup-op:
        upsert: TEMPLATE(index_name)
        namespace: TEMPLATE(namespace)
        upsert_vectors:
          - id: "{rw_key}"
            values: "{random_vector}"
  vector_reads:
    ops:
      read-op:
        query: TEMPLATE(index_name)
        namespace: TEMPLATE(namespace)
        vector: "{test_vector}"
        top_k: TEMPLATE(select_limit,100)
        include_values: true
        include_metadata: true
        verifier-init: |
          relevancy=scriptingmetrics.newRelevancyMeasures(_parsed_op,"group","relevancy");
          for (int k in List.of(100)) {
            relevancy.addFunction(io.nosqlbench.engine.extensions.computefunctions.RelevancyFunctions.recall("recall",k));
            relevancy.addFunction(io.nosqlbench.engine.extensions.computefunctions.RelevancyFunctions.precision("precision",k));
            relevancy.addFunction(io.nosqlbench.engine.extensions.computefunctions.RelevancyFunctions.F1("F1",k));
            relevancy.addFunction(io.nosqlbench.engine.extensions.computefunctions.RelevancyFunctions.reciprocal_rank("RR",k));
            relevancy.addFunction(io.nosqlbench.engine.extensions.computefunctions.RelevancyFunctions.average_precision("AP",k));
          }
        verifier: |
          // driver-specific function
          actual_indices=pinecone_utils.responseIdsToIntArray(result)
          // driver-agnostic function
          relevancy.accept({validation_set},actual_indices);
          return true;
```

The scenario block could change like this (I'm skipping the foreach entries):

```yaml
scenarios:
  pinecone_vectors:
    rampup: run tags==block:vector_writes cycles={trainsize} threads=TEMPLATE(rampup_threads,100) driver=pinecone apiKey=TEMPLATE(apikey) projectName=f88a480 environment=eu-west4-gcp
    search_and_index: run tags='block:vector_reads' threads=TEMPLATE(search_threads,100) driver=pinecone apiKey=TEMPLATE(apikey) projectName=f88a480 environment=eu-west4-gcp stride=100 striderate=13.5
```

Cycles (trainsize) is a function of the chosen dataset. For bindings it's even more live data:

```yaml
bindings:
  rw_key: ToString()
  train_vector: HdfFileToFloatList("testdata/{dataset}.hdf5", "/train");
  test_vector: HdfFileToFloatList("testdata/{dataset}.hdf5", "/test");
  validation_set: HdfFileToIntArray("testdata/{dataset}.hdf5", "/neighbors");
  random_vector: HashedFloatVectors({dimensions});
```

Here's a block of ops with changes:

```yaml
vector_writes:
  ops:
    rampup-op:
      upsert: {index_name}
      namespace: TEMPLATE(namespace)
      upsert_vectors:
        - id: "{rw_key}"
          values: "{train_vector}"
```

Should we somehow make these into bindings for their use in ops?
-
Goals
We need to be able to iterate over a set of parameter combinations so that we can test all meaningful cases around a given set of questions.
Example question: Which tuning parameters work best across our canonical datasets to achieve a minimum recall of 90% while maintaining high performance on a standard deployment?
Practical result:
Motivating Example
Suppose you want to understand relevancy measures over a set of workloads, K, and system loads. Here are some parameters that you might use:
This yields a search space of 240 distinct combinations, which is an impractically large number of tests to run for a single battery. Ideally, we do not run all of these as full-length tests. It becomes feasible to run at this grain of detail if each combination can be tested quickly and with cohesion between the parameters and the results for each.
Given that large search spaces like this yield higher-value information when you measure the edges of the space first, testing the bounds of each parameter first should be the default. This simply uses the endpoints of the values provided while ignoring any interior values. Thus, the search space becomes 2^P, where P is the number of parameters.
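A sketch of that bounds-first enumeration, with illustrative names (not NB5 code): keep only the first and last value of each parameter's list, then take the cross-product, giving 2^P corner combinations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "test the edges of the space first"; not NB5 code.
public class CornerSketch {
    static List<Map<String, Object>> corners(Map<String, List<Object>> params) {
        List<Map<String, Object>> out = new ArrayList<>();
        out.add(new LinkedHashMap<>());
        for (Map.Entry<String, List<Object>> e : params.entrySet()) {
            List<Object> values = e.getValue();
            // only the bounds of each parameter, ignoring interior values
            List<Object> bounds = values.size() > 1
                ? List.of(values.get(0), values.get(values.size() - 1))
                : List.of(values.get(0));
            // extend every partial combination with each bound value
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> partial : out) {
                for (Object v : bounds) {
                    Map<String, Object> copy = new LinkedHashMap<>(partial);
                    copy.put(e.getKey(), v);
                    next.add(copy);
                }
            }
            out = next;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> params = new LinkedHashMap<>();
        params.put("k", List.of(10, 50, 100));
        params.put("threads", List.of(1, 8, 64));
        // 2 parameters -> 2^2 = 4 corner combinations
        System.out.println(corners(params));
    }
}
```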
Possible Approaches
template variable enhancements
We presently support TEMPLATE(name,default) style variables, which can contain a single value. If we were able to provide values for these template variables as inputs, perhaps in a list form like name=[glove25,glove50,...], then it would be simple to build an iterable form of the workload over all combinations.
structural templates
Introduce an in-YAML (and in-JSON) templating scheme, based on examples from other projects, to auto-render combinations from procedural template language definitions.
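As a hypothetical sketch of the first approach (this list syntax is not implemented; the names and invocation are illustrative only):

```yaml
# workload keeps single-valued template variables as today:
#   dataset: TEMPLATE(dataset,glove25)
#   k: TEMPLATE(k,100)
# a hypothetical invocation could then supply lists to iterate over:
#   nb5 my_workload dataset=[glove25,glove50] k=[10,100]
# which would render and run the 2x2 = 4 combinations:
#   dataset=glove25 k=10
#   dataset=glove25 k=100
#   dataset=glove50 k=10
#   dataset=glove50 k=100
```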
jsonnet
Lean on Jsonnet more to render these blocks.