
[DO NOT MERGE][Unity][Merge] Test merge main to unity #14416

Closed
wants to merge 29 commits

Conversation

csullivan
Contributor

No description provided.

cblmemo and others added 28 commits March 20, 2023 18:02
Introducing MemHammer

Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Fix data type and add minimal reproducible test.

Co-authored-by: Sunghyun Park <spark@octoml.ai>
Goes to the latest revision in the same major version (22.12.0). 

23.1.0 is released but it involves some style changes so we would need to reformat the entire codebase. 

I need 22.12.0 to properly process some files touched in this PR:

#14167

where black cannot parse the files with the current version, but can with the updated version.
This patch just exposes an existing analysis API, IsOutputBlock, to
Python. Since many schedule primitives have conditions on output blocks,
this API is useful while scheduling.
* [BugFix] Support rewrite_once when the number of callbacks > 1

* callbacks_map -> done, swapping false and true

---------

Co-authored-by: Bin Li <binli1@amd.com>
Short desc
This change allows me to compile and tune models for Hexagon directly from my macOS laptop without fully switching to a Linux environment.

List of changes

Replace the local linker call with a call into a Docker container that has the Hexagon SDK. The linker is the only SDK tool used by TVM during compilation.
Enhanced search for ADB: not only in PATH, but also in ANDROID_HOME, ANDROID_SDK_ROOT, and the default SDK installation directory. macOS does not make it easy to change the default PATH environment variable for UI applications launched from the dock, so adb is not available to IDEs by default.
Motivation
Some engineers would like to keep working in a comfortable macOS environment even when they have to work with Hexagon devices. At the moment there is no official Hexagon SDK for macOS. The alternatives are: fully switch to a remote Linux machine, use a local Linux virtual machine, or port the required Hexagon SDK functionality to macOS. The last option is the most preferable to me.

Signed-off-by: Alexander Peskov <peskovnn@gmail.com>
# Motivation
Currently, we lack a schedule primitive to change the data type of an allocated buffer (e.g. one created via `cache_read`/`cache_write`), and thus we cannot perform type conversion while loading data from global to shared memory.

This PR adds a new schedule primitive `set_dtype` that follows the interface of `set_scope` and allows users to customize the allocated buffers' data type.

# Example
Before running `set_dtype`:
```python
@T.prim_func
def before_set_dtype(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float32")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0
```
then we perform the `set_dtype` schedule:
```python
sch = tir.Schedule(before_set_dtype)
sch.set_dtype("B", buffer_index=0, dtype="float16")
print(sch.mod["main"].script())
```
we get transformed code:
```python
@T.prim_func
def after_set_dtype(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float16")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = T.cast(A[vi, vj] * 2.0, "float16")
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = T.cast(B[vi, vj], "float32") + 1.0
```
where data type conversions are inserted automatically.

# Other Usage
Using the combination of `cache_read` + `set_dtype` can help us load data from the memory hierarchy while converting data to the desired type.
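As an illustration, the before/after behavior of this transformation can be mimicked in plain NumPy (a sketch only, not TVM API; the function names here are made up):

```python
import numpy as np

def compute_f32(A):
    # All-float32 pipeline, as in before_set_dtype.
    B = A * np.float32(2.0)
    return B + np.float32(1.0)

def compute_f16_intermediate(A):
    # Intermediate buffer B stored as float16, as in after_set_dtype;
    # casts sit at the producer and consumer, mirroring the inserted T.cast calls.
    B = (A * np.float32(2.0)).astype(np.float16)
    return B.astype(np.float32) + np.float32(1.0)

A = np.random.rand(128, 128).astype(np.float32)
# Results agree up to float16 rounding error on the intermediate buffer.
assert np.allclose(compute_f32(A), compute_f16_intermediate(A), rtol=1e-2, atol=1e-2)
```

The point of the sketch is that only the storage dtype of `B` changes; the schedule primitive inserts the conversions so the observable result stays (approximately) the same.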
This PR adds test harness for mlperftiny submission using microTVM.
* [TIR][Utility] More flexible tir::Substitute arguments

Previously, the `tir::Substitute` method had overloads that supported
a few ways of providing the variable map (e.g. `const
Map<Var,PrimExpr>&`, `std::unordered_map<const VarNode*, PrimExpr>&`,
etc.), delegating out to the overload that uses
`std::function<Optional<PrimExpr>(const Var&)>`.  However, the types
supported for the variable map depended on the type being
substituted (e.g. only supporting `const Map<Var,PrimExpr>&` with
substituting into a `Array<Range>`), which would be unexpected to new
developers.

This PR makes the `tir::Substitute` utility more uniform in the
arguments that it accepts.

* For any type that is supported by `tir::Substitute`, `Array<T>` is
  also supported.

* Any variable mapping type can be used with any substitution type.
  All variable mapping types are normalized to
  `std::function<Optional<PrimExpr>(const Var&)>`.

* For `Map` and `std::unordered_map` arguments, the value type may be
  any subclass of `PrimExpr` (e.g. `Map<Var, Var>` instead of
  `Map<Var, PrimExpr>`).  Previously, the calling scope needed to
  either construct a temporary map that returned `PrimExpr`, or to use
  a broader value type in the map than otherwise required.

The initial and primary goal was to allow a `Map<Var, Var>` to be used
as an argument to `tir::Substitute`, rather than a `Map<Var,
PrimExpr>`; making the utility more general was more
straightforward than adding multiple overloads specifically for
`Map<Var, Var>`.

* fix linting errors

* Fix copy/paste error in docstring
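The normalization described above can be pictured with a small Python sketch over a toy expression type (hypothetical names; the real utility is C++ and operates on TIR). Every accepted variable-map form is converted to a single lookup-function form before substitution, mirroring the normalization to `std::function<Optional<PrimExpr>(const Var&)>`:

```python
from typing import Callable, Optional, Union

# Toy expression: either a variable name or a tuple ('add', lhs, rhs).
Expr = Union[str, tuple]
VarMap = Union[dict, Callable[[str], Optional[Expr]]]

def as_lookup(vmap: VarMap) -> Callable[[str], Optional[Expr]]:
    """Normalize any accepted map type to the single function form."""
    if callable(vmap):
        return vmap
    return lambda var: vmap.get(var)

def substitute(expr: Expr, vmap: VarMap) -> Expr:
    """Replace variables for which the lookup returns a value."""
    lookup = as_lookup(vmap)
    if isinstance(expr, str):
        replacement = lookup(expr)
        return expr if replacement is None else replacement
    op, lhs, rhs = expr
    return (op, substitute(lhs, vmap), substitute(rhs, vmap))

# Both a dict and a function can drive the same substitution.
e = ("add", "x", ("add", "y", "x"))
assert substitute(e, {"x": "a"}) == ("add", "a", ("add", "y", "a"))
assert substitute(e, lambda v: "b" if v == "y" else None) == ("add", "x", ("add", "b", "x"))
```

Because every caller path funnels through one lookup form, any map whose values are a subtype of the expression type works without constructing a temporary widened map, which is the uniformity the commit describes.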
…14376)

* [Hexagon] Allow scalar tensors to have null shape during allocation.

* Add unit test for new case (scalar and null shape)
Previously, while TVMScript introduced a new scope for other
contexts (e.g. `for`, `while`, `with`, etc.), the `if` and `else`
blocks did not introduce a new scope.  This caused erroneous parsing
errors if the `if` and `else` blocks each contained a variable with
the same name.  Adding a `self.var_table.with_frame()` context resolves
this issue.
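The fix can be pictured with a toy scoped symbol table (a simplified sketch loosely modeled on the parser's `var_table`, not the actual TVMScript implementation): each `with_frame()` pushes a scope, so a name defined in the `if` branch does not leak into the `else` branch.

```python
from contextlib import contextmanager

class VarTable:
    """Toy scoped symbol table: a stack of name->value frames."""
    def __init__(self):
        self.frames = [{}]

    @contextmanager
    def with_frame(self):
        # Push a fresh scope on entry, pop it on exit.
        self.frames.append({})
        try:
            yield
        finally:
            self.frames.pop()

    def define(self, name, value):
        self.frames[-1][name] = value

    def lookup(self, name):
        for frame in reversed(self.frames):
            if name in frame:
                return frame[name]
        return None

vt = VarTable()
with vt.with_frame():      # if-branch scope
    vt.define("tmp", "if_value")
    assert vt.lookup("tmp") == "if_value"
with vt.with_frame():      # else-branch scope: no leakage from the if branch
    assert vt.lookup("tmp") is None
    vt.define("tmp", "else_value")
    assert vt.lookup("tmp") == "else_value"
```

Without the per-branch frame, the second definition of `tmp` would collide with the first, which is exactly the spurious parse error the commit fixes.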
* initial basis

* Generated all the tile sizes

* is this all you need?

* linting

lint

move schedule rule to own file

lint p2

layout transform fixings

* forgot to forward arg

* fix tests

* reduce search space

* lint

* schedule rule documentation

* add a note

* fix wording

* handle implicit reshape case v1

* clean up comments

* address comments

* testing harness

* more progress on testing harness

* fix case where shape changes in mod

* inline after schedule generation to help analysis

* proper autoinlining INTO layout transform block to maintain extents

* clean up

* reindex for introducing cache block

* reorganize testing

* more cleanup

* remove forced false

* use the proper dispatcher

* update test, make default schedule rule None

* linting

* fix mypy errors

* clean up

* manual test cases

* manual tests

* add comment, fix improper implicit reshape handling

* fix

* remove extra comments

* more lints

* refactor

* remove extraneous check

* lint again :/

* remove unneeded newline

* remove leading spaces
`Load` and `Store` were deprecated in [RFC0039](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0039-buffer-physical-layout.md). It has been several releases since the deprecation and they are no longer used in the codebase, so now it is time to remove them.
This PR fixes the TVMScript printer to print `T.bool(True)` or `T.bool(False)` for `IntImm(dtype=bool)`. So the parser is able to parse it back.
Updated the error message to state which PrimFunc has a malformed
pipeline annotation, the blocks found in that primfunc, and the
pipeline annotation found.
Previously, the legalization was only handled by propagating the dtype
of the indices to the transformed indices.  As a result, output
indices whose value did not depend on the input index would be left
with the incorrect dtype.
If a variable is undefined within a PrimFunc, include the name of the
PrimFunc in the error message.  Otherwise, in an IRModule with multiple
functions, it may be unclear which PrimFunc is malformed.
…s as attr (#14379)

* not estimating the flops when there is a default estimated flops as attr

* add unittests

* lint fix

* make unittest simpler
Enable USE_MICRO for macOS and Windows CI builds to ensure that building the code doesn't break for those platforms.
* [HEX] Enhanced vector lanes for some intrinsics

* fix pylint

Signed-off-by: Alexander Peskov <peskovnn@gmail.com>

* fix lint 2

Signed-off-by: Alexander Peskov <peskovnn@gmail.com>

* Fix typo

Signed-off-by: Alexander Peskov <peskovnn@gmail.com>

---------

Signed-off-by: Alexander Peskov <peskovnn@gmail.com>
Upgrading ethos-u-vela pip version to 3.7.0.
This PR adds unittest for schedule primitive read_at and write_at.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
…ed to Ethos-U (#13212)

Added an option to tvmc and Ethos-U for printing, to the console or to a file, which operators from the initial graph are offloaded to Ethos-U and which aren't. It produces line-by-line output of the initial model IR, indicating which operations are ported to Ethos-U.

The compiler option "--target-ethos-u-dump_npu_functions_coverage" has been replaced by the more generic "--dump-offloads" with the same meaning.


## Usage
```
# output to console:
tvmc compile --target=ethos-u,cmsis-nn,c \
    --dump-offloads=- \
    ........

# output to file:
tvmc compile --target=ethos-u,cmsis-nn,c \
    --dump-offloads=<file path> \
    ........
```

## Example output:


...
Total number of operators and distribution by targets
Total: 211
target1: 198
target2: 10
generic: 3

'target1        <-     target2.qnn_conv2d'
'target1        <-          %0 = qnn.conv2d(%tfl.quantize, %v_param_1, ...'
'target1        <-          %1 = nn.bias_add(%0, %v_param_2, axis=3);'
'target1        <-          %2 = qnn.requantize(%1, meta[relay.Constant]...'
'target2        <-     target2.reshape'
'target2        <-          %3 = reshape(%2, newshape=[1, 1001]);'
'generic        <-     %4 = nn.pad(%3, -128f, pad_width=[[0, 0], [1, 1]...'
...
This PR refactors BF16Legalize to enable more f32 computations.
We also split the BF16Legalize into two steps.

- BF16ComputeLegalize changes all computation to f32 while keeping
  the external BF16 storages.
- BF16StorageLegalize changes all storage to u16.

Now BF16 kernels accept tvm.nd.array inputs that are created with the bfloat16 type.
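The storage/compute split can be illustrated with a NumPy sketch (illustration only; the actual passes operate on TIR, and this sketch truncates rather than reproducing whatever rounding the pass implements): bfloat16 values are stored as uint16 holding the upper 16 bits of a float32, and computation happens after widening back to float32.

```python
import numpy as np

def f32_to_bf16_storage(x: np.ndarray) -> np.ndarray:
    """Truncate float32 to bfloat16, stored as uint16 (upper 16 bits)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def bf16_storage_to_f32(u: np.ndarray) -> np.ndarray:
    """Widen the uint16 storage back to float32 for computation."""
    return (u.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159, -2.5], dtype=np.float32)
stored = f32_to_bf16_storage(x)          # u16 storage, as after BF16StorageLegalize
computed = bf16_storage_to_f32(stored)   # f32 compute, as after BF16ComputeLegalize
# bfloat16 keeps ~8 bits of mantissa, so values agree to ~1e-2 relative error.
assert np.allclose(x, computed, rtol=1e-2)
```

This is why the two passes compose cleanly: compute legalization only ever sees float32, while storage legalization only ever sees the uint16 bit pattern.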
…ise op (#14408)

[MetaSchedule][Hexagon] Improve vectorization for standalone elementwise ops

Motivation:
It was found that for standalone elementwise operations (add, sub, etc.)
MetaScheduler generates code with poor performance due to a lack of vector
code for some input tensor shapes. The current implementation is unable to
vectorize if the innermost loop's extent is not a multiple of the vector
length.

What was done:
Core changes: check the current loop nest; if all loops are "simple",
i.e. loops without annotations, bindings, or reduce axes, then do the
following:
 1) Fuse all loops into a single one.
 2) Split this new loop into 2 parts: inner and outer, where the split
    factor for the inner loop is equal to the 'max_vectorize_extent'
    MetaScheduler parameter.
 3) Parallelize the outer loop and vectorize the inner loop.
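The fuse/split arithmetic in steps 1-2 can be modeled in plain Python (a sketch only, not the MetaSchedule API; `max_vectorize_extent=128` below is an assumed value for illustration):

```python
from math import prod

def fuse_and_split(extents, max_vectorize_extent):
    """Model steps 1-2: fuse all simple loops into one loop, then split it
    so the inner loop has extent max_vectorize_extent (ceil-divided outer)."""
    fused = prod(extents)                  # step 1: fused loop extent
    inner = max_vectorize_extent           # step 2: fixed inner split factor
    outer = (fused + inner - 1) // inner   # outer extent covers the remainder
    return outer, inner

# Using the measured shape (1, 8, 56, 56, 32): fused extent is 802816.
outer, inner = fuse_and_split((1, 8, 56, 56, 32), 128)
assert inner == 128
assert outer * inner >= 1 * 8 * 56 * 56 * 32
# In step 3 the outer loop would be parallelized and the inner vectorized.
```

Because the inner extent is always exactly `max_vectorize_extent` regardless of the original shape, the vectorizer no longer depends on the innermost loop extent being a multiple of the vector length.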

Performance measurement:
Measurement was done on a Qualcomm Snapdragon 888. As expected, cases 1
and 2 got a significant performance boost; 3 and 4 are unchanged.

| N | op      | Dtype | Shape            | Before fix, ms | After fix, ms | Speedup |
|---|---------|-------|------------------|----------------|---------------|---------|
| 1 | add     | uint8 | 1, 8, 56, 56, 32 | 1.264          | 0.167         | 7.5x    |
| 2 | qnn.add | uint8 | 1, 8, 56, 56, 32 | 2.213          | 0.336         | 6.6x    |
| 3 | add     | int32 | 1, 8, 56, 56, 32 | 0.161          | 0.150         | 1.07x   |
| 4 | seq*    | uint8 | 1, 64, 56, 56    | 2.634          | 2.679         | 0.98x   |

seq* - test of the ops sequence: qnn.conv2d + bias_add + qnn.requantize,
       weights shape = [256, 64, 1, 1]
@tvm-bot
Collaborator

tvm-bot commented Mar 28, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

Previously, the error message from `FindPrimFuncFrame`,
`FindBlockFrame`, and `FindIfFrame` stated that they could not find
the requested frame when the top-most frame did not match the
requested type.  This error message could be misinterpreted by a user
as stating that the frame didn't exist at all.

This commit updates the error message to distinguish between the case
of a missing frame (e.g. `T.reads()` occurring outside of any
`T.block()` frame) and a frame not being top-most (e.g. `T.reads()`
occurring inside a `T.block()`, but inside an `if` conditional instead
of the top of the block).
@csullivan csullivan closed this Mar 28, 2023