MPS cherry picks for 1.12.1 (#81976)

* MPS: Fixes (#78930)

  Cast integer to float in UnaryOps
  Add tensor dtype in key generation
  Enable FP16 scalars and use placeholder for alpha tensor in add/sum ops

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #78930
  Approved by: https://github.com/albanD

* MPS: Binary cast fix by proper type promotion and remove spurious copy warning (#79185)

  Fixes #78019, #78020
  Fixes #79185
  Pull Request resolved: #79185
  Approved by: https://github.com/albanD, https://github.com/razarmehr

* MPS: add exponential op (#79188)

  Add exponential distribution

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #79188
  Approved by: https://github.com/razarmehr, https://github.com/albanD

* [MPS] Delete unused vars from OperationUtils.mm

  Pull Request resolved: #79514
  Approved by: https://github.com/kulinseth, https://github.com/albanD

* [MPS] Fix getDefaultGenerator and copy_kernel_mps

  Returning reference to stack memory is really bad

  Pull Request resolved: #79515
  Approved by: https://github.com/albanD

* [MPS][BE] Do not use `new/delete[]` in `chainViewOperation`

  `std::array` will do just fine

  Pull Request resolved: #79516
  Approved by: https://github.com/albanD

* [MPS] Support stride of stride

  Fixes #79181
  Pull Request resolved: #79521
  Approved by: https://github.com/kulinseth

* MPS: TopK raise an error if K>16 (#79677)

  * Error out in TopK when k>16.
  * Add a test case too.

  Fixes #78915
  Pull Request resolved: #79677
  Approved by: https://github.com/albanD

* [MPS]: Add fix for squeezed input axes handling in BCE loss (#79676)

  Fixes #79527
  Pull Request resolved: #79676
  Approved by: https://github.com/razarmehr, https://github.com/albanD

* MPS: Add amax and amin Ops with tests (#79682)

  * Add amax and amin with tests

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #79682
  Approved by: https://github.com/albanD

* [MPS] Fix torch.uint8 support (#80049)

  `ScalarType.Byte` should be cast to `MPSDataTypeUInt8`
  And support for `torch.int8` as well as test those conversions in `TestMPS.test_to`

  Fixes #80006
  Pull Request resolved: #80049
  Approved by: https://github.com/albanD

* [MPS] Fix binary ops between int32 tensor with int64 scalar (#80220)

  For some reason, tensor *op* scalar does not follow the normal binary promotion rules
  So cast output tensor to expected type if needed
  It seems that one should have casted input tensors to expected output tensor type, but it does not really work for boolean binary ops, so...
  Add output tensor type/shape to cached graph key
  Extend `TestMPS.test_add_scalars` to test for this regression

  Fixes #79835
  Pull Request resolved: #80220
  Approved by: https://github.com/albanD

* [MPS] Add equal operator (#80195)

  Which, in essence, is a composite of `eq`->`all`->`item`
  `native/mps/operators/Equal.cpp` is an almost verbatim copy of `native/cuda/Equal.cpp`
  Fix codegen by generating MPSFunctions headers

  Pull Request resolved: #80195
  Approved by: https://github.com/albanD

* [MPS] add `aten::normal.Tensor_float` `aten::normal.float_Tensor` `aten::normal.Tensor_Tensor` (#80297)

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #80297
  Approved by: https://github.com/albanD, https://github.com/kulinseth

* [MPS] Add flip (#80214)

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #80214
  Approved by: https://github.com/DenisVieriu97, https://github.com/albanD

* [MPS] Add logical ops (#80216)

  This PR adds `logical_not`, `logical_and`, `logical_or`, `logical_xor`.

  Pull Request resolved: #80216
  Approved by: https://github.com/albanD, https://github.com/kulinseth

* [MPS] Add glu (#79866)

  Adds mps op for `aten::glu.out`.

  Pull Request resolved: #79866
  Approved by: https://github.com/kulinseth, https://github.com/albanD

* [MPS] Fix std/var cache issue (#80502)

  Use `getTensorsStringKey` which has tensor shape info added as part of the key to prevent cache lookup issue when the shape of input tensor is changed.

  Fixes #80499
  Pull Request resolved: #80502
  Approved by: https://github.com/malfet, https://github.com/kulinseth

* Add scatter support for view operations (#79939)

  * Add scatter support for view operations; #78074, #78886, #79672
  * Update test_slicing_replace_column to properly test different sizes
  * Handle in-place changes for binary ops; add new testcase
  * Add new view ops testing scatter; add MPSDebugConfig.h config file for debugging purposes
  * Merge gatherViewTensor and scatterViewTensor into a generic function
  * Add scatter on demand in scatterViewOperation instead of caching it into a generic graph
  * Create separate graphs for scatter and gather;
  * Create scatter graph at scatter time

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #79939
  Approved by: https://github.com/razarmehr

* MPS: Fix handling of 1D tensors in linear backward (#80759)

  Fixes #79784
  Pull Request resolved: #80759
  Approved by: https://github.com/ezyang

* [MPS] Move the View ops to a separate file and reduce the number of graphs created (#80491)

  This is dependent on the PR to go in first: #79939
  Remove the data_ptr from the View Graph key which reduces the number of graphs created significantly.
  Don't wait when copying from MPS to MPS tensors

  Pull Request resolved: #80491
  Approved by: https://github.com/malfet

* [MPS] Add softplus backward (#79873)

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #79873
  Approved by: https://github.com/malfet

* [MPS] Add argmin (#80828)

  This PR
  1. adds argmin
  2. refactors `reduction_type` in `ReduceOps.mm` with enum.

  Co-authored by Kulin Seth <kulinseth@gmail.com>
  Pull Request resolved: #80828
  Approved by: https://github.com/malfet

* [MPS] Fix LSTM batch_first output transposed (#80597)

  The output of LSTM with `batch_first` should be transposed back to batch first format.

  Fixes #80306
  Pull Request resolved: #80597
  Approved by: https://github.com/kulinseth

* [MPS][BE] Introduce MPSUnaryCachedGraph (#81033)

  I.e. CachedGraph that has input and output tensors
  Also, add `MPSGraphCache::LookUpAs` template, which combines LookUp with static_cast to target type

  Pull Request resolved: #81033
  Approved by: https://github.com/kulinseth

* [MPS] Add test consistency from OpInfo based tests from PR 78504 (#79532)

  Pull Request resolved: #79532
  Approved by: https://github.com/albanD, https://github.com/malfet

* [MPS] Add huber loss (#80163)

  Fixes #ISSUE_NUMBER
  Pull Request resolved: #80163
  Approved by: https://github.com/kulinseth, https://github.com/malfet

* Remove two tests dependent on the MPS serialization checkin.

* Fix lint error (FLAKE8) F401

* Remove the serialization test from test_mps as its support is not there in 1.12.1.

Co-authored-by: Kulin Seth <kulinseth@gmail.com>
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Co-authored-by: Kulin Seth <kulin_seth@apple.com>
Co-authored-by: Abhishek Pathak <abhipathak97@gmail.com>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
Co-authored-by: qqaatw <qqaatw@gmail.com>
Co-authored-by: Ramin Azarmehr <razarmehr@apple.com>
pytorch · Jul 25, 2022 · e8534b9
1 parent 03b82bd
Showing 22 changed files with 3,895 additions and 848 deletions.
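
Several of the cherry-picked fixes in the commit message (adding tensor dtype to key generation, adding output type/shape to the cached graph key, and using `getTensorsStringKey` for std/var) come down to the same idea: a cached graph must be keyed on everything that changes its structure. The following is a minimal standalone sketch of that idea, not the PyTorch implementation. `TensorDesc`, `tensorsStringKey`, and `opKey` are hypothetical names, and the key layout is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified description of a tensor: only the properties
// that should influence the cache key in this sketch.
struct TensorDesc {
  std::string dtype;                // e.g. "Float", "Int"
  std::vector<std::int64_t> shape;  // e.g. {2, 3}
};

// Build a key fragment that encodes dtype and shape for every tensor.
// If either is left out, two structurally different graphs can collide.
std::string tensorsStringKey(const std::vector<TensorDesc>& tensors) {
  std::string key;
  for (const auto& t : tensors) {
    key += t.dtype + "[";
    for (std::size_t i = 0; i < t.shape.size(); ++i) {
      if (i != 0) key += ",";
      key += std::to_string(t.shape[i]);
    }
    key += "]:";
  }
  return key;
}

// Prefix the tensor key with the operation name (hypothetical layout).
std::string opKey(const std::string& op, const std::vector<TensorDesc>& tensors) {
  return op + ":" + tensorsStringKey(tensors);
}
```

With this layout, the same op applied to a `{2, 3}` and a `{3, 2}` input produces two distinct keys, so a graph built for one shape is never looked up for the other.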
diff --git a/.github/workflows/_mac-test-arm64.yml b/.github/workflows/_mac-test-arm64.yml
@@ -40,7 +40,7 @@ jobs:
           # shellcheck disable=SC1090
           . ~/miniconda3/etc/profile.d/conda.sh
           set -ex
-          conda create -yp "${ENV_NAME}" "python=${PY_VERS}" numpy expecttest
+          conda create -yp "${ENV_NAME}" "python=${PY_VERS}" numpy expecttest pyyaml
           # As wheels are cross-compiled they are reported as x86_64 ones
           ORIG_WHLNAME=$(ls -1 dist/*.whl); ARM_WHLNAME=${ORIG_WHLNAME/x86_64/arm64}; mv ${ORIG_WHLNAME} ${ARM_WHLNAME}
           conda run -p "${ENV_NAME}" python3 -mpip install dist/*.whl

diff --git a/aten/src/ATen/mps/MPSAllocator.h b/aten/src/ATen/mps/MPSAllocator.h
@@ -26,6 +26,23 @@
 namespace at {
 namespace mps {
 
+class IMpsAllocatorCallback {
+ public:
+  enum class EventType {
+    ALLOCATED, // buffer got allocated to be used immediately
+    RECYCLED,  // buffer pulled from free list to be reused
+    FREED,     // buffer put to free list for future recycling
+    RELEASED,  // buffer memory released
+  };
+  virtual ~IMpsAllocatorCallback() = default;
+  virtual void executeMPSAllocatorCallback(void* ptr, EventType event) = 0;
+};
+
+// The MPS allocator executes every registered callback when a buffer is freed (put on the free list) or released.
+C10_DECLARE_REGISTRY(MPSAllocatorCallbacksRegistry, IMpsAllocatorCallback);
+#define REGISTER_MPS_ALLOCATOR_CALLBACK(name, ...) \
+  C10_REGISTER_CLASS(MPSAllocatorCallbacksRegistry, name, __VA_ARGS__);
+
 namespace HeapAllocator {
 
 #define MB(x) round_page(x * 1048576UL)
@@ -46,13 +63,18 @@ struct HeapBlock;
 struct BufferBlock
 {
   id<MTLBuffer> buffer;
-  size_t size;
+  size_t size; // size after alignment
+  size_t requested_size; // requested size (before alignment)
+  // buffer shape is used for retrieving base of views in cached graphs
+  std::vector<int64_t> shape;
   bool in_use;
   HeapBlock* heap;
   id_t buf_id;
 
-  BufferBlock(size_t Size, const id<MTLBuffer> Buffer = nullptr, HeapBlock* Heap = nullptr, id_t BufID = 0) :
-            buffer(Buffer), size(Size), in_use(false), heap(Heap), buf_id(BufID) { }
+  BufferBlock(size_t Size, size_t RequestedSize = 0, const id<MTLBuffer> Buffer = nullptr,
+              HeapBlock* Heap = nullptr, id_t BufID = 0) :
+              buffer(Buffer), size(Size), requested_size(RequestedSize),
+              in_use(false), heap(Heap), buf_id(BufID) { }
 
   static bool Comparator(const BufferBlock* a, const BufferBlock* b) {
     return (a->size != b->size) ? a->size < b->size : (uintptr_t)a->buffer < (uintptr_t)b->buffer;
@@ -176,6 +198,9 @@ class MPSHeapAllocatorImpl
   void Free(void* ptr);
   void EmptyCache();
   bool isSharedBuffer(void* ptr);
+  ssize_t getRequestedBufferSize(void* ptr);
+  void setBufferShape(void* ptr, const IntArrayRef& shape);
+  IntArrayRef getBufferShape(void* ptr);
 
   inline id<MTLDevice> Device() const { return m_device; }
   void enable_debug_info() { m_enable_debug_info = true; }
@@ -209,6 +234,7 @@ class MPSHeapAllocatorImpl
   void release_buffers(BufferPool& pool);
   bool release_available_cached_buffers(const AllocParams& p);
   bool release_cached_buffers();
+  void trigger_memory_callbacks(BufferBlock* buffer_block, IMpsAllocatorCallback::EventType event);
 
   BufferPool& get_pool(size_t Size, bool useShared) {
       return Size <= kMaxSmallAlloc ? (useShared ? m_small_pool_shared : m_small_pool_private) :
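
The header above introduces `IMpsAllocatorCallback` and a registry of callbacks that the allocator invokes on buffer events. As a rough standalone illustration of that pattern, the sketch below re-declares the interface and swaps the `C10_DECLARE_REGISTRY` / `C10_REGISTER_CLASS` macros for a hand-rolled map of factories. `callbackRegistry`, `observedEvents`, `LoggingCallback`, `triggerMemoryCallbacks`, and `runCallbackDemo` are hypothetical names that do not exist in PyTorch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Mirrors the interface added in MPSAllocator.h.
class IMpsAllocatorCallback {
 public:
  enum class EventType { ALLOCATED, RECYCLED, FREED, RELEASED };
  virtual ~IMpsAllocatorCallback() = default;
  virtual void executeMPSAllocatorCallback(void* ptr, EventType event) = 0;
};

// Hypothetical stand-in for the c10 registry: name -> factory.
using CallbackFactory = std::function<std::unique_ptr<IMpsAllocatorCallback>()>;
std::map<std::string, CallbackFactory>& callbackRegistry() {
  static std::map<std::string, CallbackFactory> registry;
  return registry;
}

// Hypothetical sink so the demo can report which events were delivered.
std::vector<IMpsAllocatorCallback::EventType>& observedEvents() {
  static std::vector<IMpsAllocatorCallback::EventType> events;
  return events;
}

class LoggingCallback : public IMpsAllocatorCallback {
 public:
  void executeMPSAllocatorCallback(void* /*ptr*/, EventType event) override {
    observedEvents().push_back(event);
  }
};

// Same shape as trigger_memory_callbacks in the diff: create each
// registered callback by name and hand it the buffer pointer and event.
void triggerMemoryCallbacks(void* buffer, IMpsAllocatorCallback::EventType event) {
  for (const auto& entry : callbackRegistry()) {
    entry.second()->executeMPSAllocatorCallback(buffer, event);
  }
}

// Registers one callback, fires FREED then RELEASED, and returns the
// number of events the callback saw.
std::size_t runCallbackDemo() {
  observedEvents().clear();
  callbackRegistry()["logging"] = []() -> std::unique_ptr<IMpsAllocatorCallback> {
    return std::make_unique<LoggingCallback>();
  };
  int dummyBuffer = 0;
  triggerMemoryCallbacks(&dummyBuffer, IMpsAllocatorCallback::EventType::FREED);
  triggerMemoryCallbacks(&dummyBuffer, IMpsAllocatorCallback::EventType::RELEASED);
  return observedEvents().size();
}
```

Calling `runCallbackDemo()` from a `main` of your own exercises the whole path. Note that, as in the diff, a fresh callback object is created for every event, so any state a callback needs to keep has to live outside the object.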

diff --git a/aten/src/ATen/mps/MPSAllocator.mm b/aten/src/ATen/mps/MPSAllocator.mm
@@ -8,6 +8,8 @@
 namespace at {
 namespace mps {
 
+C10_DEFINE_REGISTRY(MPSAllocatorCallbacksRegistry, IMpsAllocatorCallback);
+
 namespace HeapAllocator {
 
 HeapBlock* MPSHeapAllocatorImpl::get_free_heap(AllocParams& p)
@@ -56,15 +58,16 @@
   TORCH_INTERNAL_ASSERT(buffer);
   // insert heap after a buffer was created on it to update the order of heap's set
   p.pool->heaps.insert(heap);
-  p.buffer_block = new BufferBlock(p.size(), buffer, heap, m_allocated_buffers.size() + 1);
+  p.buffer_block = new BufferBlock(p.size(), p.requested_size, buffer, heap, m_allocated_buffers.size() + 1);
   m_allocated_buffers[p.buffer_block->buffer] = p.buffer_block;
   m_total_allocated_memory += p.size();
 
   if (debug_info_enabled()) {
     std::cerr << "Allocated "
               << (p.pool->is_shared ? "shared" : "private")
               << " buffer #" << p.buffer_block->buf_id
-              << " with aligned size " << format_size(p.size())
+              << " of size " << format_size(p.size())
+              << " at " << p.buffer_block->buffer
               << " (requested size: " << format_size(p.requested_size)
               << ", heap size: " << format_size(heap->size.available)
               << ", total allocated: " << format_size(m_total_allocated_memory) << ")\n";
@@ -90,7 +93,8 @@
     std::cerr << "Reusing "
               << (p.pool->is_shared ? "shared" : "private")
               << " buffer #" << p.buffer_block->buf_id
-              << " with aligned size " << format_size(p.buffer_block->size)
+              << " of size " << format_size(p.buffer_block->size)
+              << " at " << p.buffer_block->buffer
               << " (requested size: " << format_size(p.requested_size) << ")\n";
   }
   return true;
@@ -125,22 +129,29 @@
 void MPSHeapAllocatorImpl::free_buffer(BufferBlock* buffer_block)
 {
   TORCH_INTERNAL_ASSERT(buffer_block->in_use);
+  trigger_memory_callbacks(buffer_block, IMpsAllocatorCallback::EventType::FREED);
   buffer_block->in_use = false;
+  buffer_block->shape.clear(); // reset shape
   BufferPool *pool = buffer_block->heap->pool;
   // Makes sure the BufferBlock* isn't already present in the pool we're freeing it back into.
   TORCH_INTERNAL_ASSERT(pool->buffers.insert(buffer_block).second);
 }
 
 BufferBlock* MPSHeapAllocatorImpl::get_allocated_buffer_block(void* ptr)
 {
-  id<MTLBuffer> buf = __builtin_bit_cast(id<MTLBuffer>, ptr);
-  auto it = m_allocated_buffers.find(buf);
+  auto it = m_allocated_buffers.find(ptr);
   if (it == m_allocated_buffers.end())
     return nullptr;
 
   return it->second;
 }
 
+void MPSHeapAllocatorImpl::trigger_memory_callbacks(BufferBlock* buffer_block, IMpsAllocatorCallback::EventType event) {
+  for (const auto& name : MPSAllocatorCallbacksRegistry()->Keys()) {
+    MPSAllocatorCallbacksRegistry()->Create(name)->executeMPSAllocatorCallback(buffer_block->buffer, event);
+  }
+}
+
 bool MPSHeapAllocatorImpl::isSharedBuffer(void* ptr)
 {
   std::lock_guard<std::mutex> lock(m_mutex);
@@ -150,6 +161,40 @@
   return buffer_block && buffer_block->heap->pool->is_shared;
 }
 
+ssize_t MPSHeapAllocatorImpl::getRequestedBufferSize(void* ptr)
+{
+  std::lock_guard<std::mutex> lock(m_mutex);
+
+  BufferBlock *buffer_block = get_allocated_buffer_block(ptr);
+  if (buffer_block)
+    return (ssize_t) buffer_block->requested_size;
+  // this indicates the passed buffer pointer wasn't found
+  return -1;
+}
+
+void MPSHeapAllocatorImpl::setBufferShape(void* ptr, const IntArrayRef& shape)
+{
+  std::lock_guard<std::mutex> lock(m_mutex);
+
+  BufferBlock *buffer_block = get_allocated_buffer_block(ptr);
+  TORCH_INTERNAL_ASSERT(buffer_block, "failed to find the buffer ", ptr);
+  // note that the IntArrayRef doesn't own the underlying data, and the backing
+  // memory for shape data must persist as long as the buffer is in use.
+  // So we need to copy to vector.
+  buffer_block->shape = shape.vec();
+}
+
+IntArrayRef MPSHeapAllocatorImpl::getBufferShape(void* ptr)
+{
+  std::lock_guard<std::mutex> lock(m_mutex);
+
+  BufferBlock *buffer_block = get_allocated_buffer_block(ptr);
+  if (buffer_block && buffer_block->shape.size() > 0)
+    return IntArrayRef{buffer_block->shape};
+
+  return IntArrayRef();
+}
+
 void MPSHeapAllocatorImpl::Free(void* ptr)
 {
   std::lock_guard<std::mutex> lock(m_mutex);
@@ -167,6 +212,8 @@
 
 void MPSHeapAllocatorImpl::release_buffer(BufferBlock* buffer_block, bool remove_empty_heap)
 {
+  trigger_memory_callbacks(buffer_block, IMpsAllocatorCallback::EventType::RELEASED);
+
   HeapBlock *heap = buffer_block->heap;
   BufferPool *pool = heap->pool;
   m_total_allocated_memory -= buffer_block->size;
@@ -318,6 +365,7 @@ static bool isEnvVarEnabled(const char *envvar) {
   static MPSAllocator s_mps_shared_alloc(true);
   return s_mps_shared_alloc;
 }
+
 MPSAllocator& _getPrivateAllocator() {
   static mps::MPSAllocator s_mps_private_alloc(false);
   return s_mps_private_alloc;
@@ -338,6 +386,19 @@ static bool isEnvVarEnabled(const char *envvar) {
   return &_getPrivateAllocator();
 }
 
+// TODO: create MPSHooks interface and move these there.
+ssize_t get_requested_buffer_size(void* ptr) {
+  return _getAllocImpl().getRequestedBufferSize(ptr);
+}
+
+void set_buffer_shape(void* ptr, const IntArrayRef& shape) {
+  _getAllocImpl().setBufferShape(ptr, shape);
+}
+
+IntArrayRef get_buffer_shape(void* ptr) {
+  return _getAllocImpl().getBufferShape(ptr);
+}
+
 } // namespace mps
 
 namespace native {
@@ -368,5 +429,4 @@ Tensor _pin_memory_mps(const Tensor& self, c10::optional<Device> device)
 }
 
 } // namespace native
-
 } // namespace at
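
The allocator changes above track two sizes per buffer (aligned and requested) and store an owned copy of the shape, because `IntArrayRef` does not own its data. The sketch below shows that bookkeeping in plain C++ under stated assumptions: `BufferBook` and `BlockInfo` are hypothetical names, `kAlignment` is an arbitrary value (the real allocator uses `round_page`), and `long long` replaces `ssize_t` for portability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified counterpart of BufferBlock.
struct BlockInfo {
  std::size_t size;                 // size after alignment
  std::size_t requested_size;       // size the caller asked for
  std::vector<std::int64_t> shape;  // owned copy of the shape
};

class BufferBook {
 public:
  // Assumed alignment for this sketch only.
  static constexpr std::size_t kAlignment = 256;

  void track(const void* ptr, std::size_t requested) {
    std::lock_guard<std::mutex> lock(mutex_);
    const std::size_t aligned =
        (requested + kAlignment - 1) / kAlignment * kAlignment;
    blocks_[ptr] = BlockInfo{aligned, requested, {}};
  }

  // Returns -1 when the pointer is unknown, like getRequestedBufferSize.
  long long requestedSize(const void* ptr) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = blocks_.find(ptr);
    return it == blocks_.end() ? -1 : static_cast<long long>(it->second.requested_size);
  }

  // Returns 0 when the pointer is unknown.
  std::size_t alignedSize(const void* ptr) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = blocks_.find(ptr);
    return it == blocks_.end() ? 0 : it->second.size;
  }

  // Copies the shape so it outlives the caller's storage.
  void setShape(const void* ptr, const std::vector<std::int64_t>& shape) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = blocks_.find(ptr);
    if (it != blocks_.end()) it->second.shape = shape;
  }

  // Returns a copy; an empty vector means "no shape recorded".
  std::vector<std::int64_t> shape(const void* ptr) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = blocks_.find(ptr);
    return it == blocks_.end() ? std::vector<std::int64_t>{} : it->second.shape;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<const void*, BlockInfo> blocks_;
};
```

One difference from the diff is deliberate: `shape()` returns a copy rather than a non-owning view, so the result stays valid even if the block is freed right after the lock is dropped. That choice belongs to this sketch, not to the source.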
diff --git a/aten/src/ATen/native/mps/OperationUtils.h b/aten/src/ATen/native/mps/OperationUtils.h
@@ -50,18 +50,22 @@ std::string getMPSShapeString(MPSShape* shape);
 std::string getTensorsStringKey(const TensorList& tensors, bool use_scalar_value = true);
 double getMPSScalarValue(const Tensor& t);
 std::string getArrayRefString(const IntArrayRef s);
-std::string getStridedKey(const Tensor& self, const IntArrayRef sz,
-                          const IntArrayRef strides, int64_t offset);
-id<MTLBuffer> gatherViewTensor(const at::Tensor& src, id<MTLBuffer> s);
+// use has_storage() on the returned tensor to determine if src actually is a view
+Tensor gatherViewTensor(const at::Tensor& src, at::Tensor& dst);
+Tensor& scatterViewTensor(const at::Tensor& src, at::Tensor& output);
 
 MPSShape* getMPSShape(const Tensor& t);
 MPSShape* getMPSShape(IntArrayRef sizes);
 MPSShape* getMPSShape(c10::MaybeOwned<Tensor> t);
 
+static inline id<MTLBuffer> getMTLBufferStorage(const at::Tensor& tensor) {
+  return __builtin_bit_cast(id<MTLBuffer>, tensor.storage().data());
+}
+
 class Placeholder {
  public:
-  Placeholder() : _placeholder(nullptr), _value(nullptr) {}
-  Placeholder(MPSGraphTensor* mpsGraphTensor) : _placeholder(mpsGraphTensor), _value(nullptr) {}
+  Placeholder() : _placeholder(nullptr), _value(nullptr), _tensor(Tensor()) {}
+  Placeholder(MPSGraphTensor* mpsGraphTensor) : _placeholder(mpsGraphTensor), _value(nullptr), _tensor(Tensor()) {}
   Placeholder(MPSGraphTensor* mpsGraphTensor, const Tensor& self, MPSShape *mpsShape = nullptr);
   MPSGraphTensor* getMPSGraphTensor() {
     return _placeholder;
@@ -73,26 +77,15 @@ class Placeholder {
     return _value == nullptr;
   }
 
-  void allocateViewTensor(const at::Tensor& src)
-  {
-    assert (!_viewOutput.numel());
-    _viewOutput = at::native::empty_mps(
-                  src.sizes(),
-                  src.scalar_type(),
-                  c10::nullopt,
-                  kMPS,
-                  c10::nullopt,
-                  c10::nullopt);
-  }
-
  private:
   MPSGraphTensor* _placeholder;
   MPSGraphTensorData* _value;
-  Tensor _viewOutput;
+  Tensor _tensor;
 };
 
 void resize_tensor(Tensor* output);
 MPSGraphTensor* trunc_tensor(MPSGraph* mpsGraph, MPSGraphTensor* inputTensor);
+MPSGraphTensor* castMPSTensor(MPSGraph *mpsGraph, MPSGraphTensor* tensor, ScalarType toType);
 MPSGraphTensorData *getMPSGraphTensorData(MPSGraph* mpsGraph, MPSStream* mpsStream, const Tensor& tensor);
 MPSGraphTensorData* getMPSGraphTensorFromScalar(MPSStream* mpsStream, const Scalar& scalar, MPSDataType dataType);
 
@@ -102,11 +95,10 @@ void printTensorNDArray(const Tensor& t);
 MPSGraphTensor* mpsGraphUnrankedPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType);
 MPSGraphTensor* mpsGraphRankedPlaceHolder(MPSGraph *mpsGraph, MPSDataType dataType, MPSShape* mpsShape);
 MPSGraphTensor* mpsGraphRankedPlaceHolder(MPSGraph *mpsGraph, const Tensor& tensor);
-MPSGraphTensor* mpsGraphConstantPlaceHolder(MPSGraph *mpsGraph, const double value, MPSShape* mpsShape, MPSDataType dataType);
 
 string get_mem_format_string(c10::MemoryFormat memory_format);
 
-using MPSCacheKey = int64_t;
+using MPSCacheKey = uint64_t;
 
 // derive this class to cache a graph and its inputs/ouputs
 // can be used to store any NSObject
@@ -117,16 +109,29 @@ struct MPSCachedGraph
    [_object release];
    _object = nullptr;
   }
+
+  template<typename T>
+  inline T* as() {
+    return static_cast<T*>(this);
+  }
+
   MPSGraph *graph() const { return (MPSGraph *)_object; }
   NSObject *object() const { return _object; }
 private:
   NSObject *_object = nullptr;
 };
 
+struct MPSUnaryCachedGraph : public MPSCachedGraph
+{
+  MPSUnaryCachedGraph(MPSGraph *graph) : MPSCachedGraph(graph) {}
+  MPSGraphTensor *inputTensor_ = nil;
+  MPSGraphTensor *outputTensor_ = nil;
+};
+
+
 // TODO: Improve the overall design of MPSGraphCache.
 // https://github.com/pytorch/pytorch/issues/77176
 // Cache holding various keys mapped to graphs
-
 struct MPSGraphCache
 {
   typedef MPSCachedGraph * (^CreateCachedGraphBlock)();
@@ -158,7 +163,7 @@ struct MPSGraphCache
   MPSGraphCache(const MPSGraphCache&) = delete;
   void operator=(const MPSGraphCache&) = delete;
 
-  MPSCachedGraph* CreateCachedGraph(const std::string& key, CreateCachedGraphBlock createCacheBlock) {
+  MPSCachedGraph* CreateCachedGraph(const std::string& key, CreateCachedGraphBlock createCacheBlock, void* view_ptr = nullptr) {
 
     __block MPSCachedGraph * result = nil;
 
@@ -176,6 +181,9 @@ struct MPSGraphCache
         result = createCacheBlock();
         CacheEntry entry(key, result);
         cache_.emplace(hash, entry);
+        if (view_ptr) {
+          views_list.insert(std::make_pair(view_ptr, hash));
+        }
       }
     });
     return result;
@@ -197,13 +205,40 @@ struct MPSGraphCache
     });
     return result;
   }
+
+  template<typename T>
+  inline T* LookUpAs(const std::string& key) const {
+    return static_cast<T *>(LookUp(key));
+  }
+
+  void FindAndRemoveViewEntry(void* ptr) {
+    // this may find multiple view entries with the same buffer pointers
+    auto views_range = views_list.equal_range(ptr);
+    if (views_range.first == views_range.second)
+      return;
+    for (auto view_it = views_range.first; view_it != views_range.second; ++view_it) {
+      MPSCacheKey hash = view_it->second;
+      // find the cache entry associated with the hash
+      auto cache_it = cache_.find(hash);
+      if (cache_it != cache_.end()) {
+        delete cache_it->second.cachedGraph_;
+        cache_.erase(cache_it);
+      }
+    }
+    // this erase-by-key will remove all pairs in the list with the same key
+    views_list.erase(ptr);
+  }
+
  private:
   MPSGraphCache() {
     serialQueue_ = dispatch_queue_create("cache queue", DISPATCH_QUEUE_SERIAL);
   }
 
   static MPSGraphCache* _instance_cache;
   std::unordered_map<MPSCacheKey, CacheEntry> cache_;
+  // list of buffers associated with view entries in the cache
+  // note that multiple view cache entries could use the same buffer pointer
+  std::unordered_multimap<void*, MPSCacheKey> views_list;
   dispatch_queue_t serialQueue_ = nullptr;
 
 };
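
The `views_list` multimap above ties a buffer pointer to every cached view graph built on it, so all of them can be evicted when the buffer goes away. A minimal standalone sketch of that eviction scheme follows. `ViewGraphCache` and `CachedGraph` are hypothetical stand-ins, the dispatch queue is omitted, and the sketch stores graphs in `std::unique_ptr` so that erasing a cache entry also destroys the graph, which is an assumption of this example rather than what the source does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

using CacheKey = std::uint64_t;

// Hypothetical placeholder for a cached graph.
struct CachedGraph {
  std::string key;
};

class ViewGraphCache {
 public:
  // Creates (or returns) the graph for `key`. When `view_ptr` is given,
  // the entry is also indexed by that buffer pointer.
  CachedGraph* create(const std::string& key, void* view_ptr = nullptr) {
    const CacheKey hash = std::hash<std::string>{}(key);
    auto it = cache_.find(hash);
    if (it != cache_.end()) return it->second.get();
    auto inserted = cache_.emplace(hash, std::make_unique<CachedGraph>(CachedGraph{key}));
    if (view_ptr) views_.emplace(view_ptr, hash);
    return inserted.first->second.get();
  }

  CachedGraph* lookUp(const std::string& key) const {
    auto it = cache_.find(std::hash<std::string>{}(key));
    return it == cache_.end() ? nullptr : it->second.get();
  }

  // Removes every cached graph associated with `ptr`.
  void findAndRemoveViewEntry(void* ptr) {
    auto range = views_.equal_range(ptr);
    for (auto it = range.first; it != range.second; ++it) {
      cache_.erase(it->second);  // unique_ptr frees the graph here
    }
    views_.erase(ptr);  // erase-by-key drops all pairs for this pointer
  }

  std::size_t size() const { return cache_.size(); }

 private:
  std::unordered_map<CacheKey, std::unique_ptr<CachedGraph>> cache_;
  std::unordered_multimap<void*, CacheKey> views_;
};
```

Two view graphs registered against the same buffer are both removed by a single `findAndRemoveViewEntry` call, while entries created without a `view_ptr` are left untouched.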