[QUESTION] Best way to store hashes? #2443

Open
dylanstreb opened this issue Mar 5, 2024 · 5 comments

@dylanstreb

I'm currently storing hashes in a LiteDB database for use as a file cache. I'm planning on expanding it, so I was thinking of revisiting the storage. Right now I'm just using Base64-encoded strings. I figured that changing the storage to bytes would be more efficient.

So I wrote a quick test program to find out. I generated 1,000,000 integers, hashed them, and used a simple stopwatch to compare strings and byte[]. I'm using MurmurHash for this, which produces a 128-bit output. The models are simple:

    public class StringModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public string? Data { get; set; }
    }

    public class ByteModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public byte[]? Data { get; set; }
    }

The results were not what I expected:

    Took: 18.4086132  byte hashes
    Took: 16.8086536  base64 hashes
    Bytes database filesize: 122028032
    Strings database filesize: 133980160

This is just from doing an InsertBulk on the models. Inserting bytes took longer, which I didn't expect. The database is smaller, so I assume LiteDB isn't doing any kind of binary->hex conversion for storage, but I would naively have assumed that the smaller byte[] records would also insert faster.

Is there a better way to store small binary data in LiteDB? I'm more concerned about speed than file size, so should it be left as a Base64 string? Or am I setting up the models incorrectly in some way?
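
As a rough sanity check on the file sizes, the sketch below compares the payload length of a 128-bit hash stored as byte[] with the same hash as a Base64 string, and works out the per-record difference from the two file sizes reported above. The class and method names are hypothetical, and the sketch says nothing about how LiteDB lays out either type on disk.

```csharp
using System;

static class EncodedSizeSketch
{
    // Length in characters of the Base64 encoding of `byteCount` bytes.
    public static int Base64Length(int byteCount) =>
        Convert.ToBase64String(new byte[byteCount]).Length;

    // Average size difference per record between two database files.
    public static double BytesPerRecord(long largerFile, long smallerFile, long records) =>
        (largerFile - smallerFile) / (double)records;

    static void Main()
    {
        // A 128-bit hash is 16 raw bytes; Base64 pads that to 24 characters.
        Console.WriteLine(Base64Length(16)); // 24

        // File sizes from the test above, 1,000,000 records each:
        // roughly 12 extra bytes per record for the string version.
        Console.WriteLine(BytesPerRecord(133980160, 122028032, 1_000_000));
    }
}
```

The 8 extra payload bytes per record from Base64 account for most, but not all, of the observed difference; the remainder presumably comes from how each type is encoded and paged, which this sketch does not model.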

@azureskydiver

v4 had FileStorage available, which is presumably better for storing binary blobs:
https://github.com/mbdavid/LiteDB/wiki/FileStorage

@dylanstreb
Author

FileStorage appears to be for dealing with large files. This is for numerous small blobs, and since they're of fixed, known length, it should in theory be possible to optimize for that. Doing this manually, for example by splitting a 128-bit hash into two 64-bit ints, doesn't help.

Storing the data as a string does seem to be the best option. I'm guessing there are just more optimizations out there (in C# and/or in LiteDB) for string processing than byte[] processing and that's what improves the performance.

@azureskydiver

Good point. I read "128 KB", not "128 bits". Sorry.

@azureskydiver

Perhaps I'm testing the wrong way, but I'm getting roughly the same file sizes and insertion times (except for the Base64 string approach) with the following:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using LiteDB;

namespace TestLiteDb128
{
    interface IModel
    {
        [BsonIgnore]
        byte[]? Value { get; set; }
    }

    class Base64Model : IModel
    {
        public int Id { get; set; }
        public string? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v == null ? new byte[16] : Convert.FromBase64String(v);

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Convert.ToBase64String(new byte[16])
                                  : Convert.ToBase64String(value);
            }
        }
    }

    class ByteModel : IModel
    {
        public int Id { get; set; }
        public byte[]? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v;
            set
            {
                Debug.Assert(value?.Length == 16);
                v = new byte[16];
                if (value != null)
                    Array.Copy(value, v, value.Length);
            }
        }
    }

    class LongModel : IModel
    {
        public int Id { get; set; }
        public long lv { get; set; }
        public long hv { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get
            {
                var low = BitConverter.GetBytes(lv);
                var high = BitConverter.GetBytes(hv);
                var ret = new byte[16];
                Array.Copy(low, ret, 8);
                Array.Copy(high, 0, ret, 8, 8);
                return ret;
            }

            set
            {
                Debug.Assert(value?.Length == 16);
                if (value == null)
                {
                    lv = 0;
                    hv = 0;
                }
                else
                {
                    lv = BitConverter.ToInt64(value, 0);
                    hv = BitConverter.ToInt64(value, 8);
                }
            }
        }
    }

    class GuidModel : IModel
    {
        public int Id { get; set; }
        public Guid v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v.ToByteArray();

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Guid.Empty : new Guid(value);
            }
        }
    }

    class Program
    {
        static IEnumerable<T> Generate<T>(int count) where T : IModel, new()
        {
            var value = new byte[16];

            for (long i = 0; i < count; i++)
            {
                var low = BitConverter.GetBytes(i);
                Array.Copy(low, value, low.Length);
                T model = new T();
                model.Value = value;
                yield return model;
            }
        }

        static void Test<T>(string filename) where T : IModel, new()
        {
            var stopwatch = new Stopwatch();

            Console.WriteLine($"Generating items for {filename}...");
            stopwatch.Start();
            var items = Generate<T>(1000).ToList();
            stopwatch.Stop();
            Console.WriteLine($"Generated items in {stopwatch.Elapsed}");

            if (File.Exists(filename))
                File.Delete(filename);

            Console.WriteLine($"Filling {filename} ...");
            stopwatch.Reset();
            stopwatch.Start();
            using (var db = new LiteDatabase(filename))
            {
                var col = db.GetCollection<T>();

                foreach(var item in items)
                    col.Insert(item);
            }
            stopwatch.Stop();
            Console.WriteLine($"Filled {filename} in {stopwatch.Elapsed}");
        }

        static void Main(string[] args)
        {
            Test<Base64Model>("Base64Model.db");
            Test<ByteModel>("ByteModel.db");
            Test<LongModel>("LongModel.db");
            Test<GuidModel>("GuidModel.db");
        }
    }
}

(Yes, I know using Benchmark would have been better, but I was just trying to get a rough feel for times and sizes.)
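
Short of a full benchmarking harness, a single Stopwatch reading can be made somewhat less noisy by doing a warm-up run and taking the median of several timed runs. The helper below is a minimal sketch of that idea using only the standard library; `TimingSketch` and `MedianTime` are hypothetical names, and the Base64 loop in `Main` is just a stand-in workload. For the database test, each run would need to start from a fresh file.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Runs `action` once untimed as a warm-up, then `runs` more times,
    // and returns the middle sample (the upper middle when `runs` is even).
    public static TimeSpan MedianTime(Action action, int runs = 5)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        action(); // warm-up: JIT compilation, caches, first-use allocations

        var samples = new TimeSpan[runs];
        for (int i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            samples[i] = stopwatch.Elapsed;
        }

        Array.Sort(samples);
        return samples[runs / 2];
    }

    static void Main()
    {
        var hash = new byte[16]; // stand-in for a 128-bit hash

        TimeSpan median = MedianTime(() =>
        {
            for (int i = 0; i < 1_000_000; i++)
                Convert.ToBase64String(hash);
        });

        Console.WriteLine($"Median of 5 runs: {median}");
    }
}
```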

@dylanstreb
Author

For my test using two longs, I made a struct instead of putting the longs directly into the model. I'm guessing that's the difference.

GUID I hadn't even considered trying.
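
One thing to keep in mind if a 16-byte hash is wrapped in a Guid: the bytes round-trip exactly through `new Guid(byte[])` and `ToByteArray()`, but the Guid's string form reorders the first eight bytes, so it will not match a plain hex dump of the hash. A minimal sketch, with hypothetical class and method names:

```csharp
using System;
using System.Linq;

static class GuidRoundTripSketch
{
    // True if the 16 bytes survive a trip through Guid unchanged.
    public static bool RoundTrips(byte[] hash) =>
        new Guid(hash).ToByteArray().SequenceEqual(hash);

    static void Main()
    {
        // Stand-in for a 128-bit hash: bytes 0x00 through 0x0F.
        byte[] hash = Enumerable.Range(0, 16).Select(i => (byte)i).ToArray();
        var guid = new Guid(hash);

        Console.WriteLine(RoundTrips(hash));            // True
        Console.WriteLine(BitConverter.ToString(hash)); // 00-01-02-03-04-05-06-07-08-09-0A-0B-0C-0D-0E-0F
        Console.WriteLine(guid.ToString("N"));          // 030201000504070608090a0b0c0d0e0f
    }
}
```

This only matters if the Guid is ever displayed or compared as text; for storage and lookup by value, the byte round-trip is what counts.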
