pskopnik/htc-cache-system-simulator

A HTC Cache System Simulator

This simulator for cache systems that are part of High-Throughput Computing (HTC) clusters was developed as part of my master's thesis. As is customary for such a project, the code is severely underdocumented and undertested. Some modules are buggy and did not make it into the evaluation. However, the most relevant concepts and the underlying assumptions are described in the thesis itself; I recommend reading (parts of) it before looking at the code here.

I have collected some open issues and ideas in the TODO.md file; these are what I would look at first if I were to continue using the project.

Evaluation/analysis code lives in a separate repository (simulator-analysis).

Commands

Record

Performs the workload generation phase of the simulator and writes an access sequence (trace) file.

Workload Stats

Computes various extended statistics over an access sequence and writes this information to CSV files.

This command uses a great deal of memory when all output stats are enabled. A large part of this is a full index of the access sequence, which requires (4 + 2 * length(parts)) * 8 bytes of memory for each access. C_0 averages 2.55 parts per access, which works out to almost 73 bytes of memory per access. However, because the re-uses of files occur within a limited time interval (about 12 weeks for C_0), swapping to disk is feasible.
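The per-access figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch of the formula above; the function names and the access count of 100 million are made up for illustration and are not taken from the simulator or from C_0.

```python
def index_bytes_per_access(parts_per_access: float) -> float:
    """Index memory per access, following (4 + 2 * length(parts)) * 8."""
    return (4 + 2 * parts_per_access) * 8


def index_bytes_total(num_accesses: int, parts_per_access: float) -> float:
    """Estimated index memory for a whole access sequence."""
    return num_accesses * index_bytes_per_access(parts_per_access)


# C_0 averages 2.55 parts per access (figure from the text above).
per_access = index_bytes_per_access(2.55)

# Hypothetical trace length, chosen only to show the order of magnitude.
total_gib = index_bytes_total(100_000_000, 2.55) / 2**30

print(f"{per_access:.1f} bytes per access")
print(f"{total_gib:.2f} GiB for 100 million accesses")
```

The estimate covers the index only; the other statistics add to it.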

Replay

Performs the cache policy simulation phase of the simulator. It reads an access sequence from a file and simulates one or more cache processors according to the specification passed to the command.

Reproducibility

Results can be made reproducible by passing --seed or by eliminating randomness through parameter choices (e.g. setting sigma = 0).

Because randomness only occurs while the schedule of workflows is computed, runs with different parameter sets but the same seed are comparable as long as the schedule-affecting parameters are left unchanged.
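The two routes to reproducibility can be illustrated with a toy schedule generator. This is a minimal sketch and not the simulator's scheduling code: the function sample_start_times, the mean_gap parameter and the use of a Gaussian gap are assumptions made for the example. Only the roles of the seed and of sigma mirror the text above.

```python
import random


def sample_start_times(seed, n, mean_gap=10.0, sigma=2.0):
    """Toy schedule: n start times with Gaussian gaps (hypothetical model)."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    t = 0.0
    times = []
    for _ in range(n):
        # Clamp at zero so that time never runs backwards.
        t += max(0.0, rng.gauss(mean_gap, sigma))
        times.append(t)
    return times


# Same seed and same schedule-affecting parameters: identical schedules.
same_seed_matches = sample_start_times(1, 5) == sample_start_times(1, 5)

# sigma = 0 removes the randomness, so the seed no longer matters.
no_randomness_matches = (
    sample_start_times(1, 5, sigma=0.0) == sample_start_times(2, 5, sigma=0.0)
)

print(same_seed_matches, no_randomness_matches)
```

Changing mean_gap or sigma in this sketch changes the schedule itself, which is the case in which two runs stop being comparable even with the same seed.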

Nevertheless, some aspects still differ between executions. Notably, the file names (or file keys) generated by the DataSet class use the memory address of the DataSet instance as part of the name, and this address is different on each execution. Insignificant variations in statistics have been observed as well, most likely due to follow-on effects of the different file names or to non-deterministic behaviour of the Python interpreter.
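The effect of address-based names can be sketched as follows. Both classes are hypothetical stand-ins written for this example; neither reproduces the actual DataSet class or its key format. In CPython, id() returns the memory address of an object, which generally changes from one run to the next. A per-process counter is one possible stable alternative, shown here only as an idea.

```python
import itertools


class AddressKeyedDataSet:
    """Hypothetical: file keys embed the instance's memory address."""

    def file_key(self, index: int) -> str:
        # id(self) is the address in CPython; it varies between executions.
        return f"{id(self):x}_{index}"


class CounterKeyedDataSet:
    """Hypothetical alternative: file keys embed a creation counter."""

    _ids = itertools.count()

    def __init__(self) -> None:
        # Stable across runs as long as instances are created in the same order.
        self._id = next(CounterKeyedDataSet._ids)

    def file_key(self, index: int) -> str:
        return f"ds{self._id}_{index}"


unstable = AddressKeyedDataSet()
first = CounterKeyedDataSet()
second = CounterKeyedDataSet()

print(unstable.file_key(0))  # differs from run to run
print(first.file_key(3), second.file_key(0))
```

The counter variant only yields stable names if the data sets are constructed in a deterministic order.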
