CamelMR is a simple MapReduce-style program written in OCaml, designed to process CSV files. It reads a given CSV file, sums up values based on a specified key (e.g., "Finding Type"), and then prints the results, sorted by the sum in descending order.
- OCaml
- Dune build system
- csv OCaml library
- parmap OCaml library
- alcotest OCaml library
- Clone the repository:
git clone <repository-url>
cd <repository-dir>
- Install the required libraries:
opam install csv parmap dune alcotest
- Build the project:
dune build
Run the program with the following command:
dune exec -- CamelMR <path_to_csv_file> <column_name_to_aggregate> (optional)<mode>
Replace <path_to_csv_file> with the path to your CSV file and <column_name_to_aggregate> with the name of the column to group by. The <mode> argument is optional.
For example:
dune exec -- CamelMR data/sample.csv Name
You can also experiment with the three provided runtimes: simple, parallel, and mapreduce (default).
The default runtime is mapreduce, but simple can be noticeably faster when your dataset is small.
All three produce the same result, but they use resources differently, which can affect how long the program takes to run. For example:
dune exec -- CamelMR data/sample.csv Name simple
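As a rough illustration of how the optional mode argument could be handled, here is a minimal sketch that defaults to mapreduce when no mode is given. The names mode, mode_of_string, and parse_args are hypothetical and are not taken from the project's main.ml; the actual implementation may differ.

```ocaml
(* Hypothetical sketch: these names are assumptions, not CamelMR's real code. *)
type mode = Simple | Parallel | MapReduce

(* Translate the command-line string into a mode, if it is recognised. *)
let mode_of_string = function
  | "simple" -> Some Simple
  | "parallel" -> Some Parallel
  | "mapreduce" -> Some MapReduce
  | _ -> None

(* [args] is the argument list without the program name.
   Two arguments select the default mapreduce runtime;
   a third argument selects the runtime explicitly. *)
let parse_args = function
  | [ path; column ] -> Ok (path, column, MapReduce)
  | [ path; column; m ] -> (
      match mode_of_string m with
      | Some mode -> Ok (path, column, mode)
      | None -> Error ("unknown mode: " ^ m))
  | _ -> Error "expected: <path_to_csv_file> <column_name_to_aggregate> [mode]"
```

In a real entry point, the list would typically come from List.tl (Array.to_list Sys.argv).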
You can also run the tests in src/test_camelmr.ml with:
dune exec -- testCamelMR
- Map: The program reads the CSV file line by line and, for each row, extracts the key (the entry in the specified column, such as "Finding Type") together with its associated value.
- Shuffle & Group: It then groups the results by key.
- Reduce: For each key, it sums up its values in parallel.
- Sort & Print: Finally, it sorts the keys based on their summed values in descending order and prints the results.
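The four steps above can be sketched with the OCaml standard library alone. This is a sequential illustration under stated assumptions, not the project's code: rows are assumed to be already parsed into string lists, the value is assumed to live in a separate numeric column (value_idx), and all function names are hypothetical. The real program uses the csv library for reading and may use parmap for the parallel reduce.

```ocaml
(* Hypothetical sketch: names and the value column are assumptions. *)

(* Map: turn each row into a (key, value) pair.
   Rows that are too short or hold a non-numeric value are skipped. *)
let map_rows ~key_idx ~value_idx (rows : string list list) : (string * float) list =
  List.filter_map
    (fun row ->
      match (List.nth_opt row key_idx, List.nth_opt row value_idx) with
      | Some k, Some v -> Option.map (fun f -> (k, f)) (float_of_string_opt v)
      | _ -> None)
    rows

(* Shuffle & Group: collect all values that share a key. *)
let group (pairs : (string * float) list) : (string * float list) list =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (k, v) ->
      let existing = Option.value (Hashtbl.find_opt tbl k) ~default:[] in
      Hashtbl.replace tbl k (v :: existing))
    pairs;
  Hashtbl.fold (fun k vs acc -> (k, vs) :: acc) tbl []

(* Reduce: sum the values of each group (sequentially in this sketch). *)
let reduce (groups : (string * float list) list) : (string * float) list =
  List.map (fun (k, vs) -> (k, List.fold_left ( +. ) 0.0 vs)) groups

(* Sort: order the keys by their sums, largest first. *)
let sort_desc (sums : (string * float) list) : (string * float) list =
  List.sort (fun (_, a) (_, b) -> compare b a) sums

(* Full pipeline: map, group, reduce, sort. *)
let aggregate ~key_idx ~value_idx rows =
  rows |> map_rows ~key_idx ~value_idx |> group |> reduce |> sort_desc

(* Print: one "key: sum" line per key. *)
let print_results sums =
  List.iter (fun (k, total) -> Printf.printf "%s: %g\n" k total) sums
```

Because each group is reduced independently of the others, the reduce step is the natural place to introduce parallelism: the List.map call could be swapped for a parallel map without changing the rest of the pipeline.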
To focus on a different key or make other changes, modify the main.ml file in the src directory and recompile.