
Online random forests with prediction uncertainty


thvasilo/uncertain-trees-reproducible


Reproducibility repository for Online Regression Forests

Using this repository you should be able to reproduce all the experiments we performed for our JMLR paper on online regression forests with uncertainty.

Follow the instructions below to prepare your environment and data. The file reproduce-output.sh contains the commands to re-create the most important tables and figures in the paper.

Instructions

The repository uses submodules to keep track of the different repositories needed to run the algorithms, so ensure you clone using the --recursive option, i.e. git clone --recursive https://github.com/thvasilo/uncertain-trees-reproducible.git

Installing the dependencies

uncertain-trees-experiments and scikit-garden

There are a few Python libraries needed to run the project, so we recommend creating a virtual environment to avoid messing up your default environment. We have used the Anaconda Python distribution to make things easier.

We've made some small modifications to the original scikit-garden library, so we need to install it from the included submodule rather than the PyPI repository.

conda env create -f rf-pred.yml  # Installs the base dependencies as a new virtual env
source activate rf-pred  # With conda >= 4.4, "conda activate rf-pred" is the equivalent command
pip install -e ./scikit-garden  # Install the customized scikit-garden repo.

MOA

We recommend using the pre-built binaries under binaries. The only requirement is Java 8. We've tested with the Oracle JDK; OpenJDK seems to cause issues with the results.

Alternatively, you can build the MOA distribution using Maven by running mvn package -DskipTests in moa/moa.
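As a quick sanity check of the Java requirement, the sketch below tests whether a `java -version` banner line reports Java 8. The helper name `is_java8` is hypothetical and not part of this repository; it relies only on Java 8 reporting its version as `1.8.x`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Returns 0 if the given `java -version` banner line reports Java 8,
# which identifies itself with a version string of the form 1.8.x.
is_java8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With Java on the PATH, it could be used as `is_java8 "$(java -version 2>&1 | head -n 1)" || echo "Java 8 not detected"`. Note that `java -version` writes to standard error, hence the `2>&1`.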

Obtaining the data

The stationary data are included with the repository under the data/small-mid directory. The large airlines data are compressed under data/airlines. To decompress them, cd into data/airlines and run:

for FILE in *.tar.gz; do tar -zxf "${FILE}"; done

To re-create the Friedman data run the generate_friedman_data.sh script.

Re-creating the files

It's also possible to re-create the files using the scripts we've included in the data/airlines directory.

Run the following two scripts in succession:

./get_data.sh
./create_splits.sh

These two scripts download the original data, convert it to CSV, apply the pre-processing steps, and create the 700k, 2M and 5M splits in ARFF format using Weka.
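To check that a generated split has the expected size, one option is to count the data rows in the ARFF file. The sketch below is a hypothetical helper, not part of this repository; it assumes one instance per line after the `@data` marker and skips blank lines and `%` comment lines, as in the standard ARFF layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Prints the number of instances in an ARFF file: non-blank,
# non-comment lines that follow the @data marker.
arff_instances() {
    awk 'tolower($0) ~ /^@data/ { in_data = 1; next }
         in_data && !/^[ \t]*(%|$)/ { n++ }
         END { print n + 0 }' "$1"
}
```

Call it with the path of one of the generated split files as its only argument. The file names produced by create_splits.sh are not listed here, so check the output directory for them.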

Running the experiments

After you've prepared the environment and data, you can re-run the experiments from the paper using the example commands in reproduce-output.sh. We recommend running the experiments selectively rather than executing the whole script, because the airlines experiments take a very long time to run. The experiments on the small-scale data should finish much faster.
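One way to run the experiments selectively is to list the commands in the script together with their line numbers and then execute only the ones you want. The sketch below is a hypothetical helper, not part of this repository, and it assumes each command in reproduce-output.sh fits on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Prints every line of a script that is neither blank nor a comment,
# prefixed with its line number in the original file.
list_commands() {
    grep -n -v -E '^[[:space:]]*(#|$)' "$1"
}
```

After `list_commands reproduce-output.sh`, a single command can be run by its line number, for example `sed -n '12p' reproduce-output.sh | bash`, where 12 is a placeholder for a line number taken from the listing.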

NOTE: Due to the random nature of the algorithms, the exact results will differ slightly from those reported in the paper. Unfortunately, we didn't keep track of all the random seeds used in our experiments. The overall performance of the algorithms should not change significantly, however.

Troubleshooting

Ensure you cloned the repository with the --recursive option, i.e. git clone --recursive https://github.com/thvasilo/uncertain-trees-reproducible.git. Please file an issue if you run into any problems.
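If the repository was cloned without --recursive, the submodule directories will be empty. The sketch below reports such directories. The helper name `check_submodules` is hypothetical, and the directory names in the usage note are assumed from the paths mentioned in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Reports each given directory that is missing or empty and returns
# non-zero if any was found.
check_submodules() {
    local missing=0
    local dir
    for dir in "$@"; do
        if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "empty or missing: $dir"
            missing=1
        fi
    done
    return "$missing"
}
```

From the repository root, `check_submodules scikit-garden moa || git submodule update --init --recursive` would fetch the submodules only when one of them is empty. `git submodule update --init --recursive` is the standard git command for populating submodules after a non-recursive clone.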

Citing

If you use this work please cite our JMLR paper:

@article{JMLR:v20:19-006,
  author  = {Theodore Vasiloudis and Gianmarco De Francisci Morales and Henrik Bostr{{\"o}}m},
  title   = {Quantifying Uncertainty in Online Regression Forests},
  journal = {Journal of Machine Learning Research},
  year    = {2019},
  volume  = {20},
  number  = {155},
  pages   = {1-35},
  url     = {http://jmlr.org/papers/v20/19-006.html}
}