Reproducibility repository for Online Regression Forests

Using this repository you should be able to reproduce all the experiments we performed for our JMLR paper on online regression forests with uncertainty.

Follow the instructions to prepare you environment and data. The file reproduce-output.sh contains the commands to re-create the most important tables and figures in the paper.

Instructions

The repository uses submodules to keep track of the different repositories needed to run the algorithms, so ensure you clone using the --recursive option, i.e. git clone --recursive https://github.com/thvasilo/uncertain-trees-reproducible.git

Installing the dependencies

`uncertain-trees-experiments` and `scikit-garden`

There are a few Python libraries needed to run the project, so we recommend creating a virtual environment to avoid messing up your default environment. We have used the Anaconda Python distribution to make things easier.

We've made some small modifications to the original scikit-garden library, so we need to install it from the included submodule rather than the PyPI repository.

conda env create -f rf-pred.yml  # Installs the base dependencies as a new virtual env
source activate rf-pred
pip install -e ./scikit-garden  # Install the customized scikit-garden repo.

MOA

We recommend using the pre-built binaries under binaries. The only requirement is Java 8. We've tested with the Oracle JDK, OpenJDK seems to cause issues with the results.

Alternatively you can build the MOA distribution using Maven by running mvn package -DskipTests in moa/moa.

Obtaining the data

The stationary data are included with the repository under the data/small-mid directory. The large airlines data are compressed under data/airlines. To decompress them, cd into data/airlines and run:

for FILE in *.tar.gz; do tar -zxf ${FILE}; done

To re-create the Friedman data run the generate_friedman_data.sh script.

Re-creating the files

It's also possible to re-create the files using the scripts we've included in the data/airlines directory.

You just need to run in succession:

./get_data.sh
./create_splits.sh

These two scripts will pull the original data, transform to csv, apply the pre-processing steps, and create the 700k, 2M and 5M splits in arff format using Weka.

Running the experiments

After you've prepared the environment and data, to re-run the experiments from the paper we can use the example commands in reproduce-output.sh. We recommend running the experiments selectively and not simply running the script, because the runtime for the airlines experiments is very long. The experiments on the small-scale data should not take very long however.

NOTE: Due to the random nature of the algorithms the exact results will be slightly different from those reported in the paper, unfortunately we didn't keep track of all the random seeds used in our experiments. The overall performance of the algorithms should not change significantly however.

Troubleshooting

Ensure you did git clone --recursive https://github.com/thvasilo/uncertain-trees-reproducible.git. Please file an issue if you run into any problems.

Citing

If you use this work please cite our JMLR paper:

@article{JMLR:v20:19-006,
  author  = {Theodore Vasiloudis and Gianmarco De Francisci Morales and Henrik Bostr{{\"o}}m},
  title   = {Quantifying Uncertainty in Online Regression Forests},
  journal = {Journal of Machine Learning Research},
  year    = {2019},
  volume  = {20},
  number  = {155},
  pages   = {1-35},
  url     = {http://jmlr.org/papers/v20/19-006.html}
}

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
binaries		binaries
data		data
moa @ 385b0bb		moa @ 385b0bb
uncertain-trees-experiments @ d3ece0c		uncertain-trees-experiments @ d3ece0c
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
reproduce-output.sh		reproduce-output.sh
rf-pred.yml		rf-pred.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

binaries

binaries

data

data

moa @ 385b0bb

moa @ 385b0bb

uncertain-trees-experiments @ d3ece0c

uncertain-trees-experiments @ d3ece0c

.gitattributes

.gitattributes

.gitignore

.gitignore

.gitmodules

.gitmodules

LICENSE

LICENSE

README.md

README.md

reproduce-output.sh

reproduce-output.sh

rf-pred.yml

rf-pred.yml

Repository files navigation

Reproducibility repository for Online Regression Forests

Instructions

Installing the dependencies

`uncertain-trees-experiments` and `scikit-garden`

MOA

Obtaining the data

Re-creating the files

Running the experiments

Troubleshooting

Citing

About

Releases

Packages

Languages

License

thvasilo/uncertain-trees-reproducible

Folders and files

Latest commit

History

Repository files navigation

Reproducibility repository for Online Regression Forests

Instructions

Installing the dependencies

uncertain-trees-experiments and scikit-garden

MOA

Obtaining the data

Re-creating the files

Running the experiments

Troubleshooting

Citing

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

`uncertain-trees-experiments` and `scikit-garden`