GitHub - CyberAgentAI/libffm: A Library for Field-aware Factorization Machines


forked from ycjuan/libffm


A Library for Field-aware Factorization Machines

BSD-3-Clause license


LIBFFM is a library for field-aware factorization machines (FFM). For the formulation it solves, please check:

    http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf



Table of Contents
=================

- Overfitting and Early Stopping
- Specifying the importance weights
- Installation
- Data Format
- Command Line Usage
- Examples
- Library Usage
- OpenMP
- Building macOS Binaries
- Building Windows Binaries



Overfitting and Early Stopping
==============================

FFM is prone to overfitting, and the solution we have so far is early stopping. See how FFM behaves on a certain data
set:

    > ffm-train -p va.ffm -l 0.00002 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
       9      0.41394      0.47088
      10      0.40326      0.47228
      11      0.39156      0.47435
      12      0.37886      0.47683
      13      0.36522      0.47975
      14      0.35079      0.48321
      15      0.33578      0.48703


We see the best validation loss is achieved at the 7th iteration. If we keep training, then overfitting begins. It is
worth noting that increasing the regularization parameter does not help:

    > ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
    iter   tr_logloss   va_logloss
       1      0.50532      0.49905
       2      0.48782      0.49242
       3      0.48136      0.48748
                 ...
      29      0.42183      0.47014
                 ...
      48      0.37071      0.47333
      49      0.36767      0.47374
      50      0.36472      0.47404


To avoid overfitting, we recommend always providing a validation set with option `-p.' You can use option `--auto-stop'
to stop at the iteration that reaches the best validation loss:

    > ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
    iter   tr_logloss   va_logloss
       1      0.49738      0.48776
       2      0.47383      0.47995
       3      0.46366      0.47480
       4      0.45561      0.47231
       5      0.44810      0.47034
       6      0.44037      0.47003
       7      0.43239      0.46952
       8      0.42362      0.46999
    Auto-stop. Use model at 7th iteration.
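
The stopping rule illustrated above can be sketched in a few lines of C++. This is only an illustration, not LIBFFM's
actual code: the function name `best_iteration' is hypothetical, and the sketch assumes training stops at the first
iteration whose validation loss does not improve on the best seen so far.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LIBFFM): scan per-iteration validation
// losses and return the 1-based iteration whose model would be kept.
// Assumption: stop at the first iteration that fails to improve on the best.
std::size_t best_iteration(const std::vector<double>& va_logloss) {
    std::size_t best = 0;  // 0 means "no iteration was run"
    for (std::size_t i = 0; i < va_logloss.size(); ++i) {
        if (best == 0 || va_logloss[i] < va_logloss[best - 1]) {
            best = i + 1;  // new best validation loss
        } else {
            break;         // no improvement: stop and keep the previous best
        }
    }
    return best;
}
```

Given the validation losses printed above (0.48776, 0.47995, ..., 0.46952, 0.46999), this returns 7, which agrees with
the `Auto-stop' message.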

Specifying the importance weights
=================================

Usage:

    Use '-W weight_file' to assign an importance weight to each training instance.
    Use '-WV weight_file' to assign an importance weight to each validation instance.
    Please make sure all importance weights are non-negative.

Example:

    $ ./ffm-train -p va.ffm -W weights.txt -l 0.00002 tr.ffm
    $ ./ffm-train -p va.ffm -W weights.txt -WV va_weights.txt -l 0.00002 tr.ffm
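
As a rough illustration of what an importance weight means, the sketch below computes a weighted average logloss. How
LIBFFM applies the weights internally is not described here, so the function name `weighted_logloss', the +1/-1 label
convention, and the normalization by the sum of weights are all assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LIBFFM): weighted average logloss.
// y[i] is +1 or -1, phi[i] is the raw model output for instance i, and
// w[i] >= 0 is its importance weight. All vectors must have equal length.
double weighted_logloss(const std::vector<int>& y,
                        const std::vector<double>& phi,
                        const std::vector<double>& w) {
    double loss = 0.0;
    double total = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        // log(1 + exp(-y * phi)) is the logistic loss of one instance
        loss += w[i] * std::log1p(std::exp(-y[i] * phi[i]));
        total += w[i];
    }
    return total > 0.0 ? loss / total : 0.0;
}
```

An instance with weight 0 contributes nothing, and an instance with weight 2 counts as much as two identical instances
with weight 1.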


Installation
============

Requirement: LIBFFM is written in C++. It requires C++11 and OpenMP support. If OpenMP is not available on your
platform, please refer to section `OpenMP.'

- Unix-like systems:
  To compile on Unix-like systems, type `make' in the command line.

- OS X:
  The built-in compiler should be able to compile LIBFFM. However, OpenMP may
  not be supported. In this case you have to compile without OpenMP. See
  section `OpenMP' for details.

- Windows:
  See `Building Windows Binaries' to compile.



Data Format
===========

The data format of LIBFFM is:

<label> <field1>:<index1>:<value1> <field2>:<index2>:<value2> ...
.
.
.

`field' and `index' should be non-negative integers. See an example
`bigdata.tr.txt.'
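
The format can be read with a small parser such as the one below. This is a sketch for illustration only: the types
`Feature' and `Instance' and the function `parse_ffm_line' are hypothetical and are not LIBFFM's own reader.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; LIBFFM's own structures are in ffm.h.
struct Feature {
    int field;
    int index;
    float value;
};

struct Instance {
    float label;
    std::vector<Feature> features;
};

// Parse one line of the form
//   <label> <field1>:<index1>:<value1> <field2>:<index2>:<value2> ...
// Returns false (leaving `out` untouched) if the line is malformed.
bool parse_ffm_line(const std::string& line, Instance& out) {
    std::istringstream in(line);
    Instance inst;
    if (!(in >> inst.label)) return false;

    std::string token;
    while (in >> token) {
        std::istringstream tok(token);
        Feature f;
        char c1 = 0, c2 = 0;
        if (!(tok >> f.field >> c1 >> f.index >> c2 >> f.value)) return false;
        // Both separators must be ':' and both integers non-negative.
        if (c1 != ':' || c2 != ':' || f.field < 0 || f.index < 0) return false;
        inst.features.push_back(f);
    }
    out = inst;
    return true;
}
```

For example, the line `1 0:3:1.5 2:7:1' yields label 1 and two features: field 0, index 3, value 1.5, and field 2,
index 7, value 1.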



Command Line Usage
==================

-   `ffm-train'

    usage: ffm-train [options] training_set_file [model_file]

    options:
    -l <lambda>: set regularization parameter (default 0.00002)
    -k <factor>: set number of latent factors (default 4)
    -t <iteration>: set number of iterations (default 15)
    -r <eta>: set learning rate (default 0.2)
    -s <nr_threads>: set number of threads (default 1)
    -p <path>: set path to the validation set
    -f <path>: set path for production model file
    -m <prefix>: set key prefix for production model
    -W <path>: set path of importance weights file for training set
    -WV <path>: set path of importance weights file for validation set
    --quiet: quiet mode (no output)
    --no-norm: disable instance-wise normalization
    --no-rand: disable random update
    --json-meta: generate a JSON meta file at the given file path
    --auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)
    --auto-stop-threshold: set the threshold count for stopping at the iteration that achieves the best validation loss (must be used with --auto-stop)
    --nds-rate: set the negative down sampling rate for the training dataset

    By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use
    `--no-norm' to disable this function.

    By default, our algorithm randomly selects an instance for update in each inner iteration. On some datasets you may
    want to update in the original order. You can do so by using `--no-rand' together with `-s 1.'

    Because FFM usually needs early stopping for better test performance, we provide an option `--auto-stop' to stop at
    the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when
    you use this option.
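
    The instance-wise normalization mentioned above can be sketched as follows. This is an illustration of the idea
    only: the function name `normalize_instance' is hypothetical, and LIBFFM may apply the scaling factor in a
    different place (for example inside its update rule) rather than rewriting the values.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of LIBFFM): scale the feature values of one
// instance so that their 2-norm equals 1. An all-zero instance is unchanged.
void normalize_instance(std::vector<float>& values) {
    double sum_sq = 0.0;
    for (float v : values) sum_sq += static_cast<double>(v) * v;
    if (sum_sq == 0.0) return;  // nothing to scale
    const double scale = 1.0 / std::sqrt(sum_sq);
    for (float& v : values) v = static_cast<float>(v * scale);
}
```

    For an instance with values 3 and 4, the 2-norm is 5, so the values become 0.6 and 0.8.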


-   `ffm-predict'

    usage: ffm-predict test_file model_file output_file [options]

    options:
    --nds-rate: set the negative down sampling rate for the training dataset



Examples
========

> ffm-train bigdata.tr.txt model

train a model using the default parameters

> ffm-train -l 0.001 -k 16 -t 30 -r 0.05 -s 4 bigdata.tr.txt model

train a model using the following parameters:

    regularization cost = 0.001
    latent factors = 16
    iterations = 30
    learning rate = 0.05
    threads = 4

> ffm-train -p bigdata.te.txt bigdata.tr.txt model

use bigdata.te.txt as validation set

> ffm-train --quiet bigdata.tr.txt

do not print messages to the screen

> ffm-predict bigdata.te.txt model output

do prediction

> ffm-train -p bigdata.te.txt -t 100 --auto-stop bigdata.tr.txt

use auto-stop to stop at the best iteration according to validation loss

Library Usage
=============

These structures and functions are declared in the header file `ffm.h.' You need to #include `ffm.h' in your C/C++
source files and link your program with `ffm.cpp.' You can see `ffm-train.cpp' and `ffm-predict.cpp' for examples
showing how to use them.

There are four public data structures in LIBFFM.


-   struct ffm_node
    {
        ffm_int f;    // field index
        ffm_int j;    // column index
        ffm_float v;  // value
    };

    Each `ffm_node' represents a non-zero element in a sparse matrix.

-   struct ffm_problem
    {
        ffm_int n;      // number of features
        ffm_int l;      // number of instances
        ffm_int m;      // number of fields
        ffm_node *X;    // non-zero elements
        ffm_long *P;    // row pointers
        ffm_float *Y;   // labels
    };

-   struct ffm_parameter
    {
        ffm_float eta;
        ffm_float lambda;
        ffm_int nr_iters;
        ffm_int k;
        ffm_int nr_threads;
        ffm_float nds_rate;
        bool quiet;
        bool normalization;
        bool random;
        bool auto_stop;
    };

    `ffm_parameter' represents the parameters used for training. The meaning of
    each variable is:

    variable         meaning                             default
    ============================================================
    eta              learning rate                           0.1
    lambda           regularization cost                       0
    nr_iters         number of iterations                     15
    k                number of latent factors                  4
    nr_threads       number of threads used                    1
    quiet            no outputs to stdout                  false
    normalization    instance-wise normalization           false
    random           randomly select instance in SG         true
    auto_stop        auto stop at the best iteration       false
    nds_rate         negative down sampling rate             1.0

    To obtain a parameter object with default values, use the function
    `ffm_get_default_param.'


-   struct ffm_model
    {
        ffm_int n;              // number of features
        ffm_int m;              // number of fields
        ffm_int k;              // number of latent factors
        ffm_float *W;           // store model values
        bool normalization;     // do instance-wise normalization
    };
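
The `X', `P', and `Y' members of `ffm_problem' describe a sparse matrix stored row by row. The sketch below shows one
way such a layout can be filled, assuming the usual compressed-sparse-row convention in which `P' has l + 1 entries
and row i occupies X[P[i]] up to (but not including) X[P[i+1]]. The types `Node' and `Problem' are local stand-ins
that mirror the documented fields, `build_problem' is a hypothetical helper, and taking n and m as the largest index
plus one is also an assumption.

```cpp
#include <cstddef>
#include <vector>

// Local stand-ins mirroring the documented fields; the real definitions
// live in ffm.h.
struct Node { int f; int j; float v; };

struct Problem {
    int n = 0;                 // number of features (assumed: max j + 1)
    int l = 0;                 // number of instances
    int m = 0;                 // number of fields (assumed: max f + 1)
    std::vector<Node> X;       // non-zero elements of all rows, concatenated
    std::vector<long long> P;  // row pointers, l + 1 entries
    std::vector<float> Y;      // labels, one per instance
};

// Hypothetical builder: rows[i] holds the non-zero elements of instance i
// and labels[i] its label. Both vectors must have the same length.
Problem build_problem(const std::vector<std::vector<Node>>& rows,
                      const std::vector<float>& labels) {
    Problem prob;
    prob.P.push_back(0);  // the first row starts at offset 0
    for (std::size_t i = 0; i < rows.size(); ++i) {
        for (const Node& node : rows[i]) {
            prob.X.push_back(node);
            if (node.j + 1 > prob.n) prob.n = node.j + 1;
            if (node.f + 1 > prob.m) prob.m = node.f + 1;
        }
        // The end of row i is also the start of row i + 1.
        prob.P.push_back(static_cast<long long>(prob.X.size()));
        prob.Y.push_back(labels[i]);
    }
    prob.l = static_cast<int>(rows.size());
    return prob;
}
```

With this layout, the pointers `X + P[i]' and `X + P[i+1]' delimit one instance, which is the kind of begin/end pair
that a per-instance prediction routine expects.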



Functions available in LIBFFM include:


-   ffm_parameter ffm_get_default_param();

    Get default parameters.

-   ffm_int ffm_save_model(struct ffm_model const *model, char const *path);

    Save a model. It returns 0 on success and 1 on failure.

-   struct ffm_model* ffm_load_model(char const *path);

    Load a model. If the model could not be loaded, a nullptr is returned.

-   void ffm_destroy_model(struct ffm_model **model);

    Destroy a model.

-   struct ffm_model* ffm_train(struct ffm_problem const *prob, ffm_parameter param);

    Train a model.

-   struct ffm_model* ffm_train_with_validation(struct ffm_problem const *Tr, struct ffm_problem const *Va, ffm_parameter param);

    Train a model with training set `Tr' and validation set `Va.' The logloss of the validation set is printed at each
    iteration.

-   ffm_float ffm_predict(ffm_node *begin, ffm_node *end, ffm_model *model);

    Do prediction. `begin' and `end' are pointers to specify the beginning and ending position of the instance to be
    predicted.
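
    To make the computation behind a prediction concrete, the sketch below evaluates the standard FFM formulation:
    the sum over feature pairs (j1, j2) of <w_{j1,f2}, w_{j2,f1}> * x_{j1} * x_{j2}, passed through a sigmoid. It is
    not LIBFFM's implementation: the types `Term' and `ToyModel', the function `toy_predict', and the weight layout
    W[(j * m + f) * k + d] are assumptions made for illustration, and instance-wise normalization is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Term { int f; int j; float v; };  // field, feature index, value

struct ToyModel {
    int n;                 // number of features
    int m;                 // number of fields
    int k;                 // number of latent factors
    std::vector<float> W;  // n * m * k weights, assumed layout W[(j*m + f)*k + d]
};

// Hypothetical sketch of an FFM prediction for one instance.
double toy_predict(const std::vector<Term>& x, const ToyModel& model) {
    double phi = 0.0;
    for (std::size_t a = 0; a < x.size(); ++a) {
        for (std::size_t b = a + 1; b < x.size(); ++b) {
            const Term& t1 = x[a];
            const Term& t2 = x[b];
            // Skip features or fields outside the range the model covers.
            if (t1.j >= model.n || t2.j >= model.n ||
                t1.f >= model.m || t2.f >= model.m) continue;
            // Feature j1 uses its latent vector for field f2, and vice versa.
            const float* w1 =
                &model.W[(static_cast<std::size_t>(t1.j) * model.m + t2.f) * model.k];
            const float* w2 =
                &model.W[(static_cast<std::size_t>(t2.j) * model.m + t1.f) * model.k];
            double dot = 0.0;
            for (int d = 0; d < model.k; ++d) dot += static_cast<double>(w1[d]) * w2[d];
            phi += dot * t1.v * t2.v;
        }
    }
    return 1.0 / (1.0 + std::exp(-phi));  // map the score to a probability
}
```

    An instance with fewer than two features has no pairs, so the score is 0 and the sketch returns 0.5.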



OpenMP
======

We use OpenMP to do parallelization. If OpenMP is not available on your
platform, then please comment out the following lines in the Makefile.

    DFLAG += -DUSEOMP
    CXXFLAGS += -fopenmp

Note: Please always run `make clean all' if these flags are changed.


Building macOS Binaries
=======================

Apple clang (use libomp)

    brew install libomp
    make OMP_CXXFLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" OMP_LDFLAGS="-L$(brew --prefix libomp)/lib -lomp"

Using gcc (installed by homebrew)

    brew install gcc
    make CXX="g++-8"

Note: replace "8" with the version of gcc installed on your machine.

Building Windows Binaries
=========================

To build them via command-line tools of Visual C++, use the following steps:

1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and
go to LIBFFM directory. If environment variables of VC++ have not been set,
type

"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

You may have to modify the above command according to which version of VC++
you use or where it is installed.

2. Type

nmake -f Makefile.win clean all



Contributors
============

Yu-Chin Juan, Wei-Sheng Chin, and Yong Zhuang

For questions, comments, feature requests, or bug reports, please send your email to

    Yu-Chin (guestwalk@gmail.com)