Implement new save_raw in Python. (#7572)
* Expose the new C API function to Python.
* Remove old document and helper script.
* Small optimization to the `save_raw` and Json ctors.
trivialfis committed Jan 18, 2022
1 parent 9f20a33 commit dac9eb1
Showing 8 changed files with 104 additions and 150 deletions.
79 changes: 0 additions & 79 deletions doc/python/convert_090to100.py

This file was deleted.

41 changes: 16 additions & 25 deletions doc/tutorials/saving_model.rst
@@ -2,16 +2,18 @@
Introduction to Model IO
########################

In XGBoost 1.0.0, we introduced experimental support of using `JSON
In XGBoost 1.0.0, we introduced support of using `JSON
<https://www.json.org/json-en.html>`_ for saving/loading XGBoost models and related
hyper-parameters for training, aiming to replace the old binary internal format with an
open format that can be easily reused. The support for binary format will be continued in
the future until JSON format is no-longer experimental and has satisfying performance.
This tutorial aims to share some basic insights into the JSON serialisation method used in
XGBoost. Without explicitly mentioned, the following sections assume you are using the
JSON format, which can be enabled by providing the file name with ``.json`` as file
extension when saving/loading model: ``booster.save_model('model.json')``. More details
below.
open format that can be easily reused. Later, in XGBoost 1.6.0, support for
`Universal Binary JSON <https://ubjson.org/>`__ was added as an optimization for more
efficient model IO. The two formats share the same document structure but use different
representations, and we refer to them collectively as the JSON format. This tutorial aims
to share some basic insights into the JSON serialisation method used in XGBoost. Unless
explicitly mentioned otherwise, the following sections assume you are using one of these
two output formats, which can be enabled by providing a file name with ``.json`` (or
``.ubj`` for binary JSON) as the file extension when saving/loading a model:
``booster.save_model('model.json')``. More details below.

Before we get started, XGBoost is a gradient boosting library with focus on tree model,
which means inside XGBoost, there are 2 distinct parts:
@@ -53,7 +55,8 @@ Other language bindings are still working in progress.
based serialisation methods.

To enable JSON format support for model IO (saving only the trees and objective), provide
a filename with ``.json`` as file extension:
a filename with ``.json`` or ``.ubj`` as file extension; the latter is the extension for
`Universal Binary JSON <https://ubjson.org/>`__:

.. code-block:: python
:caption: Python
@@ -65,7 +68,7 @@ a filename with ``.json`` as file extension:
xgb.save(bst, 'model_file_name.json')
While for memory snapshot, JSON is the default starting with xgboost 1.3.
While for memory snapshot, UBJSON is the default starting with xgboost 1.6.

***************************************************************
A note on backward compatibility of models and memory snapshots
@@ -105,15 +108,10 @@ Loading pickled file from different version of XGBoost

As noted, pickled model is neither portable nor stable, but in some cases the pickled
models are valuable. One way to restore it in the future is to load it back with that
specific version of Python and XGBoost, export the model by calling `save_model`. To help
easing the mitigation, we created a simple script for converting pickled XGBoost 0.90
Scikit-Learn interface object to XGBoost 1.0.0 native model. Please note that the script
suits simple use cases, and it's advised not to use pickle when stability is needed. It's
located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See comments in
the script for more details.
specific version of Python and XGBoost, export the model by calling `save_model`.

A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
able to install an older version of XGBoost using the ``remotes`` package:
A similar procedure may be used to recover the model persisted in an old RDS file. In R,
you are able to install an older version of XGBoost using the ``remotes`` package:

.. code-block:: r
@@ -244,10 +242,3 @@ leaf directly, instead it saves the weights as a separated array.

.. include:: ../model.schema
:code: json

************
Future Plans
************

Right now using the JSON format incurs longer serialisation time, we have been working on
optimizing the JSON implementation to close the gap between binary format and JSON format.
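
The tutorial text in this diff states that JSON and Universal Binary JSON share the same document structure but differ in representation. As a rough illustration of that point, the sketch below encodes one small Python document both ways. It is a minimal, hand-written encoder for a subset of the UBJSON types (null, booleans, integers, strings, arrays, objects) and is not XGBoost's `UBJWriter`; the names `encode_int` and `encode_ubjson` are hypothetical, and the byte layout XGBoost actually emits for a model may use different integer widths or typed arrays.

```python
import json
import struct

# (marker, struct format, min, max) for UBJSON integer types, smallest first.
_INT_TYPES = [
    (b"i", ">b", -(2**7), 2**7 - 1),    # int8
    (b"U", ">B", 0, 2**8 - 1),          # uint8
    (b"I", ">h", -(2**15), 2**15 - 1),  # int16
    (b"l", ">i", -(2**31), 2**31 - 1),  # int32
    (b"L", ">q", -(2**63), 2**63 - 1),  # int64
]


def encode_int(value: int) -> bytes:
    """Encode an integer with the first (smallest) UBJSON type that fits."""
    for marker, fmt, lo, hi in _INT_TYPES:
        if lo <= value <= hi:
            return marker + struct.pack(fmt, value)
    raise OverflowError(f"integer out of int64 range: {value}")


def encode_ubjson(value) -> bytes:
    """Encode a small subset of Python values as UBJSON bytes."""
    if value is None:
        return b"Z"
    if isinstance(value, bool):  # must come before the int check
        return b"T" if value else b"F"
    if isinstance(value, int):
        return encode_int(value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"S" + encode_int(len(raw)) + raw
    if isinstance(value, list):
        return b"[" + b"".join(encode_ubjson(v) for v in value) + b"]"
    if isinstance(value, dict):
        out = b"{"
        for key, item in value.items():
            raw_key = key.encode("utf-8")
            # Object keys are length-prefixed but carry no "S" marker.
            out += encode_int(len(raw_key)) + raw_key + encode_ubjson(item)
        return out + b"}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


doc = {"a": 1}
text_form = json.dumps(doc, separators=(",", ":"))  # '{"a":1}'
binary_form = encode_ubjson(doc)                    # b'{i\x01ai\x01}'
```

Both outputs describe the same object with one key; only the byte-level representation differs, which is why the tutorial can treat the two as one "JSON format".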
32 changes: 15 additions & 17 deletions include/xgboost/json.h
@@ -89,9 +89,10 @@ class JsonString : public Value {
JsonString(std::string const& str) : // NOLINT
Value(ValueKind::kString), str_{str} {}
JsonString(std::string&& str) noexcept : // NOLINT
Value(ValueKind::kString), str_{std::move(str)} {}
JsonString(JsonString&& str) noexcept : // NOLINT
Value(ValueKind::kString), str_{std::move(str.str_)} {}
Value(ValueKind::kString), str_{std::forward<std::string>(str)} {}
JsonString(JsonString&& str) noexcept : Value(ValueKind::kString) { // NOLINT
std::swap(str.str_, this->str_);
}

void Save(JsonWriter* writer) const override;

@@ -111,8 +112,8 @@ class JsonArray : public Value {

public:
JsonArray() : Value(ValueKind::kArray) {}
JsonArray(std::vector<Json>&& arr) noexcept : // NOLINT
Value(ValueKind::kArray), vec_{std::move(arr)} {}
JsonArray(std::vector<Json>&& arr) noexcept // NOLINT
: Value(ValueKind::kArray), vec_{std::forward<std::vector<Json>>(arr)} {}
JsonArray(std::vector<Json> const& arr) : // NOLINT
Value(ValueKind::kArray), vec_{arr} {}
JsonArray(JsonArray const& that) = delete;
@@ -381,10 +382,9 @@ class Json {
return *this;
}
// array
explicit Json(JsonArray list) :
ptr_ {new JsonArray(std::move(list))} {}
Json& operator=(JsonArray array) {
ptr_.reset(new JsonArray(std::move(array)));
explicit Json(JsonArray&& list) : ptr_{new JsonArray(std::forward<JsonArray>(list))} {}
Json& operator=(JsonArray&& array) {
ptr_.reset(new JsonArray(std::forward<JsonArray>(array)));
return *this;
}
// typed array
@@ -397,17 +397,15 @@ class Json {
return *this;
}
// object
explicit Json(JsonObject object) :
ptr_{new JsonObject(std::move(object))} {}
Json& operator=(JsonObject object) {
ptr_.reset(new JsonObject(std::move(object)));
explicit Json(JsonObject&& object) : ptr_{new JsonObject(std::forward<JsonObject>(object))} {}
Json& operator=(JsonObject&& object) {
ptr_.reset(new JsonObject(std::forward<JsonObject>(object)));
return *this;
}
// string
explicit Json(JsonString str) :
ptr_{new JsonString(std::move(str))} {}
Json& operator=(JsonString str) {
ptr_.reset(new JsonString(std::move(str)));
explicit Json(JsonString&& str) : ptr_{new JsonString(std::forward<JsonString>(str))} {}
Json& operator=(JsonString&& str) {
ptr_.reset(new JsonString(std::forward<JsonString>(str)));
return *this;
}
// bool
2 changes: 2 additions & 0 deletions include/xgboost/learner.h
@@ -45,6 +45,8 @@ enum class PredictionType : std::uint8_t { // NOLINT
struct XGBAPIThreadLocalEntry {
/*! \brief result holder for returning string */
std::string ret_str;
/*! \brief result holder for returning raw buffer */
std::vector<char> ret_char_vec;
/*! \brief result holder for returning strings */
std::vector<std::string> ret_vec_str;
/*! \brief result holder for returning string pointers */
42 changes: 32 additions & 10 deletions python-package/xgboost/core.py
@@ -2135,9 +2135,15 @@ def save_model(self, fname: Union[str, os.PathLike]) -> None:
The model is saved in an XGBoost internal format which is universal among the
various XGBoost interfaces. Auxiliary attributes of the Python Booster object
(such as feature_names) will not be saved when using binary format. To save those
attributes, use JSON instead. See :doc:`Model IO </tutorials/saving_model>` for
more info.
(such as feature_names) will not be saved when using binary format. To save
those attributes, use JSON/UBJ instead. See :doc:`Model IO
</tutorials/saving_model>` for more info.
.. code-block:: python
model.save_model("model.json")
# or
model.save_model("model.ubj")
Parameters
----------
@@ -2152,18 +2158,28 @@ def save_model(self, fname: Union[str, os.PathLike]) -> None:
else:
raise TypeError("fname must be a string or os PathLike")

def save_raw(self) -> bytearray:
def save_raw(self, raw_format: str = "deprecated") -> bytearray:
"""Save the model to a in memory buffer representation instead of file.
Parameters
----------
raw_format :
Format of output buffer. Can be `json`, `ubj` or `deprecated`. Right now
the default is `deprecated` but it will be changed to `ubj` (universal binary
json) in the future.
Returns
-------
a in memory buffer representation of the model
An in memory buffer representation of the model
"""
length = c_bst_ulong()
cptr = ctypes.POINTER(ctypes.c_char)()
_check_call(_LIB.XGBoosterGetModelRaw(self.handle,
ctypes.byref(length),
ctypes.byref(cptr)))
config = from_pystr_to_cstr(json.dumps({"format": raw_format}))
_check_call(
_LIB.XGBoosterSaveModelToBuffer(
self.handle, config, ctypes.byref(length), ctypes.byref(cptr)
)
)
return ctypes2buffer(cptr, length.value)

def load_model(self, fname: Union[str, bytearray, os.PathLike]) -> None:
@@ -2173,8 +2189,14 @@ def load_model(self, fname: Union[str, bytearray, os.PathLike]) -> None:
The model is loaded from XGBoost format which is universal among the various
XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as
feature_names) will not be loaded when using binary format. To save those
attributes, use JSON instead. See :doc:`Model IO </tutorials/saving_model>` for
more info.
attributes, use JSON/UBJ instead. See :doc:`Model IO </tutorials/saving_model>`
for more info.
.. code-block:: python
model.load_model("model.json")
# or
model.load_model("model.ubj")
Parameters
----------
26 changes: 16 additions & 10 deletions src/c_api/c_api.cc
@@ -971,28 +971,34 @@ XGB_DLL int XGBoosterSaveModelToBuffer(BoosterHandle handle, char const *json_co
auto format = RequiredArg<String>(config, "format", __func__);

auto *learner = static_cast<Learner *>(handle);
std::string &raw_str = learner->GetThreadLocal().ret_str;
raw_str.clear();

learner->Configure();

auto save_json = [&](std::ios::openmode mode) {
std::vector<char> &raw_char_vec = learner->GetThreadLocal().ret_char_vec;
Json out{Object{}};
learner->SaveModel(&out);
Json::Dump(out, &raw_char_vec, mode);
*out_dptr = dmlc::BeginPtr(raw_char_vec);
*out_len = static_cast<xgboost::bst_ulong>(raw_char_vec.size());
};

Json out{Object{}};
if (format == "json") {
learner->SaveModel(&out);
Json::Dump(out, &raw_str);
save_json(std::ios::out);
} else if (format == "ubj") {
learner->SaveModel(&out);
Json::Dump(out, &raw_str, std::ios::binary);
save_json(std::ios::binary);
} else if (format == "deprecated") {
WarnOldModel();
auto &raw_str = learner->GetThreadLocal().ret_str;
raw_str.clear();
common::MemoryBufferStream fo(&raw_str);
learner->SaveModel(&fo);
*out_dptr = dmlc::BeginPtr(raw_str);
*out_len = static_cast<xgboost::bst_ulong>(raw_str.size());
} else {
LOG(FATAL) << "Unknown format: `" << format << "`";
}

*out_dptr = dmlc::BeginPtr(raw_str);
*out_len = static_cast<xgboost::bst_ulong>(raw_str.length());

API_END();
}
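
For readers following the Python-to-C hand-off, here is a rough, standard-library-only sketch of the two halves shown in this commit: `save_raw` serialises `{"format": raw_format}` to a JSON config string, and `XGBoosterSaveModelToBuffer` branches on that `format` value. The helper names `make_save_raw_config` and `resolve_writer` are hypothetical and exist only for illustration; the real dispatch happens in C++ and returns a buffer, not a label.

```python
import json


def make_save_raw_config(raw_format: str = "deprecated") -> str:
    """Build the JSON config string, mirroring what save_raw passes down."""
    return json.dumps({"format": raw_format})


def resolve_writer(config: str) -> str:
    """Mimic the C API branch on `format`; return a label for the writer."""
    parsed = json.loads(config)
    if "format" not in parsed:
        # The C API treats `format` as a required argument.
        raise KeyError("`format` is a required argument")
    fmt = parsed["format"]
    if fmt == "json":
        return "text JSON"
    if fmt == "ubj":
        return "binary JSON (UBJSON)"
    if fmt == "deprecated":
        # The C++ branch calls WarnOldModel() before writing.
        return "legacy binary"
    raise ValueError(f"Unknown format: `{fmt}`")


config = make_save_raw_config("ubj")
writer = resolve_writer(config)
```

Passing the format inside a JSON config rather than as a positional C argument means new options can be added later without changing the function signature; that reading is an inference from the code here, not something the commit message states.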

19 changes: 10 additions & 9 deletions src/common/json.cc
@@ -195,11 +195,12 @@ Json& Value::operator[](int) {
}

// Json Object
JsonObject::JsonObject(JsonObject && that) noexcept :
Value(ValueKind::kObject), object_{std::move(that.object_)} {}
JsonObject::JsonObject(JsonObject&& that) noexcept : Value(ValueKind::kObject) {
std::swap(that.object_, this->object_);
}

JsonObject::JsonObject(std::map<std::string, Json> &&object) noexcept
: Value(ValueKind::kObject), object_{std::move(object)} {}
JsonObject::JsonObject(std::map<std::string, Json>&& object) noexcept
: Value(ValueKind::kObject), object_{std::forward<std::map<std::string, Json>>(object)} {}

bool JsonObject::operator==(Value const& rhs) const {
if (!IsA<JsonObject>(&rhs)) {
@@ -220,8 +221,9 @@ bool JsonString::operator==(Value const& rhs) const {
void JsonString::Save(JsonWriter* writer) const { writer->Visit(this); }

// Json Array
JsonArray::JsonArray(JsonArray && that) noexcept :
Value(ValueKind::kArray), vec_{std::move(that.vec_)} {}
JsonArray::JsonArray(JsonArray&& that) noexcept : Value(ValueKind::kArray) {
std::swap(that.vec_, this->vec_);
}

bool JsonArray::operator==(Value const& rhs) const {
if (!IsA<JsonArray>(&rhs)) {
@@ -696,6 +698,7 @@ void Json::Dump(Json json, std::string* str, std::ios::openmode mode) {
}

void Json::Dump(Json json, std::vector<char>* str, std::ios::openmode mode) {
str->clear();
if (mode & std::ios::binary) {
UBJWriter writer{str};
writer.Save(json);
@@ -768,9 +771,7 @@ std::string UBJReader::DecodeStr() {
str.resize(bsize);
auto ptr = raw_str_.c_str() + cursor_.Pos();
std::memcpy(&str[0], ptr, bsize);
for (int64_t i = 0; i < bsize; ++i) {
this->cursor_.Forward();
}
this->cursor_.Forward(bsize);
return str;
}
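
The `DecodeStr` change above replaces a loop of single-step cursor advances with one `Forward(bsize)` call. The Python sketch below shows the same idea on a length-prefixed string. The `Cursor` class and `decode_str` function are hypothetical stand-ins, and the wire layout is an assumption made for the example (a big-endian int64 length with no type marker, followed by the UTF-8 bytes); it is not taken from XGBoost's reader.

```python
import struct


class Cursor:
    """Tracks the read position inside a byte buffer."""

    def __init__(self) -> None:
        self.pos = 0

    def forward(self, n: int = 1) -> None:
        """Advance the position by n bytes in a single step."""
        self.pos += n


def decode_str(raw: bytes, cursor: Cursor) -> str:
    """Read an int64-length-prefixed UTF-8 string starting at the cursor."""
    (size,) = struct.unpack_from(">q", raw, cursor.pos)
    cursor.forward(8)  # skip the 8-byte length prefix
    start = cursor.pos
    text = raw[start:start + size].decode("utf-8")
    cursor.forward(size)  # one jump instead of `size` one-byte steps
    return text


raw = struct.pack(">q", 5) + b"helloXYZ"
cursor = Cursor()
decoded = decode_str(raw, cursor)
```

After the call the cursor sits just past the string payload, which is where the old per-byte loop would also have left it.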

