Antonin Delpeuch edited this page Apr 10, 2023 · 1 revision

Main changes

  • This new version uses a different workspace: your projects from OpenRefine 3.x will not appear in this version. They will not be deleted though: you can always open them again by running OpenRefine 3.x. Project archives exported from OpenRefine 3.x can be read in OpenRefine 4.x, but the operation history will be discarded.
  • Project data no longer needs to fit in the working memory (RAM) of your machine. This makes it easier to work on large datasets. Note that some importers or operations will still load the entire dataset in memory: if you are limited by those, it might still help to increase the memory allocated to OpenRefine. (#242)
  • It is possible to execute OpenRefine operations in Apache Spark (#1433). The execution engine used by OpenRefine is currently selected at startup with the -r (Unix) or /r (Windows) parameter. This is expected to change before a stable release, as Spark support will be moved to an extension (see #4396).
  • Facet statistics are computed on a sample of rows by default. The size of the sample can be configured in the facet panel.
  • The results of long-running processes such as reconciliation or fetching external data are not lost if OpenRefine is stopped while such processes are running. The processes are resumed from their previous progress when the project is reopened. (#87)
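
To illustrate what computing facet statistics on a sample means, here is a minimal sketch. It is not OpenRefine's implementation: the function name `estimate_facet_counts`, the choice of taking the first rows as the sample, and the linear scaling of counts are all assumptions made for illustration.

```python
from collections import Counter
from itertools import islice


def estimate_facet_counts(rows, column, sample_size, total_rows):
    """Estimate how often each value of `column` occurs in the full dataset.

    Hypothetical helper: counts values in the first `sample_size` rows only,
    then scales the counts up to `total_rows`. The sampling strategy actually
    used by OpenRefine may differ.
    """
    sample = list(islice(rows, sample_size))
    if not sample:
        return {}
    counts = Counter(row[column] for row in sample)
    scale = total_rows / len(sample)
    # Scaled counts are estimates, not exact totals.
    return {value: round(n * scale) for value, n in counts.items()}


# Example: 100 rows, of which only the first 20 are inspected.
rows = [{"city": "Paris"}, {"city": "Lyon"}, {"city": "Paris"}, {"city": "Paris"}] * 25
estimates = estimate_facet_counts(iter(rows), "city", sample_size=20, total_rows=len(rows))
```

A larger sample gives more reliable estimates at the cost of reading more rows, which is the trade-off that the sample size setting in the facet panel exposes.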

Smaller improvements and bug fixes

  • When browsing a project and applying row-wise changes to it, the paging position is preserved after the changes are applied (#33). The same holds when an operation is undone or redone (#572).
  • Long-running processes can be paused (#5183).
  • When using the undo/redo feature to go back in time, making a new change warns the user about data loss before overwriting the undone transformations (#3184).
  • The CSV/TSV importer supports a new option which controls whether rows are allowed to span multiple lines of the source file.
  • The notion of column groups (the orange bars that appear at the top of the grid when using the JSON or XML importers) has been removed (see discussion).
  • The numbering and boundaries of records have changed slightly. (TODO: expand on this)

For developers

Most extensions will be incompatible with this new version, as many incompatible changes have been introduced.

  • OpenRefine now uses the org.openrefine namespace instead of com.google.refine.
  • The code base was split into more granular Maven modules. Those modules are published to Maven Central to ease the development of extensions (currently in the snapshot repository as their structure is not final yet). Feedback about the module structure is welcome.
  • The architecture of the data processing engine changed to make it extensible. Workflows can be executed fully in memory, off disk, in Apache Spark, or in other execution engines if the corresponding runners are implemented. Feedback about the data model API is welcome.
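
For extension authors, the namespace change means that references to com.google.refine need to be updated. The following is a minimal sketch of a first migration pass over source text. The helper name `migrate_namespace` is hypothetical, and the class path in the example is for illustration only: this script only rewrites the package prefix and does not account for classes that were moved, renamed or removed by the other incompatible changes.

```python
import re

NEW_PREFIX = "org.openrefine"

# Match the old prefix only as a whole dotted segment, so that an unrelated
# identifier such as "com.google.refinery" is left untouched.
_OLD_PREFIX_PATTERN = re.compile(r"\bcom\.google\.refine\b")


def migrate_namespace(source: str) -> str:
    """Replace the old package prefix with the new one in a source string.

    Hypothetical helper: a purely textual rewrite, to be followed by a
    compile-and-fix cycle against the new modules.
    """
    return _OLD_PREFIX_PATTERN.sub(NEW_PREFIX, source)


# Illustrative import line; whether this class keeps the same relative
# path in the new version is an assumption.
old_line = "import com.google.refine.model.Project;"
new_line = migrate_namespace(old_line)
```

A textual rewrite like this is only a starting point, since the module split and the new data processing architecture change more than package names.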