Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema validation error (and diff error) when collation is set #5338

Closed
BenMorel opened this issue Mar 29, 2022 · 17 comments · Fixed by #5471 or #5523
Closed

Schema validation error (and diff error) when collation is set #5338

BenMorel opened this issue Mar 29, 2022 · 17 comments · Fixed by #5471 or #5523

Comments

@BenMorel
Copy link
Contributor

BenMorel commented Mar 29, 2022

Bug Report

Not sure if this belongs here or to ORM / Migrations, but this looks directly related to #5322.

Q A
Version 3.3.4

Summary

When an entity has a collation set for a string column:

class User
{
    #[ORM\Column(type: 'string', length: 100, options: ['collation' => 'ascii_bin'])]
    public string $email;
}

doctrine:schema:validate always reports:

[ERROR] The database schema is not in sync with the current mapping file.

and doctrine:migrations:diff generates the same migration again and again, even after executing it.

How to reproduce

Small reproduce project: https://github.com/BenMorel/doctrine-bug-2022-03-29

  • Set DATABASE_URL in .env or .env.local
  • Run bin/console doctrine:database:create
  • Run bin/console doctrine:migrations:diff
    Generated new migration class to "/home/benjamin/src/roketto/packages/backend/bug/migrations/Version20220329140701.php"
    
    To run just this migration for testing purposes, you can use migrations:execute --up 'DoctrineMigrations\\Version20220329140701'
    
    To revert the migration you can use migrations:execute --down 'DoctrineMigrations\\Version20220329140701'
    
  • Execute the migration: bin/console doctrine:migrations:migrate -n
    [notice] Migrating up to DoctrineMigrations\Version20220329140701
    [notice] finished in 18.6ms, used 14M memory, 1 migrations executed, 1 sql queries
    
  • Check the schema: bin/console doctrine:schema:validate
    Mapping
    -------
    
    [OK] The mapping files are correct.
    
    Database
    --------
    
    [ERROR] The database schema is not in sync with the current mapping file.                                              
    

At this point, running doctrine:migrations:diff again generates the same migration again and again:

ALTER TABLE user CHANGE email email VARCHAR(100) NOT NULL COLLATE `ascii_bin`
@morozov
Copy link
Member

morozov commented Mar 30, 2022

@BenMorel could you try reproducing the issue with a script like in #5322 without any additional dependencies? Otherwise, it's indeed not clear which of the libraries the issue belongs to.

@ruudk
Copy link

ruudk commented Apr 7, 2022

I have a similar issue that seems to happen when you define a charset and/or collation on a #[Column] that is the same as the charset and/or collate from the #Table].

When the column specifies the same as the Table defaults, they are removed by Doctrine\DBAL\Platforms\MySQL\Comparator::normalizeColumns. When comparing them to the database schema, it's always different that causes the diff tool to propose an update.

I think this normalization is naive. When a user explicitly added the values, they should not be removed (normalized).

You could say: "Why define charset/collate on the column when it's the same as the table default?" Well, if you are coming from a legacy application you often have old tables that are still in a different charset/collation. The only way to get JoinColumns to work, is when the foreign column and local column have the same charset/collation. That used to work when defining the charset/collation on the column that is referenced, it would then use that for the other column as well.

@BenMorel
Copy link
Contributor Author

Found a workaround: always define 'charset' in addition to 'collation':

-#[ORM\Column(type: 'string', length: 20, options: ['collation' => 'ascii_bin'])]
+#[ORM\Column(type: 'string', length: 20, options: ['charset' => 'ascii', 'collation' => 'ascii_bin'])]

And the garbage output of doctrine:migrations:diff disappears!

It looks like the diff algorithm fails to infer the charset from the collation when only the latter is set.

@Dherlou
Copy link

Dherlou commented Jun 23, 2022

Found a workaround: always define 'charset' in addition to 'collation'

Unfortunately, that workaround does not work in my case.

I am using doctrine/dbal=3.3.6 in combination with doctrine/orm=2.12.2 and doctrine/migrations=3.5.1.
Both my whole database and the entity's table have the default charset and collation of utf8mb4 and utf8mb4_unicode_ci, respectively.

I have the following ORM-Definition in my Application entity:

/**
 * @ORM\Column(type="string", nullable=false, options={"collation"="utf8mb4_unicode_ci"})
 */
private string $status;

This results in the following generated diff:

// up
$this->addSql('ALTER TABLE application CHANGE status status VARCHAR(255) NOT NULL COLLATE `utf8mb4_unicode_ci`');
// down
$this->addSql('ALTER TABLE application CHANGE status status VARCHAR(255) NOT NULL');

If I apply the workaround and change the ORM definition to:

+@ORM\Column(type="string", nullable=false, options={"charset"="utf8mb4", "collation"="utf8mb4_unicode_ci"})
-@ORM\Column(type="string", nullable=false, options={"collation"="utf8mb4_unicode_ci"})

The diff tool generates the same faulty migrations, but with the added charset definitions.

// up
+$this->addSql('ALTER TABLE application CHANGE status status VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL COLLATE `utf8mb4_unicode_ci`');
-$this->addSql('ALTER TABLE application CHANGE status status VARCHAR(255) NOT NULL COLLATE `utf8mb4_unicode_ci`');
// down
$this->addSql('ALTER TABLE application CHANGE status status VARCHAR(255) NOT NULL');

@morozov
Copy link
Member

morozov commented Jun 23, 2022

Note that it would help more if anyone affected by this issue could provide a way to reproduce it instead of exchanging workarounds.

@Dherlou
Copy link

Dherlou commented Jun 23, 2022

Note that it would help more if anyone affected by this issue could provide a way to reproduce it instead of exchanging workarounds.

@BenMorel already set up a reproduction example and explained it in the initial post:

How to reproduce
Small reproduce project: https://github.com/BenMorel/doctrine-bug-2022-03-29

@morozov
Copy link
Member

morozov commented Jun 23, 2022

@Dherlou the project in the initial post reproduces the issue by using quite some dependencies that the DBAL doesn't depend on. In order to address the issue in the DBAL, we need to reproduce it by using only the DBAL APIs. See #5338 (comment).

@Dherlou
Copy link

Dherlou commented Jun 24, 2022

@morozov Per your request I created a reproduction example repo: https://github.com/Dherlou/doctrine-collation-bug-repro
The details are written in its README.md. I implemented a DBAL only test and another test with ORM+Migrations to double-check (you do not have to use this one and its dependencies, if you don't need it). The DBAL-only test should be sufficient.

As far as I understand it, the charset and collation options are somewhat vendor-specific. I am using MySQL 8.0 (and MariaDB 10.6) and the MySQL-Docs for column charsets state the following:

  • if both charset and collation are specified, they are used as is
  • if only the charset is specified, the default collation of the charset is used
  • if only the collation is specified, the charset is inferred from it
  • if neither is specified, the table defaults are used

I think that the case in bold is not implemented in DBAL. If I leave out the charset definition, it is detected as a diff, because the unspecified charset in my column definition (doesn't matter if it comes from ORM or directly from the DBAL-only code) is not equal to the current charset of the database table (in my case utf8mb4). Therefore a diff like:

ALTER TABLE Test CHANGE name name VARCHAR(255) NOT NULL COLLATE `utf8mb4_unicode_ci`

without a charset definition (= "null") is generated. If it is executed, it doesn't change the table definition, because the collation is already the same and a charset change is not specified. This way it can be run in an endless loop of diff generation <--> diff execution.

@morozov
Copy link
Member

morozov commented Jun 29, 2022

Alright, so this is the reproducer from the test repo:

<?php

use Doctrine\DBAL\Tools\Console\ConnectionProvider;
use Doctrine\DBAL\Schema\Schema;
use Doctrine\DBAL\Types\Types;

require 'vendor/autoload.php';

$connection = /* get connection */;

$connection->executeQuery('CREATE TABLE IF NOT EXISTS Test (id INT AUTO_INCREMENT NOT NULL, name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE `utf8mb4_unicode_ci` NOT NULL, PRIMARY KEY(id)) DEFAULT CHARACTER SET utf8 COLLATE `utf8_unicode_ci` ENGINE = InnoDB;');
$schemaManager = $connection->createSchemaManager();
$fromSchema = $schemaManager->createSchema();

$toSchema = new Schema([], [], $schemaManager->createSchemaConfig());
$table = $toSchema->createTable('test');
$table->addColumn(
    'id',
    Types::INTEGER,
    [
        'autoincrement' => true
    ]
);
$table->setPrimaryKey(["id"]);
$table->addColumn(
    'name',
    Types::STRING,
    [
        'length' => 255,
        'notnull' => true,
        'platformOptions' => [
            'version' => false
        ],
        'customSchemaOptions' => [
            // 'charset' => 'utf8mb4', // this is the workaround. uncommenting it results in no diff
            'collation' => 'utf8mb4_unicode_ci'
        ]
    ]
);

$up = $schemaManager->createComparator()->compareSchemas($fromSchema, $toSchema);
$diffSql = $up->toSql($connection->getDatabasePlatform());
echo("Up Diffs:\n");
var_dump($diffSql);

$down = $schemaManager->createComparator()->compareSchemas($toSchema, $fromSchema);
$diffSql = $down->toSql($connection->getDatabasePlatform());
echo("Down Diffs:\n");
var_dump($diffSql);

@morozov
Copy link
Member

morozov commented Jun 30, 2022

So the charset is specified in the column definition in SQL:

name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE `utf8mb4_unicode_ci`

And the charset is not specified in the column definition in the DBAL:

$table->addColumn(
    'name',
    Types::STRING,
    [
        'length' => 255,
        'notnull' => true,
        'platformOptions' => [
            'version' => false
        ],
        'customSchemaOptions' => [
            // 'charset' => 'utf8mb4', // this is the workaround. uncommenting it results in no diff
            'collation' => 'utf8mb4_unicode_ci'
        ]
    ]
);

The resulting migration up will remove the charset (specified → not specified):

ALTER TABLE Test CHANGE name name VARCHAR(255) NOT NULL COLLATE `utf8mb4_unicode_ci`

The resulting migration down will add the charset (not specified → specified):

ALTER TABLE test CHANGE name name VARCHAR(255) CHARACTER SET utf8mb4 NOT NULL COLLATE `utf8mb4_unicode_ci`

What is the problem? Why do you consider specifying the charset a workaround?

@Dherlou
Copy link

Dherlou commented Jun 30, 2022

The resulting migration up will remove the charset (specified → not specified):

As far as I understand the general concept, a charset is the encoding of characters (i.e. how they are stored as bytes) whereas the collations are the ordering/sorting rules for their respective charset. A collation depends on a charset and cannot work without it.

Therefore (at least in the MySQL/MariaDB world), it is not possible to "remove the charset". Specifying the charset is optional and therefore my example DBAL/ORM code does not specify a charset explicitly. If it is not specified, it is inferred from the provided collation (in this example: utf8mb4_unicode_ci --> utf8mb4), because each collation belongs to one charset, so this inferring strategy is trivial. As a result, executing this up migration doesn't change anything and future diffs still contain the same up migration, because the state is the same as before.

I find it hard to differentiate, whether this is an actual "bug" or just a "missing vendor-specific convenience feature", but with the current state of implementation it is required to explicitly specify the charset (in contrast to the MySQL specification where it is optional) to have DBAL work as expected.
Of course, talking about charsets/collations is very vendor-specific and other DBMS might handle this totally different.

@morozov
Copy link
Member

morozov commented Jun 30, 2022

Therefore (at least in the MySQL/MariaDB world), it is not possible to "remove the charset".

My understanding of the meaning of the charset/collation is the same as yours. By "removing" I mean removing from the definition, not from the column as such.

Specifying the charset is optional and therefore my example DBAL/ORM code does not specify a charset explicitly.

Why do you specify it in SQL then? Will the issue still persist if it's omitted in the SQL as well?

@Dherlou
Copy link

Dherlou commented Jun 30, 2022

Why do you specify it in SQL then? Will the issue still persist if it's omitted in the SQL as well?

Fair point. I removed the charset definition from the table creation SQL statement. (It also still had a charset/collation of utf8(_unicode_ci) instead of utf8mb4(_unicode_ci) specified on table level, because I replicated the state of our production system before and forgot to push that change.)
Bottom line is, it doesn't change anything, because column definitions take precedence over table definitions and the charset is still inferred from the collation, even if I don't specify it. :)

@morozov
Copy link
Member

morozov commented Jun 30, 2022

Here's a simplified reproducer w/o SQL:

<?php

use Doctrine\DBAL\Schema\Table;
use Doctrine\DBAL\Tools\Console\ConnectionProvider;
use Doctrine\DBAL\Types\Types;

require 'vendor/autoload.php';

$connection = /* get connection */;

$table = new Table('test');
$table->addColumn(
    'name',
    Types::STRING,
    [
        'length' => 255,
        'customSchemaOptions' => [
            'collation' => 'utf8mb4_unicode_ci',
        ]
    ]
);

$connection->executeStatement('DROP TABLE IF EXISTS test');
$schemaManager = $connection->createSchemaManager();
$schemaManager->createTable($table);

$t2 = $schemaManager->listTableDetails('test');
$diff = $schemaManager->createComparator()->diffTable($t2, $table);
var_dump($connection->getDatabasePlatform()->getAlterTableSQL($diff));

// array(1) {
//   [0]=>
//   string(84) "ALTER TABLE test CHANGE name name VARCHAR(255) NOT NULL COLLATE `utf8mb4_unicode_ci`"
// }

[…] column definitions take precedence over table definitions and the charset is still inferred from the collation, even if I don't specify it. :)

Yeah, that's what's happening. We probably need to auto-detect the charset if the collation is specified but the charset isn't using something from the documentation. For instance,

SELECT CHARACTER_SET_NAME
FROM information_schema.COLLATIONS
WHERE COLLATION_NAME = 'utf8mb4_unicode_ci';
-- utf8mb4

It's possible that the same issue will exist if the collation is also omitted but it's inherited from some defaults as it works on SQL Server.

@morozov
Copy link
Member

morozov commented Jul 2, 2022

The issue should be fixed in the 3.3.x branch. Please test and let us know if something still doesn't work as expected.

@morozov morozov added this to the 3.3.8 milestone Jul 2, 2022
@morozov morozov reopened this Jul 3, 2022
@morozov
Copy link
Member

morozov commented Jul 3, 2022

The new test fails on 4.0.x due to the changes implemented in #4644. I will try to improve the fix.

@github-actions
Copy link

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants