Troubleshooting: SQL-Modes (and my sanity)
- The Problem
- Why did the problem not occur on our dev db?
This week, I got confronted with the "ONLY_FULL_GROUP_BY" SQL-Mode, mostly because I wrote a faulty query and introduced configuration errors in our local development stack some weeks prior. Both errors combined lead to some confusion, so I wanted to write down the aftermath to have the ability to come back every couple months to do a facepalm.
I refactored parts of our business logic and slightly altered a jooq-query, after some manual testing everything seemed fine, so I opened a merge-request and ran our unit/integration-tests: 40 Tests failed, all with:
ERROR 1055 (42000): Expression #2 of SELECT [x] is not in GROUP BY clause and contains nonaggregated column 'xxxx' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
I had no idea what only_full_group_by was, and the error message was kinda cryptic for me. Why does it only fail inside our integration tests? A quick google search revealed, that this sql-mode is switched on per default on mysql version >5.7, but we already use version >8.0, so that should not be the issue.
What was the problem in the query
A simplified example: We have a "car-parts" table and a translation table to localize part names:
Part of my refactoring was to only return one localized string from the translation table, instead of returning all available translations in a map, as we did before with json_merge_preserve, concat and group_concat
The new (really simplified) query looked like this:
1 select parts.id, 2 translation.name 3 from parts 4 join translation on (translation.PartID = parts.id and 5 (translation.language = 'de' or translation.language = 'en' )) 6 group by parts.id 7 having parts.id = 1021;
I made 2 mistakes:
- I left the group by statement inside the statement, it is not needed anymore because we don't group anything anymore
- The or statement was plain false (functionally and non-functionally) and introduced the only_full_group_by error.
My goal with this query was to fall back to the default translation in case if there is no translation for the item being fetched, that's completely wrong, and I would need a completely different query to project this logic in code.
Furthermore, the default translation is already inside the parts table, a simple coalesce on the parts.name and translation.translation columns would do the job, it's a little harder to spot inside the real codebase, and by having one translation twice in different tables our database is not normalized, but it makes fixing this query easier, and I still feel ashamed not noticing the presence of default translation in the first place.
The SQL Error
By introducing the or logic inside the join, multiple translations exist for a single part-ID (the id that we group), the group by function cannot decide what translated string to choose from. Mysql versions prior to 5.7 operated in a so called "forgiving mode" and would just select one random translation, the application would receive data from the database, but it would not be consistent and undeterministic.
One solution could be to concatenate multiple results, but as we need only one translation anyway, the mentioned coalesce trick is enough for us. After getting rid of the or statement, only one translation gets returned, and our query works again.
Why did the problem not occur on our dev db?
I set up a local docker container to run mysql on my machine, and we have a docker run command in our README.MD because we made some modifications:
I screwed up while copy-pasting some commands and --sql-mode= sneaked into the docker command, resulting in that our local dev database started without a single active sql-mode . A small, hard to spot change with consequences.
I have some serious doubts in my sanity now and will triple check every change I do from now on ... especially when copy-pasting something. At least I learned something about sql-modes ... the hard way. I'm glad that this project has a lot of integration tests.
1 2 -- create 3 ( 4 id INTEGER PRIMARY KEY, 5 name TEXT NOT NULL 6 ); 7 8 -- create 9 ( 10 id INTEGER PRIMARY KEY, 11 partID INTEGER, 12 lang varchar(21) NOT NULL, 13 name TEXT NOT NULL, 14 constraint uc_valueclass_translation 15 unique (partID, lang) 16 ); 17 18 19 20 -- insert 21 INSERT INTO Parts VALUES (0001, 'A'); 22 INSERT INTO Parts VALUES (0002, 'B'); 23 INSERT INTO Parts VALUES (0003, 'C'); 24 25 INSERT INTO translation VALUES (0001,0001, 'DE', 'de'); 26 INSERT INTO translation VALUES (0002,0001, 'EN', 'en'); 27 28 select 29 Parts.id, 30 translation.name 31 from Parts 32 join translation on (translation.PartID = Parts.id and translation.lang = 'EN' ) 33 group by Parts.id;
Additional note: My assumption why this query still works: Even though translation.name is not dependent on the GROUP BY clause, the query still runs because of our unique constraint. A specific partID-lang combination can only occur once, so the result is always unique per ID. If you remove the constraint the error reappears.