This repository has been archived by the owner on Jul 3, 2024. It is now read-only.

Add a second implementation for pop() for MySQL / MariaDB #161

Closed
wants to merge 7 commits into from

Conversation

MatthiasKuehneEllerhold
Contributor

Should fix #160

MatthiasKuehneEllerhold marked this pull request as draft on August 4, 2020, 10:10
Collaborator

roelvanduijnhoven left a comment

Really impressed with this nice implementation :).

I think there is one case that could very well happen and is not accounted for. But I'm not sure yet how we could handle this cleanly.

The case is: a job gets allocated, but never run. This can happen because the job worker crashed halfway through this process. Now there is a job with status allocating, assigned to a worker id that will never run again.

Some ideas:

  • Run the complete thing in a transaction. But that could possibly introduce the locking we don't want. Not sure if this will work.
  • A job is allocated for a fixed time period. And after that period it is reset to the running state. Our queue could run a query once at start to reset those cases. That would require an additional column with the allocatedAt timestamp.

Maybe there are better ideas!

Also I think we would like to test this strategy. It would be great if we could run the test suite for both styles. Not sure how easy that is. Did you maybe look into this already?

Anyhow: nice work :).

@@ -123,8 +135,39 @@ public function pop(array $options = []): ?JobInterface
// First run garbage collection
$this->purge();

$conn = $this->connection;
$conn->beginTransaction();
if (self::DATABASE_PLATFORM_POSTGRES == $this->connection->getDatabasePlatform()->getName()) {
Collaborator

I think we can test more easily if we make a property on this class that determines the pop strategy. And obviously the factory of this queue will only set the MySQL-style pop if MySQL is used.

Furthermore: I think it makes more sense to only enable the MySQL pop style for MySQL, and use the Postgres style as the default, as it could well be that MySQL is the exception.
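
Something along these lines is what I have in mind, inside DoctrineQueue; note that the names here ($popStrategy, the constants, popViaLockingSelect() / popViaUpdateAllocation()) are purely illustrative and not code from this PR:

    // Illustrative sketch only: a strategy property that the factory sets based on the platform.
    public const POP_STRATEGY_POSTGRES = 'postgres'; // SELECT ... FOR UPDATE style
    public const POP_STRATEGY_MYSQL = 'mysql';       // UPDATE-then-SELECT allocation style

    /** @var string */
    protected $popStrategy = self::POP_STRATEGY_POSTGRES;

    public function setPopStrategy(string $popStrategy): void
    {
        $this->popStrategy = $popStrategy;
    }

    public function pop(array $options = []): ?JobInterface
    {
        // First run garbage collection, as before
        $this->purge();

        if ($this->popStrategy === self::POP_STRATEGY_MYSQL) {
            return $this->popViaUpdateAllocation($options);
        }

        return $this->popViaLockingSelect($options);
    }

The factory would then call setPopStrategy() only when getDatabasePlatform()->getName() reports MySQL, and the test suite could flip the property to exercise both code paths.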

Contributor Author

We have another "if postgres" check in push()... how do we want to handle this?

What other Doctrine-supported engines are there? SQLite doesn't support "SELECT ... FOR UPDATE", so we'd have to use the MySQL pop there too. I don't know if MSSQL, ... correctly support this.

From my perspective the MySQL variant is the most portable, because it uses just simple UPDATEs and SELECTs without any locking or transactions. PostgreSQL has an optimization which we can't assume for all other DB engines.
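
To make concrete what I mean, the MySQL variant is roughly the following three statements; this is only a sketch with plain PDO and invented column / status names, and the actual diff may differ in details:

    // Rough sketch of the UPDATE-then-SELECT allocation; no explicit locking or transaction.
    function popMysqlStyle(\PDO $pdo, string $workerId): ?array
    {
        // 1. Atomically claim at most one pending job for this worker.
        $pdo->prepare(
            "UPDATE queue SET status = 'allocating', workerId = :worker
              WHERE status = 'pending' ORDER BY id ASC LIMIT 1"
        )->execute(['worker' => $workerId]);

        // 2. Fetch whatever we claimed (zero or one row).
        $stmt = $pdo->prepare(
            "SELECT * FROM queue WHERE workerId = :worker AND status = 'allocating' LIMIT 1"
        );
        $stmt->execute(['worker' => $workerId]);
        $job = $stmt->fetch(\PDO::FETCH_ASSOC);

        if ($job === false) {
            return null; // nothing was pending
        }

        // 3. Mark the claimed job as running before executing it.
        $pdo->prepare("UPDATE queue SET status = 'running' WHERE id = :id")
            ->execute(['id' => $job['id']]);

        return $job;
    }

Each statement is atomic on its own, so any engine that supports UPDATE ... LIMIT should be able to run this.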

What do you think?

Collaborator

@MatthiasKuehneEllerhold Agreed! I was not sure about other platforms, but you have a convincing argument here :). So: POSTGRES is indeed the special case.

My reasoning for having the config variable is that we can run PDOSqlite in our test suite using both polling strategies. But we could also try to run an actual Postgres database in the test suite; that way we won't need that switch. Ultimately I still think it makes sense to introduce a config variable for the polling strategy, as that is the easiest path to test this.

The other check can remain the same as far as I'm concerned and check explicitly for PostgreSQL, the reasoning being that without that code it won't work on Postgres.

@roelvanduijnhoven
Collaborator

I'll continue the conversation in this PR. I gave it a thorough review. Let me know if I can assist with anything :).

@MatthiasKuehneEllerhold
Contributor Author

> A job is allocated for a fixed time period. And after that period it is reset to the running state. Our queue could run a query once at start to reset those cases. That would require an additional column with the allocatedAt timestamp.

We're doing this in our implementation with a field named "status_timeout": if a job has been in status "ALLOCATING" or "RUNNING" for more than x minutes, the job is automatically restarted. I didn't port it to this PR because it would mean another field in the table and another check that has to run regularly.
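
For illustration, the restart check is basically one query that has to run regularly, something like this (assuming a \PDO connection in $pdo; the column and status names are just examples, not our exact schema):

    // Re-queue jobs that have been stuck in ALLOCATING / RUNNING past their timeout.
    $pdo->prepare(
        "UPDATE queue
            SET status = 'pending', workerId = NULL
          WHERE status IN ('allocating', 'running')
            AND status_timeout < NOW()"
    )->execute();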

@MatthiasKuehneEllerhold
Contributor Author

Added allocationTimeout.

Still open:

  • DB migration (adding fields workerId and allocationTimeout).

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold I think the migration can simply be stated in the changelog, and we can bump this library by a major version. A migration as part of this code base is hard, because it really depends on what migration strategy you use and whether the SQL user has enough rights to alter tables.

We should however update some files in the repo: https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/data/queue_default.sql and https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/tests/Asset/queue_default.sqlite for example. And we should think about whether a new index is required or needs to be set.

Do you have other questions at this point @MatthiasKuehneEllerhold?

Apart from that I think we need to update / fix the test suite at the very least. And maybe look at testing the new functionality :).

@MatthiasKuehneEllerhold
Contributor Author

I'm currently pretty busy, but I hope I can work on this next week.

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold No problem of course :). Let me know if I can assist in any way.

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold Just a kind reminder: do you still intend to finish this PR? If not, we can see if somebody else is willing to step in, as your work seems promising!

@MatthiasKuehneEllerhold
Contributor Author

I'm really sorry to leave you hanging like this, but things have gotten kind of crazy here and I won't be able to contribute until January at the earliest. If anyone wants to jump in and help out, I'd be glad!
Again: really sorry for this!

@roelvanduijnhoven
Collaborator

Haha, take care @MatthiasKuehneEllerhold. No worries. This PR will be here waiting if you ever feel like finishing it ;).

@MatthiasKuehneEllerhold
Contributor Author

MatthiasKuehneEllerhold commented May 6, 2021

I'm terribly sorry for abandoning you here, but it was a crazy time. I've had a little time to properly review our logs and background queue, and I'm terribly saddened to see that this implementation does not fix the problem I've outlined.
Although the UPDATE locks fewer rows than the SELECT FOR UPDATE did previously, it still locks too many rows. We're still getting the "deadlock while trying to get lock" exception (about ~10 times per day in a queue with ~1200 jobs per day and 12 workers). I've seen clashes between the UPDATE and INSERTs, and between the UPDATE and other UPDATEs that just modify one job. For lack of a proper fix we implemented a lot of code to remedy the fact that sometimes jobs go missing, get stuck, or ...?

This problem does not seem to be solvable with MariaDB alone. I've now had proper time to analyse it and what we can do. What's problematic is SELECT FOR UPDATE locking the whole table, and blanket UPDATEs locking all possible matches.
If we could use a normal SELECT and then UPDATE a single job, that would work. But if 2 workers simultaneously SELECT the same job and try to fetch it, we get multiple executions of the same job! Not what we want! But this way we won't have the locked table / rows in the DB.

So what we've come up with is this: we're using Redis for an exclusive lock on this code path. It boils down to:

while (true) {
    $lock = acquireLock();
    if ($lock) {
        // Only the lock holder may run the SELECT + UPDATE pair.
        $job = $db->query('SELECT job');
        if ($job) {
            $db->query('UPDATE job SET status = running WHERE id = ' . $job);
        }

        releaseLock($lock);
        return $job;
    }

    // Someone else holds the lock: back off and try again.
    sleep(1);
}

So if multiple workers want to pop a new job from the queue, they can do so... in single file! You can only get to the SELECT + UPDATE pair if you hold the lock, and Redis makes sure that only one worker can hold this lock at a time.

This way every INSERT or UPDATE on existing jobs is not impacted. So if the frontend adds a new job it doesn't have to wait, because there is no table locking. Only workers may have to wait a bit, but that's OK.

For more information on exclusive locks in redis you can read the chapter here: https://redislabs.com/ebook/part-2-core-concepts/chapter-6-application-components-in-redis/6-2-distributed-locking/
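
In case it helps, acquireLock() / releaseLock() can be built on a single Redis key. A minimal sketch using the phpredis extension (the key name and TTL are arbitrary choices):

    function acquireLock(\Redis $redis, int $ttl = 30): ?string
    {
        $token = bin2hex(random_bytes(16));

        // SET queue:pop-lock <token> NX EX <ttl>: only succeeds if the key does not exist yet.
        if ($redis->set('queue:pop-lock', $token, ['nx', 'ex' => $ttl])) {
            return $token;
        }

        return null;
    }

    function releaseLock(\Redis $redis, string $token): void
    {
        // Delete the lock only if we still own it (compare-and-delete via a small Lua script).
        $script = 'if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end';
        $redis->eval($script, ['queue:pop-lock', $token], 1);
    }

The TTL makes sure a crashed worker cannot hold the lock forever.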

I'm pretty sure adding Redis exclusive locks is severely out of scope for this library, because for most users it would be easier to just use a different queue provider or database. I'm unsure how PostgreSQL solves this problem, so maybe MariaDB alone is the culprit here and unsuitable for a message queue (a sentiment I've read a lot in the last few days).

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold Great that you found this out before we merged :). And thanks for your response.

You obviously know more about this than me, but hear me out. What about the following strategy? I feel like it would work just fine with MySQL and utilizes the fact that an UPDATE query is atomic.

  1. Each worker gets a unique ID when it starts working.
  2. You do an UPDATE queue SET leasedTo = [unique id] WHERE leasedTo IS NULL AND status = 1 .. LIMIT 1
  3. The job worker queries all jobs that are leased to it, and only proceeds if exactly one result is found.
  4. We update that job: UPDATE queue SET executed = NOW(), leasedTo = NULL WHERE id = [..]
  5. We execute the job.

There needs to be a mechanism that, after some time, will un-lease the jobs that were leased to a job worker that crashed.
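
Roughly, as a sketch (plain PDO, invented column names, $workerId being the unique ID from step 1; essentially the same UPDATE-then-SELECT allocation as in this PR, plus the un-lease step):

    // 2. Lease at most one free job to this worker; the UPDATE itself is atomic.
    $pdo->prepare(
        "UPDATE queue SET leasedTo = :worker, leasedAt = NOW()
          WHERE leasedTo IS NULL AND status = 1 LIMIT 1"
    )->execute(['worker' => $workerId]);

    // 3. Proceed only if exactly one job is leased to us.
    $stmt = $pdo->prepare("SELECT * FROM queue WHERE leasedTo = :worker");
    $stmt->execute(['worker' => $workerId]);
    $jobs = $stmt->fetchAll(\PDO::FETCH_ASSOC);

    if (count($jobs) === 1) {
        // 4. Hand the lease back before actually running the job.
        $pdo->prepare("UPDATE queue SET executed = NOW(), leasedTo = NULL WHERE id = :id")
            ->execute(['id' => $jobs[0]['id']]);

        // 5. Execute the job ...
    }

    // Separately, un-lease jobs whose worker crashed, e.g. once at queue start-up.
    $pdo->exec("UPDATE queue SET leasedTo = NULL
                 WHERE leasedTo IS NOT NULL AND leasedAt < NOW() - INTERVAL 5 MINUTE");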

Successfully merging this pull request may close these issues.

Deadlocks when fetching messages from queue