This repository has been archived by the owner on Jul 3, 2024. It is now read-only.

Add a second implementation for pop() for MySQL / MariaDB #161

Closed
wants to merge 7 commits into from

Conversation

MatthiasKuehneEllerhold
Contributor

Should fix #160

MatthiasKuehneEllerhold marked this pull request as draft on August 4, 2020, 10:10
Collaborator

roelvanduijnhoven left a comment

Really impressed with this nice implementation :).

I think there is one case that could very well happen and is not accounted for. But I'm not sure yet how we could handle this cleanly.

The case is: a job gets allocated, but never run. This can happen because the job worker crashed halfway through this process. Now there is a job with status allocating, assigned to a worker id that will never run again.

Some ideas:

  • Run the complete thing in a transaction. But that could possibly introduce the locking we don't want. Not sure if this will work.
  • A job is allocated for a fixed time period. And after that period it is reset to the running state. Our queue could run a query once at start to reset those cases. That would require an additional column with the allocatedAt timestamp.

Maybe there are better ideas!

Also I think we would like to test this strategy. It would be great if we could run the test suite for both styles. Not sure how easy that is. Did you maybe look into this already?

Anyhow: nice work :).

@@ -123,8 +135,39 @@ public function pop(array $options = []): ?JobInterface
// First run garbage collection
$this->purge();

$conn = $this->connection;
$conn->beginTransaction();
if (self::DATABASE_PLATFORM_POSTGRES == $this->connection->getDatabasePlatform()->getName()) {
Collaborator

I think we can test more easily if we make a property on this class that determines the pop strategy. And obviously the factory of this queue will only set the MySQL-style pop if MySQL is used.

Furthermore: I think it makes more sense to only enable the MySQL pop style for MySQL, and use the Postgres style as the default, as it could well be that MySQL is the exception.
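
Something along these lines is what I have in mind, inside DoctrineQueue; note that the names here ($popStrategy, the constants, popViaLockingSelect() / popViaUpdateAllocation()) are purely illustrative and not code from this PR:

    // Illustrative sketch only: a strategy property that the factory sets based on the platform.
    public const POP_STRATEGY_POSTGRES = 'postgres'; // SELECT ... FOR UPDATE style
    public const POP_STRATEGY_MYSQL = 'mysql';       // UPDATE-then-SELECT allocation style

    /** @var string */
    protected $popStrategy = self::POP_STRATEGY_POSTGRES;

    public function setPopStrategy(string $popStrategy): void
    {
        $this->popStrategy = $popStrategy;
    }

    public function pop(array $options = []): ?JobInterface
    {
        // First run garbage collection, as before
        $this->purge();

        if ($this->popStrategy === self::POP_STRATEGY_MYSQL) {
            return $this->popViaUpdateAllocation($options);
        }

        return $this->popViaLockingSelect($options);
    }

The factory would then call setPopStrategy() only when getDatabasePlatform()->getName() reports MySQL, and the test suite could flip the property to exercise both code paths.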

Contributor Author

We have another "if postgres" check in push()... how do we want to handle this?

What other Doctrine-supported engines are there? SQLite doesn't support "SELECT ... FOR UPDATE", so we'd have to use the MySQL pop there too. I don't know if MSSQL, ... correctly support this.

From my perspective the MySQL variant is the most portable, because it uses just simple UPDATEs and SELECTs without any locking or transactions. PostgreSQL has an optimization which we can't assume for all other DB engines.
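
To make concrete what I mean, the MySQL variant is roughly the following three statements; this is only a sketch with plain PDO and invented column / status names, and the actual diff may differ in details:

    // Rough sketch of the UPDATE-then-SELECT allocation; no explicit locking or transaction.
    function popMysqlStyle(\PDO $pdo, string $workerId): ?array
    {
        // 1. Atomically claim at most one pending job for this worker.
        $pdo->prepare(
            "UPDATE queue SET status = 'allocating', workerId = :worker
              WHERE status = 'pending' ORDER BY id ASC LIMIT 1"
        )->execute(['worker' => $workerId]);

        // 2. Fetch whatever we claimed (zero or one row).
        $stmt = $pdo->prepare(
            "SELECT * FROM queue WHERE workerId = :worker AND status = 'allocating' LIMIT 1"
        );
        $stmt->execute(['worker' => $workerId]);
        $job = $stmt->fetch(\PDO::FETCH_ASSOC);

        if ($job === false) {
            return null; // nothing was pending
        }

        // 3. Mark the claimed job as running before executing it.
        $pdo->prepare("UPDATE queue SET status = 'running' WHERE id = :id")
            ->execute(['id' => $job['id']]);

        return $job;
    }

Each statement is atomic on its own, so any engine that supports UPDATE ... LIMIT should be able to run this.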

What do you think?

Collaborator

@MatthiasKuehneEllerhold Agreed! I was not sure about other platforms, but you have a convincing argument here :). So: POSTGRES is indeed the special case.

My reasoning for having the config variable is that we can run PDOSqlite in our test suite using both polling strategies. But we could also try to run an actual Postgres database in the test suite; that way we won't need that switch. Ultimately I still think it makes sense to introduce a config variable for the polling strategy, as that is the easiest path to test this.

The other check can remain the same as far as I'm concerned and check explicitly for PostgreSQL, the reasoning being that without that code it won't work on Postgres.

@roelvanduijnhoven
Collaborator

I'll continue the conversation in this PR. I gave it a thorough review. Let me know if I can assist with anything :).

@MatthiasKuehneEllerhold
Contributor Author

> A job is allocated for a fixed time period. And after that period it is reset to the running state. Our queue could run a query once at start to reset those cases. That would require an additional column with the allocatedAt timestamp.

We're doing this in our implementation with a field named "status_timeout": if a job has been in status "ALLOCATING" or "RUNNING" for more than x minutes, the job is automatically restarted. I didn't port it to this PR because it would mean another field in the table and another check that has to run regularly.
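
For illustration, the restart check is basically one query that has to run regularly, something like this (assuming a \PDO connection in $pdo; the column and status names are just examples, not our exact schema):

    // Re-queue jobs that have been stuck in ALLOCATING / RUNNING past their timeout.
    $pdo->prepare(
        "UPDATE queue
            SET status = 'pending', workerId = NULL
          WHERE status IN ('allocating', 'running')
            AND status_timeout < NOW()"
    )->execute();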

@MatthiasKuehneEllerhold
Contributor Author

Added allocationTimeout.

Still open:

  • DB migration (adding fields workerId and allocationTimeout).

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold I think the migration can simply be stated in the changelog, and we can bump this library by a major version. A migration as part of this code base is hard, because it really depends on what migration strategy you use and whether the SQL user has enough rights to alter tables.

We should however update some files in the repo: https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/data/queue_default.sql and https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/tests/Asset/queue_default.sqlite for example. And we should think about whether a new index is required or needs to be set.

Do you have other questions at this point @MatthiasKuehneEllerhold?

Apart from that I think we need to update / fix the test suite at the very least. And maybe look at testing the new functionality :).

@MatthiasKuehneEllerhold
Contributor Author

I'm currently pretty busy, but I hope I can work on this next week.

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold No problem of course :). Let me know if I can assist in any way.

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold Just a kind reminder: do you still intend to finish this PR? If not, we can see if somebody else is willing to step in, as your work seems promising!

@MatthiasKuehneEllerhold
Contributor Author

I'm really sorry to leave you hanging like this, but things have gotten kind of crazy here and I won't be able to contribute until January at the earliest. If anyone wants to jump in and help out, I'd be glad!
Again: really sorry for this!

@roelvanduijnhoven
Collaborator

Haha, take care @MatthiasKuehneEllerhold. No worries. This PR will be here waiting if you ever feel like finishing it ;).

@MatthiasKuehneEllerhold
Contributor Author

MatthiasKuehneEllerhold commented May 6, 2021

I'm terribly sorry for abandoning you here, but it was a crazy time. I've had a little time to properly review our logs and background queue, and I'm terribly saddened to see that this implementation does not fix the problem I've outlined.
Although the UPDATE locks fewer rows than the SELECT FOR UPDATE did previously, it still locks too many rows. We're still getting the "deadlock while trying to get lock" exception (about ~10 times per day in a queue with ~1200 jobs per day and 12 workers). I've seen clashes between the UPDATE and INSERTs, and between the UPDATE and other UPDATEs that just modify one job. For lack of a proper fix we implemented a lot of code to remedy the fact that sometimes jobs go missing, get stuck, or ...?

This problem does not seem to be solvable with MariaDB alone. I've now had proper time to analyse it and what we can do. What's problematic is SELECT FOR UPDATE locking the whole table, and blanket UPDATEs locking all possible matches.
If we could use a normal SELECT and then UPDATE a single job, that would work. But if 2 workers simultaneously SELECT the same job and try to fetch it, we get multiple executions of the same job! Not what we want! But this way we won't have the locked table / rows in the DB.

So what we've come up with is this: we're using Redis for an exclusive lock on this code path. It boils down to:

while (true) {
    $lock = acquireLock();
    if ($lock) {
        // Only the lock holder may run the SELECT + UPDATE pair.
        $job = $db->query('SELECT job');
        if ($job) {
            $db->query('UPDATE job SET status = running WHERE id = ' . $job);
        }

        releaseLock($lock);
        return $job;
    }

    // Someone else holds the lock: back off and try again.
    sleep(1);
}

So if multiple workers want to pop a new job from the queue, they can do so... in single file! You can only get to the SELECT + UPDATE pair if you hold the lock, and Redis makes sure that only one worker can hold this lock at a time.

This way every INSERT or UPDATE on existing jobs is not impacted. So if the frontend adds a new job it doesn't have to wait, because there is no table locking. Only workers may have to wait a bit, but that's OK.

For more information on exclusive locks in redis you can read the chapter here: https://redislabs.com/ebook/part-2-core-concepts/chapter-6-application-components-in-redis/6-2-distributed-locking/
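
In case it helps, acquireLock() / releaseLock() can be built on a single Redis key. A minimal sketch using the phpredis extension (the key name and TTL are arbitrary choices):

    function acquireLock(\Redis $redis, int $ttl = 30): ?string
    {
        $token = bin2hex(random_bytes(16));

        // SET queue:pop-lock <token> NX EX <ttl>: only succeeds if the key does not exist yet.
        if ($redis->set('queue:pop-lock', $token, ['nx', 'ex' => $ttl])) {
            return $token;
        }

        return null;
    }

    function releaseLock(\Redis $redis, string $token): void
    {
        // Delete the lock only if we still own it (compare-and-delete via a small Lua script).
        $script = 'if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) else return 0 end';
        $redis->eval($script, ['queue:pop-lock', $token], 1);
    }

The TTL makes sure a crashed worker cannot hold the lock forever.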

I'm pretty sure adding Redis exclusive locks is severely out of scope for this library, because for most users it would be easier to just use a different queue provider or database. I'm unsure how PostgreSQL solves this problem, so maybe MariaDB alone is the culprit here and unsuitable for a message queue (a sentiment I've read a lot in the last few days).

@roelvanduijnhoven
Collaborator

@MatthiasKuehneEllerhold Great that you found this out before we merged :). And thanks for your response.

You obviously know more about this than me, but hear me out. What about the following strategy? I feel like it would work just fine with MySQL and utilizes the fact that an UPDATE query is atomic.

  1. Each worker gets a unique ID when it starts working.
  2. You do an UPDATE queue SET leasedTo = [unique id] WHERE leasedTo IS NULL AND status = 1 .. LIMIT 1
  3. The job worker queries all jobs that are leased to it, and only proceeds if exactly one result is found.
  4. We update that job: UPDATE queue SET executed = NOW(), leasedTo = NULL WHERE id = [..]
  5. We execute the job.

There needs to be a mechanism that, after some time, will un-lease the jobs that were leased to a job worker that crashed.
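
Roughly, as a sketch (plain PDO, invented column names, $workerId being the unique ID from step 1; essentially the same UPDATE-then-SELECT allocation as in this PR, plus the un-lease step):

    // 2. Lease at most one free job to this worker; the UPDATE itself is atomic.
    $pdo->prepare(
        "UPDATE queue SET leasedTo = :worker, leasedAt = NOW()
          WHERE leasedTo IS NULL AND status = 1 LIMIT 1"
    )->execute(['worker' => $workerId]);

    // 3. Proceed only if exactly one job is leased to us.
    $stmt = $pdo->prepare("SELECT * FROM queue WHERE leasedTo = :worker");
    $stmt->execute(['worker' => $workerId]);
    $jobs = $stmt->fetchAll(\PDO::FETCH_ASSOC);

    if (count($jobs) === 1) {
        // 4. Hand the lease back before actually running the job.
        $pdo->prepare("UPDATE queue SET executed = NOW(), leasedTo = NULL WHERE id = :id")
            ->execute(['id' => $jobs[0]['id']]);

        // 5. Execute the job ...
    }

    // Separately, un-lease jobs whose worker crashed, e.g. once at queue start-up.
    $pdo->exec("UPDATE queue SET leasedTo = NULL
                 WHERE leasedTo IS NOT NULL AND leasedAt < NOW() - INTERVAL 5 MINUTE");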

Successfully merging this pull request may close these issues.

Deadlocks when fetching messages from queue