-
Notifications
You must be signed in to change notification settings - Fork 31
Add a second implementation for pop() for MySQL / MariaDB #161
Add a second implementation for pop() for MySQL / MariaDB #161
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really impressed with this nice implementation :).
I think there is one case that could very well happen and is not accounted for. But am not sure yet how we could handle this cleanly.
The case is: a job gets allocated, but never run. This can happen because the job worker crashed halfway during this process. Now there is a job with status allocating, and a worker id assigned that will never run again.
Some ideas:
- Run the complete thing in a transaction. But.. that could possibly introduce the locking we don't want. Not sure if this will work.
- A job is allocated for a fixed time period. And after that period it is reset to the running state. Our queue could run a query once at start to reset those cases. That would require an additional column with the allocatedAt timestamp.
Maybe there are better ideas!
Also I think we would like to test this strategy. It would be great if we could run the test-suite for both styles. Not sure how easy that is. Did you look into this maybe already?
Anyhow: nais work :).
@@ -123,8 +135,39 @@ public function pop(array $options = []): ?JobInterface | |||
// First run garbage collection | |||
$this->purge(); | |||
|
|||
$conn = $this->connection; | |||
$conn->beginTransaction(); | |||
if (self::DATABASE_PLATFORM_POSTGRES == $this->connection->getDatabasePlatform()->getName()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can test more easily if we make a property on this class that determines the pop strategy. And obviously the factory of this queue will only set the MySQL-style pop if MySQL is used.
Furthermore: I think it makes more sense to only enable the mysql pop style for MySQL, and the postgres style as default. As it could well be that MySQL is the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have another "if postgres" check in push()... how do we want to handle this?
What other Doctrine-supported engines are there? SQLite doesnt support "SELECT ... FOR UPDATE", so we'd have to use the MySQL-pop too. Dont know if MSSQL, ... correctly support this.
From my perspective the mysql variant is the most portable because it uses just simple UPDATE and SELECTs without any locking or transactions. PostgreSQL has an optimization which we cant assume for all other db engines.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MatthiasKuehneEllerhold Agreed! Was not sure on other platforms, but you have a convincing argument here :). So: POSTGRES is indeed the special case.
My reasoning for having the config variable is that we can run PDOSqlite in our test suite using both polling strategies. But we could also try to run an actual Postgres database in the test suite, that way we won't need that switch. Ultimately i still think it makes sense to introduce a config variable for the polling strategy, as that is the most easy path to test this.
The other check could remain the same as far as I'm concerned and check explicitly on PostgreSQL. Reasoning being that without that code it won't work on Postgres.
I'll continue the conversation in this PR. Gave a thorough review. Let me know if I can assist in something :). |
We're doing this in our implementation with a field named "status_timeout". If a job is in status "ALLOCATING" or "RUNNING" for more than x minutes, then this job is automatically restarted. I didnt port it to this PR because this would mean another field in the table and another check regularly. |
Added allocationTimeout. Still open:
|
@MatthiasKuehneEllerhold I think the migration can simply be stated in the changelog. And that we can bump this library a major version. A migration as part of this code-base is hard because it really depends on what migration strategy you use, and whether this SQL user has enough SQL rights to alter tables. We should however update somes files in the repo: https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/data/queue_default.sql and https://github.com/JouwWeb/SlmQueueDoctrine/blob/master/tests/Asset/queue_default.sqlite for example. And we should think whether a new index is required, or needs to be set. Do you have other questions at this point @MatthiasKuehneEllerhold? Apart from that I think we need to update / fix the test suite at the very least. And maybe look at testing the new functionality :). |
Im currently pretty busy, but I hope that I can work on this next week. |
@MatthiasKuehneEllerhold No problem off course :). Let me know if I can assist in any way. |
@MatthiasKuehneEllerhold Just a kindly reminder. Do you still care to finish this PR? It not we can look if somebody else is willing to step in, as your work seems promising! |
Im really sorry to leave you hanging like this, but things have gotten kinda crazy here and I wont be able to contribute at least until january. If anyone wants to jump in and help out, I'd be glad! |
Haha take care @MatthiasKuehneEllerhold. No worries. This PR will be here waiting if you ever like to finish to ;). |
Im terribly sorry for abandoning you here but it was a crazy time. Ive had a little bit of time to properly review our logs und BG-Queue and Im terribly saddened to see that this implementation does not fix the problem Ive outlined. This problem does not seem to be solvable with MariaDB alone. Ive had proper time now to analyse it and what we can do. Whats problematic is SELECT FOR UPDATE locking the whole table and blanket UPDATEs locking all possible matches. So what we've come up with is this: we're using redis for an exclusive lock on this code path. It boils down to:
So if multiple workers want to pop a now Job from the queue they can do so... in single file! You can only get to the SELECT+UPDATE pair if you have the lock and redis makes sure, that only one worker can have this lock at the same time. This way every INSERT or UPDATE on existing Jobs is not impacted. So the if the frontend adds a new job it doesnt have to wait, because there is no table locking. Only workers may have to wait a bit, but thats ok. For more information on exclusive locks in redis you can read the chapter here: https://redislabs.com/ebook/part-2-core-concepts/chapter-6-application-components-in-redis/6-2-distributed-locking/ Im pretty sure adding redis exclusive locks is severly out of scope for this library because for most it would be easier to just use a different Queue-Provider or database. Im unsure how PostgreSQL solves this problem, so maybe just MariaDB is the culprit here and unsuitable for a message queue (a sentiment I read a lot in the last few days). |
@MatthiasKuehneEllerhold Great that you find out before we merged this :). And thanks for you response. You obviously know more than me, but hear me out. What about the following strategy. I feel like that would work just fine with MySQL and utilizes the fact that an update query is atomic.
There needs to be a mechanism that will unlease the jobs that where leased to a job worker that crashed, after some time. |
Should fix #160