gh-128679: Fix tracemalloc.stop() race condition#128710
gh-128679: Fix tracemalloc.stop() race condition#128710vstinner wants to merge 11 commits intopython:mainfrom
Conversation
Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during PyGILState_Ensure() call.
|
Should we include a test based off the reproducer in the issue? |
* Hold the table lock while calling _PyTraceMalloc_Stop(). * Add tracemalloc_raw_free().
|
My fix was incomplete, the PyMem_RawFree() hook also had the bug:
Ok, I added a test. It was a good idea, I found PyMem_RawFree() bug with it :-) |
|
@ZeroIntensity: Would you mind to review the PR which now has a test? |
|
Will get to it a little later today :) |
ZeroIntensity
left a comment
There was a problem hiding this comment.
I'm still not too comfortable relying on incidental lock ordering with the tables lock for synchronization here. Everywhere tracemalloc_config.tracing is used the GIL is held, right? We should be able to rely on the GIL for synchronization. Oherwise, let's either use a dedicated lock for it or one of the _Py_atomic APIs.
| { | ||
| PyTraceMalloc_Track(123, 10, 1); | ||
|
|
||
| PyThread_type_lock lock = (PyThread_type_lock)data; |
There was a problem hiding this comment.
It's a lot easier to use a PyMutex, considering you don't have to heap allocate it. I think it works without a thread state if you use _PyMutex_LockFlags(mutex, _Py_LOCK_DONT_DETACH)
| else { | ||
| // gh-128679: tracemalloc.stop() was called by another thread | ||
| } |
There was a problem hiding this comment.
I think we can remove the else:
| else { | |
| // gh-128679: tracemalloc.stop() was called by another thread | |
| } | |
| // gh-128679: If we're not tracing, then tracemalloc.stop() was called by another thread |
|
|
||
| // Lock to synchronize with tracemalloc_raw_free() which checks | ||
| // 'tracing' while holding the lock. | ||
| TABLES_LOCK(); |
There was a problem hiding this comment.
As I said in the other PR, relying on lock ordering like this is really error prone. If we hold the GIL, then we should be able to use tracemalloc_config.tracing without a lock, right?
There was a problem hiding this comment.
As I said in the other PR, relying on lock ordering like this is really error prone.
Would you mind to elaborate?
This PR adds So I decided to reuse TABLES_LOCK() to access
|
Right, so this is only safe because of lock-ordering. This approach with the tables lock only works for this case because we assume the only thread that can modify My other concern is re-entrancy and lock-ordering deadlocks. There are two cases here, depending on what
Either way, it's problematic. |
|
It doesn't release the GIL.
_PyTraceMalloc_Stop() cannot release the GIL. It calls _Py_hashtable_clear() on these tables:
The tricky part is to remove a trace in |
|
Generally, things release the GIL when they're waiting on something, like a lock.
Ok, good to know. I was under the impression that any Anyways, the fix here would be a lot simpler (and less ambiguous) if we went with a dedicated lock for |
There are multiple constraints.
(B) means that My fix for the |
|
|
Are you sure about EDIT: Never mind, found it. |
Well, sort of. It's used for protecting the operations from multithreaded races, not necessarily synchronizing the threads in a certain order. Anyways--the fix works, but only for the specific test. I'd rather dedicated locks or atomics because:
So, I'm willing to approve this as a bandaid for 3.12, but if we go with this let's try to get a better fix in 3.13 and main (where |
| @@ -1317,9 +1353,16 @@ PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr, | |||
|
|
|||
There was a problem hiding this comment.
Let's either lock the read above this or remove it entirely, then LGTM for 3.12.
|
I wrote a different fix based on my refactoring that I just merged: #128893. |
Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during PyGILState_Ensure() call.