
[runtime] Thread safety for weak references #1454

Merged

merged 2 commits into apple:master from weakref-threadsafety on May 4, 2016

Conversation

glessard
Contributor

It has been fairly easy to cause the runtime to crash on multithreaded accesses to weak references (e.g. SR-192). Although weak references are value types, they can get elevated to the heap in multiple ways, such as closure capture and use as a property of a class instance. In such cases, race conditions involving weak references could cause the runtime to perform multiple decrement operations of the unowned reference count for a single increment; this eventually caused early deallocation, leading to use-after-free, modify-after-free and double-free errors.

This commit changes the weak reference operations to use atomic operations rather than thread-local logic. In particular, when the weak reference needs to be nulled, this is not done unconditionally but via an atomic_compare_exchange operation, with the release performed only on success in order to avoid duplicated decrement operations. The assign operations assume the destination may be visible to multiple threads; the init operations assume the destination is local to the current thread. In all cases it is assumed that the source may be visible to multiple threads.

With this change, the crasher discussed in SR-192 no longer encounters modify-after-free or double-free crashes.
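The CAS-based nulling described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the runtime's actual code: `Object`, `WeakReference`, and `weakClearAndRelease` are stand-in names, and the real runtime's reference-count layout is more involved.

```cpp
#include <atomic>

// Simplified stand-ins for the runtime's heap object and weak reference.
// Names and layout here are illustrative only.
struct Object {
  std::atomic<int> unownedRefCount{1};
};

struct WeakReference {
  std::atomic<Object*> Value{nullptr};
};

// Attempt to null out the weak reference. Only the thread whose
// compare-exchange succeeds performs the unowned release, so two racing
// readers can no longer both decrement for a single increment.
inline bool weakClearAndRelease(WeakReference* ref, Object* expected) {
  if (ref->Value.compare_exchange_strong(expected, nullptr)) {
    expected->unownedRefCount.fetch_sub(1, std::memory_order_relaxed);
    return true;   // this thread won the race and performed the release
  }
  return false;    // another thread already nulled the reference
}
```

Whichever thread loses the compare-exchange simply observes the already-nulled reference and performs no release.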

```swift
let iterations = 200_000
for i in 1...iterations {
  let box = WeakBox(Thing())
  dispatch_async(q) {
```
Collaborator

We have race test infrastructure (stdlib/private/StdlibUnittest/RaceTest.swift) that makes it possible to write such tests in a cross-platform way. Could you use that instead of dispatch?

@lattner
Collaborator

lattner commented Feb 26, 2016

@rjmccall ?

@glessard
Contributor Author

@rjmccall What are the restrictions on the uses of the Take calls? (The issues I observed seemed to be in swift_weakLoadStrong and swift_weakCopyAssign, and I expanded the logic to the rest of the swift_weak calls in HeapObject.cpp.)

Why can you assume that assignment has exclusive access? Aren't those calls used to assign to a pre-existing var? One couldn't assume exclusive access of a var property in a class instance, for example.

@glessard
Contributor Author

The problem is due to not knowing the previous state of Value when storing to it. If two threads enter two of these calls simultaneously with a pointer to the same not-yet-nulled reference (to an object with no strong retains), they might both get past object->refCount.isDeallocating(), then both overwrite Value with null, and then they'll both call unownedRelease. One of them comes second and is making a wrong assumption. This is averted by using the exchange operation; the nulling operation is easy to catch both because of the consequence (modify-after-free), and also because more time passes between the read and the write in the original version. Other cases wouldn't lead to such obvious failures, but there could be memory leaks if Value changes between any paired read and write.

@glessard
Contributor Author

The only unknownWeak operations I can see are the non-ObjC stubs that directly forward to the HeapObject.cpp definitions from HeapObject.h. Is there another place to look? Thanks

@rjmccall
Member

In general, Swift does not guarantee correctness in the presence of read/write, write/write, or anything/destroy data races on a variable. It is one of the few undefined behavior rules that we embrace. This applies to weak references just as much as it applies to strong references.

Eventually, we want to provide a language concurrency model that makes races impossible outside of explicitly unsafe code. For now, it's the user's responsibility to ensure that these do not occur by properly synchronizing modifications of variables.

That's not to say that you aren't fixing a real bug. It's not supposed to be a race condition for two threads to simultaneously read from a weak reference, but right now it is because of the way that reads can clear the reference when they detect that the object is undergoing deallocation. That's something that needs to be fixed.

@rjmccall
Member

It's not easy to fix this particular race, however, because it isn't safe for other threads to access the referent at all if a thread is decrementing the unowned reference count, but other readers will naturally read the reference and try to increment the reference count of the referent. We basically need to have some way to tell other threads that there are reads in progress. One idea floated for fixing that is to use an activity count; please refer to the discussion thread with Mike Ash on swift-dev starting on 12/10. I'm not sure what the current status of his fix is, though.

```swift
    _ = box.weak
  }
  dispatch_async(q) {
    _ = box.weak
```
Collaborator

You might also want to use _blackHole() from StdlibUnittest to ensure that the compiler does not eliminate the unused weak load.

@rjmccall
Member

The implementation of unknownWeak that I'm referring to is the non-trivial implementation in SwiftObject.mm. This is only an issue for platforms with ObjC compatibility requirements.

@glessard
Contributor Author

I see; I forgot to search the *.mm files. (I always do.)

@glessard
Contributor Author

I had thought about something like the activity count, but assumed the WeakReference struct's ABI was set in stone. Is it changeable? The CAS solution as I implemented it is not perfect, but it's better: the window of opportunity for badness is much smaller, without being too onerous.

@jckarter
Member

The ABI is not set in stone yet.

@glessard
Contributor Author

It's probably worth doing, then.

@glessard glessard changed the title [runtime] Thread safety for weak references WIP [runtime] Thread safety for weak references Feb 29, 2016
@glessard
Contributor Author

Here's another attempt; it's a rather simple spinlock approach. Thinking through options that expand the WeakReference struct, I can't find a way to implement it as a lock-free algorithm. If the read-read race is to be resolved with a read-write lock (which is what Mike Ash's activity count suggestion looks like to me), I'm not convinced the extra complexity over a spinlock is warranted (please advise).

This being said, the tests that I called "direct capture" now complete without crashing, but the "class instance property" ones do not. When this occurs the crashed thread has attempted to dereference the spinlock's illegal pointer value, but the call stack does not contain any of the modified functions; the shared weak reference must be getting treated as thread-local in error, but I don't know where at the moment.

@jckarter
Member

We need a lock that can yield to the scheduler, since iOS's scheduler doesn't guarantee preemption. If a background thread takes the lock, a spinlock can deadlock any higher-priority threads that contend for it.

@glessard
Contributor Author

Agreed; however, I'm unsure of the appropriate way to yield (I hope there is one.)
A lock-free method would be much better, but with this interplay of non-adjacent variables I'm not seeing what that would be.

@glessard
Contributor Author

If the strong and weak counts could both be CAS'd at once, refCount.isDeallocating() could be replaced by a call that tries to increment the weak count but fails if the deallocating flag is true (retainUnlessDeallocating?). This would solve the case where an unownedRetain loses a race with an unownedRelease. The other race is the one to null the weak reference and avoid a double-release; that one is easily solved with a CAS on WeakReference.Value.

I think this combination would successfully be lock-free.
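The `retainUnlessDeallocating` idea proposed above can be sketched as a CAS loop over a word that packs the count together with a deallocating flag. This is an illustration of the proposal only; the flag position, field names, and packing are hypothetical, not the runtime's actual layout.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical packing: high bit marks deallocation in progress, the
// remaining bits hold the unowned reference count.
constexpr uint32_t DeallocatingFlag = 1u << 31;

// Try to increment the count, but fail atomically once deallocation has
// begun. The flag and the count live in one word, so the check and the
// increment cannot be separated by a racing thread.
inline bool unownedRetainUnlessDeallocating(std::atomic<uint32_t>& bits) {
  uint32_t old = bits.load(std::memory_order_relaxed);
  do {
    if (old & DeallocatingFlag)
      return false;  // deallocation already started; refuse the retain
    // compare_exchange_weak reloads `old` on failure, so the flag is
    // re-checked on every iteration.
  } while (!bits.compare_exchange_weak(old, old + 1,
                                       std::memory_order_relaxed));
  return true;
}
```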

@jckarter
Member

jckarter commented Mar 1, 2016

We might be able to pull that off on 64-bit platforms, yeah. On 32-bit, the object header refcounts are only 32-bit aligned, and I don't think either i386 or armv7 can do an unaligned 64-bit atomic cmpxchg. We could align the refcounts, at the cost of four bytes per object, maybe.

@rjmccall
Member

rjmccall commented Mar 1, 2016

I don't understand how a CAS on WeakReference.Value can eliminate the nulling-out race. Could you spell that one out? There's a tempting solution that looks like that, but it is not actually correct, because simply having loaded the reference does not guarantee its validity. You basically need transactional memory for anything like that to work.

Other than this weak-reference-nulling issue, there is no way to validly get a race between an unowned-retain and the final unowned-release that triggers deallocation. You cannot perform an unowned retain without some reason to think that the referent is valid. For example, unowned reference variables maintain an invariant of owning an unowned-retain of their referent. As long as that invariant holds, it's safe to load/copy from the variable because the unowned reference count is always at least 1. Anything that would race with the load/copy would have to be a change to the value of the variable (or destroying it), or in other words, a store; and a load/store race is undefined behavior.

@glessard
Contributor Author

glessard commented Mar 1, 2016

You're right, in both cases I forgot that the pointer copied from Value becomes untrustworthy on the very next clock cycle if there's a race. So the lock-free thoughts go out the window.

As for racing between a retain and a release: I was thinking of a situation that starts with one weak reference, visible to 2 threads; strong count is 1. Thread A gets false from isDeallocating while attempting to copy the reference, then gets preempted; then thread B sets the deallocating flag on the strong count, then thread C (working with the same weak reference as A) gets true from isDeallocating. At that point an unownedRelease could happen before the unownedRetain. Depending on the length of the preemption on A, it could end up with a use-after-free. This being said, this story involves trusting the pointer from Value for too long.

@glessard glessard force-pushed the weakref-threadsafety branch 2 times, most recently from 440108b to 9a42cec Compare March 11, 2016 03:36
@glessard glessard changed the title WIP [runtime] Thread safety for weak references [runtime] Thread safety for weak references Mar 11, 2016
@glessard
Contributor Author

New version: it uses 2 bits from WeakReference.Value to avoid races. One allows the unknownWeak functions to call through without having to test the reference (which led to some crashes), while the other acts as a spinlock for weakLoadStrong and weakCopyInit. sched_yield is invoked after a small amount of spinning in the event of contention.
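The yielding spinlock described here can be sketched roughly as follows. This is a simplified, hypothetical version: the bit value, the spin limit, and the function names are illustrative, and the real patch also manages a second bit for the unknownWeak path.

```cpp
#include <atomic>
#include <cstdint>
#include <sched.h>  // POSIX sched_yield

// Steal the low bit of the pointer word as a "reading" lock bit.
constexpr uintptr_t WR_READING = 1;
constexpr int SpinLimit = 100;  // arbitrary small spin budget

// Acquire the lock bit and return the clean pointer bits. After a short
// spin, yield to the scheduler so a lower-priority holder of the bit can
// make progress (a plain spin could starve it on iOS).
inline uintptr_t weakLock(std::atomic<uintptr_t>& value) {
  uintptr_t ptr = value.fetch_or(WR_READING, std::memory_order_acquire);
  int spins = 0;
  while (ptr & WR_READING) {
    if (++spins > SpinLimit) { sched_yield(); spins = 0; }
    ptr = value.fetch_or(WR_READING, std::memory_order_acquire);
  }
  return ptr;  // pointer bits, with the lock now held by this thread
}

// Release by storing the pointer back, which clears WR_READING.
inline void weakUnlock(std::atomic<uintptr_t>& value, uintptr_t ptr) {
  value.store(ptr, std::memory_order_release);
}
```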

@jckarter
Member

@gparker42 Is sched_yield sufficient to prevent deadlocking on iOS?

@gparker42
Contributor

Maybe. I don't know what the kernel folks can promise, but I have been unable to make simple tests of QOS_CLASS_DEFAULT versus QOS_CLASS_BACKGROUND deadlock when there is a sched_yield in the loop (whereas similar tests do hang with a naive spinlock).

```cpp
auto ptr = __atomic_fetch_or(&ref->Value, WR_READING, __ATOMIC_RELAXED);
while (ptr & WR_READING) {
  short c = 0;
  while (ref->Value & WR_READING) {
```
Member

This is not guaranteed to actually reload the data.

Contributor Author

The intent is to not hit the cache line too often. See line 730: an atomic_fetch occurs after line 724 returned true. If it's preferable to just repeatedly call atomic_fetch, I can change it.

Member

You need to at least do something to prevent the compiler from folding the repeated loads into a single one.

Contributor Author

My (weak) circumstantial evidence was that the benchmarked spin time scales with the counter limit at line 725, though I haven't looked at the generated assembly. Given that this depends on the current abilities of the optimizer and we agree that it isn't a high contention case, would you think it's justified to simply do atomic_fetch repeatedly? Otherwise I'm not entirely sure what would ensure repeated loads.

Member

Repeated atomic_fetch would be fine. You're doing a relaxed load, so this really should just compile down to an ordinary load; it's just that the compiler will not be able to combine the loads.
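The point being made here can be shown with a small sketch, assuming a `std::atomic` word rather than the patch's `__atomic` builtins (the names are illustrative): each iteration of the wait loop performs a fresh relaxed atomic load, which the compiler may not hoist out of the loop the way it could a plain non-atomic read, yet on mainstream hardware it still compiles to an ordinary load.

```cpp
#include <atomic>
#include <cstdint>

constexpr uintptr_t WR_READING = 1;  // illustrative lock bit

// Spin until the lock bit is clear. Because the load is atomic (even with
// relaxed ordering), the compiler must re-read the value on every
// iteration instead of folding the loads into one.
inline void spinUntilClear(const std::atomic<uintptr_t>& value) {
  while (value.load(std::memory_order_relaxed) & WR_READING) {
    // spin; a real implementation would yield after a spin budget
  }
}
```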

@rjmccall
Member

rjmccall commented May 2, 2016

@swift-ci Please test.

@rjmccall
Member

rjmccall commented May 3, 2016

That test failure isn't this patch's fault.

@glessard
Contributor Author

glessard commented May 3, 2016

should I squash the later changes?

@rjmccall
Member

rjmccall commented May 3, 2016

If you wouldn't mind.


```cpp
  WR_NATIVEMASK = WR_NATIVE | swift::heap_object_abi::ObjCReservedBitsMask,
};
```

Member

Please add a static_assert that WR_READING doesn't interfere with normal pointer bits, i.e. that WR_READING < alignof(void*).
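The requested check might look like the following sketch (the enum and its value are illustrative, not the patch's exact definitions): any bit stolen from a pointer word must fit below the pointer's alignment, so it never collides with real address bits.

```cpp
#include <cstdint>

// Illustrative flag bits stored in the low bits of WeakReference.Value.
enum : uintptr_t {
  WR_READING = 1,
};

// Compile-time guarantee that the flag lives in the alignment padding of
// a pointer and cannot interfere with normal pointer bits.
static_assert(WR_READING < alignof(void*),
              "WR_READING must fit in the spare low bits of a pointer");
```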

@rjmccall
Member

rjmccall commented May 3, 2016

Sorry about the long delay reviewing this, but with that static_assert and the squash, this should be good for 3.0.

It has been fairly easy to cause the runtime to crash on multithreaded read-read access to weak references (e.g. https://bugs.swift.org/browse/SR-192). Although weak references are value types, they can get elevated to the heap in multiple ways, such as when captured by a closure or when used as a property in a class instance. In such cases, race conditions involving weak references could cause the runtime to perform multiple decrement operations of the unowned reference count for a single increment; this eventually causes early deallocation, leading to use-after-free, modify-after-free and double-free errors.

This commit changes the weak reference operations to use a spinlock rather than assuming thread-exclusive access, when appropriate.
With this change, the crasher discussed in SR-192 no longer encounters crashes due to modify-after-free or double-free errors.
Dispatch-based tests exist because (on OS X) they are more likely to encounter the race condition than `StdlibUnitTest`'s `runRaceTest()` is.
@glessard
Contributor Author

glessard commented May 4, 2016

Added the static_assert and squashed.
The test files are in a new directory, test/Runtime. I put them there because I couldn't find a spot; is there a better one?
I also tried to make it happen with an added field to the WeakReference struct, but wasn't successful at that. It would conceivably be less onerous in the native case, and it would support the far-fetched case of two compatibility modes for a given platform -- alas.

@rjmccall
Member

rjmccall commented May 4, 2016

I do think we should bump weak references out to occupy two words for ABI stability, but we don't need to do that in this patch. Committing.

@rjmccall rjmccall merged commit 9d2dfc0 into apple:master May 4, 2016
@rjmccall
Member

rjmccall commented May 4, 2016

Thanks for taking care of this!

@glessard
Contributor Author

glessard commented May 4, 2016

Thanks!

@jckarter
Member

jckarter commented May 5, 2016

Thanks for taking this on @glessard, great to see this fixed!

@glessard
Contributor Author

glessard commented May 5, 2016

@jckarter I got lots of help from you guys!
