dispatch_sync vs. dispatch_async on main queue

Bear with me, this is going to take some explaining. I have a function that looks like the one below.

Context: "aProject" is a Core Data entity named LPProject with an array named 'memberFiles' that contains instances of another Core Data entity called LPFile. Each LPFile represents a file on disk and what we want to do is open each of those files and parse its text, looking for @import statements that point to OTHER files. If we find @import statements, we want to locate the file they point to and then 'link' that file to this one by adding a relationship to the core data entity that represents the first file. Since all of that can take some time on large files, we'll do it off the main thread using GCD.

- (void) establishImportLinksForFilesInProject:(LPProject *)aProject {
    dispatch_queue_t taskQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
     for (LPFile *fileToCheck in aProject.memberFiles) {
         if (//Some condition is met) {
            dispatch_async(taskQ, ^{
                // Here, we do the scanning for @import statements. 
                // When we find a valid one, we put the whole path to the imported file into an array called 'verifiedImports'. 

                // go back to the main thread and update the model (Core Data is not thread-safe.)
                dispatch_sync(dispatch_get_main_queue(), ^{

                    NSLog(@"Got to main thread.");

                    for (NSString *import in verifiedImports) {  
                            // Add the relationship to Core Data LPFile entity.
                    }
                });//end block
            });//end block
        }
    }
}

Now, here's where things get weird:

This code works, but I'm seeing an odd problem. If I run it on an LPProject that has a few files (about 20), it runs perfectly. However, if I run it on an LPProject that has more files (say, 60-70), it does NOT run correctly. We never get back to the main thread, the NSLog(@"got to main thread"); never appears and the app hangs. BUT, (and this is where things get REALLY weird) --- if I run the code on the small project FIRST and THEN run it on the large project, everything works perfectly. It's ONLY when I run the code on the large project first that the trouble shows up.

And here's the kicker, if I change the second dispatch line to this:

dispatch_async(dispatch_get_main_queue(), ^{

(That is, use async instead of sync to dispatch the block to the main queue), everything works all the time. Perfectly. Regardless of the number of files in a project!

I'm at a loss to explain this behavior. Any help or tips on what to test next would be appreciated.

This is a common issue related to disk I/O and GCD. Basically, GCD is probably spawning one thread for each file, and at a certain point you've got too many threads for the system to service in a reasonable amount of time.

Every time you call dispatch_async() and in that block you attempt to to any I/O (for example, it looks like you're reading some files here), it's likely that the thread in which that block of code is executing will block (get paused by the OS) while it waits for the data to be read from the filesystem. The way GCD works is such that when it sees that one of its worker threads is blocked on I/O and you're still asking it to do more work concurrently, it'll just spawn a new worker thread. Thus if you try to open 50 files on a concurrent queue, it's likely that you'll end up causing GCD to spawn ~50 threads.

This is too many threads for the system to meaningfully service, and you end up starving your main thread for CPU.

The way to fix this is to use a serial queue instead of a concurrent queue to do your file-based operations. It's easy to do. You'll want to create a serial queue and store it as an ivar in your object so you don't end up creating multiple serial queues. So remove this call:

dispatch_queue_t taskQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

Add this in your init method:

taskQ = dispatch_queue_create("com.yourcompany.yourMeaningfulLabel", DISPATCH_QUEUE_SERIAL);

Add this in your dealloc method:

dispatch_release(taskQ);

And add this as an ivar in your class declaration:

dispatch_queue_t taskQ;

I believe Ryan is on the right path: there are simply too many threads being spawned when a project has 1,500 files (the amount I decided to test with.)

So, I refactored the code above to work like this:

- (void) establishImportLinksForFilesInProject:(LPProject *)aProject
{
        dispatch_queue_t taskQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

     dispatch_async(taskQ, 
     ^{

     // Create a new Core Data Context on this thread using the same persistent data store    
     // as the main thread. Pass the objectID of aProject to access the managedObject
     // for that project on this thread's context:

     NSManagedObjectID *projectID = [aProject objectID];

     for (LPFile *fileToCheck in [backgroundContext objectWithID:projectID] memberFiles])
     {
        if (//Some condition is met)
        {
                // Here, we do the scanning for @import statements. 
                // When we find a valid one, we put the whole path to the 
                // imported file into an array called 'verifiedImports'. 

                // Pass this ID to main thread in dispatch call below to access the same
                // file in the main thread's context
                NSManagedObjectID *fileID = [fileToCheck objectID];


                // go back to the main thread and update the model 
                // (Core Data is not thread-safe.)
                dispatch_async(dispatch_get_main_queue(), 
                ^{
                    for (NSString *import in verifiedImports)
                    {  
                       LPFile *targetFile = [mainContext objectWithID:fileID];
                       // Add the relationship to targetFile. 
                    }
                 });//end block
         }
    }
    // Easy way to tell when we're done processing all files.
    // Could add a dispatch_async(main_queue) call here to do something like UI updates, etc

    });//end block
    }

So, basically, we're now spawning one thread that reads all the files instead of one-thread-per-file. Also, it turns out that calling dispatch_async() on the main_queue is the correct approach: the worker thread will dispatch that block to the main thread and NOT wait for it to return before proceeding to scan the next file.

This implementation essentially sets up a "serial" queue as Ryan suggested (the for loop is the serial part of it), but with one advantage: when the for loop ends, we're done processing all the files and we can just stick a dispatch_async(main_queue) block there to do whatever we want. It's a very nice way to tell when the concurrent processing task is finished and that didn't exist in my old version.

The disadvantage here is that it's a bit more complicated to work with Core Data on multiple threads. But this approach seems to be bulletproof for projects with 5,000 files (which is the highest I've tested.)

Are there any disadvantages to using a 4096-bit encrypted SSL certificate?

Where's the IE7/8/9/10-emulator in IE11 dev tools? [duplicate]

Is std::memcpy between different trivially copyable types undefined behavior?

What do * (single star) and / (slash) do as independent parameters? [duplicate]

Spring beans redefinition in unit test environment

Finding undocumented APIs in Windows

Difference between REMOTE_HOST and REMOTE_ADDR

Is there any way how to install just ADB without whole SDK?

what is the performance impact of using int64_t instead of int32_t on 32-bit systems?

How do I specify DataContext (ViewModel) type to get design-time binding checking in XAML editor without creating a ViewModel object? [duplicate]

Difference between SAM template and Cloudformation template

How to do a data.table merge operation