Effectively using Dart Isolates - part 1

A deep dive into how Dart Isolates work and when and how to use them correctly. This is not a discussion of the Isolate API; rather, we explore how Isolates work under the hood.

With the release of Dart 2.19 and the introduction of Isolate.run() there has been a lot of community discussion around Isolates, and it's clear that there are still many misconceptions about what isolates are and when to use them.

So rather than this being another article about the minutiae of the isolate API, I'm going to take a deep dive into what isolates are and, perhaps more importantly, when and how to use them correctly.

During my research for this article, I found a performance optimisation that I wasn't aware of. It doesn't appear to be documented and can seriously improve performance when using an isolate. But more on this little gem later.

To ensure we include everyone in this discussion, I'm going to start with a primer on isolates, threads, green threads, cores and processors.

If you can already clearly enunciate what each of these are, feel free to jump to the section on when and how to use isolates.

So let's start from the hardware and work our way up defining each term as we go.

Don't get too hung up on the precise details, it's the broad concepts we are interested in.

Processor

If you crack open your desktop machine you will find a large block of black plastic, usually with a large fan attached: this is the processor. Yes, I know you knew that, but it helps with the description as we go forward.

Most PCs have a single processor, but high-end PCs and many servers may contain multiple processors.

Cores

Embedded within the plastic of each processor are little squares of silicon, each square of silicon is a Core.

A core is the fundamental compute engine that runs your code. It contains a floating point unit (FPU) for doing calculations on floats and doubles, an arithmetic logic unit (ALU) for integer calculations, registers, cache etc.

If a PC has 12 cores then it has 12 silicon squares in the Processor (they may not actually be separate pieces of silicon but describing it this way makes understanding the concept of a core easier).

When we talk about a 'thread' we are usually talking about a 'hardware thread' (as opposed to a green thread - more later).

Each core can run a single thread (hardware thread) at a time.

I'm going to ignore the whole 'hyper-threading' discussion because it won't improve our understanding. Think of it as a 'possible' hardware optimisation and let's move on.


Hardware threads

As mentioned, 'thread' is commonly used as a shortcut for the term 'hardware thread'.

On a Linux computer if you run 'top -H', the second line shows the thread count. On my PC it shows 1470 threads but I have only 8 cores.

Whilst I have 1470 threads running, I only have 576 processes (apps) running so clearly it's common for a process to run more than one thread.

In reality, most of the threads on my PC will be asleep waiting for some event to occur (a time out, a disk read completes, a TCP packet is received). Only when they wake up and the OS puts them in a 'ready to run' state will they be given time on a core.

To start your app, the OS loads your app into memory and creates a thread.

Creating a thread involves allocating some memory for the stack and setting the Core's instruction pointer (IP) to point to your main function.
The instruction pointer essentially indicates which line of code is to be executed when the thread is allocated time on a core. Each time a line of code executes the instruction pointer is updated to point to the next line of code.

In reality, each line of code is compiled into multiple machine instructions (machine code) and the IP points to the next machine instruction to execute.

Once the thread is all set up, the OS marks the thread as ready to run and then places the thread in a queue. As cores become available, the OS takes the next thread from the queue and runs it on a core.

The queue is actually a priority queue. If the priority of a thread is increased it will get scheduled on a core before other threads. This can be dangerous as you can starve lower-priority threads.

The act of removing one thread from a core and placing a new thread from the ready to run queue onto the core is called a 'context switch'. Context switches are expensive and typically take in the order of 5 μs before we take into account the fact that the core's cache is now useless because the data it contains was of interest to the prior thread but not to the new one.

Whilst most of my threads are asleep, it is normal for many threads to be 'ready to run' waiting to get time on a core.

My PC manages to run all 1470 threads by giving each one a 'slice of time'.

The length of a 'time slice' varies based on the OS (Linux 1-6 ms, Windows 20-120 ms) and whether you are running a Desktop or Server version of the OS.

An OS running on a desktop will generally have a shorter time slice to ensure that the desktop is always responsive to the user.

The length of a time slice is the 'maximum' duration a thread will be allowed to run before it is tossed off the core and another thread is scheduled to run on the core.

This is called 'preemptive threading' (as opposed to non-preemptive or cooperative threading).

With preemptive threading, you are not in control. Once your time slice is up, you get turfed off the core and someone else is given a turn and you are placed on the ready to run queue - and then we go around the merry-go-round again.

Green threads

Green threads, non-preemptive threads and cooperative threads are essentially all the same thing.

The term 'cooperative' is probably the most evocative name.

With green/cooperative threads you are in control, you get to run as long as you want and only when you are done do you give up your thread. The process of giving up your thread is called 'yielding'.

Dart actually has a yield keyword which does exactly this, but many other actions (such as a call to await or an IO operation) will also cause your thread to yield.
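We can watch this cooperative hand-off in action. The sketch below (the runWorkers helper and its log list are my own, not part of the article) interleaves two async functions; each await is a yield point where the other function gets a turn:

```dart
import 'dart:async';

Future<List<String>> runWorkers() async {
  final log = <String>[];

  Future<void> worker(String name) async {
    for (var i = 0; i < 3; i++) {
      log.add('$name$i');
      // Each await yields the (green) thread, so the other
      // worker gets a turn before this loop continues.
      await Future<void>.delayed(Duration.zero);
    }
  }

  // Start both workers without awaiting the first before starting
  // the second, so their iterations interleave on one hardware thread.
  await Future.wait([worker('a'), worker('b')]);
  return log;
}

Future<void> main() async {
  print((await runWorkers()).join(' ')); // a0 b0 a1 b1 a2 b2
}
```

If you remove the await inside the loop, each worker runs to completion before the other starts, because nothing ever yields.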

Now, when I said that 'you are in control', I lied.

The OS and the Cores know nothing about green threads. Green threads were invented by software developers and only exist within your application.

Green threads run on top of hardware threads and as such are still constrained by underlying time slices.

So if green threads actually run on hardware threads, why bother with green threads?

Because green threads make some software problems easier to solve.

Green threads (mostly) remove the need for locks, mutexes, semaphores etc that are required to stop two threads from updating the same piece of memory (variable) at the same time.

You can still deadlock a green thread but this usually involves Completers and is a fairly rare occurrence.

With green threads, you are guaranteed that only one piece of code runs at a time because under the hood you have only a single hardware thread.

The Dart runtime implements green threads; the most visible evidence is calls to await and yield.

Isolates

At its simplest level, you can equate an Isolate to a hardware thread.

When an Isolate starts, the Dart VM starts up its green thread engine.

While it's fairly complex, there is no reason you couldn't implement your own green thread model; there is nothing magical here, it's all Dart code.

Dart essentially has its own (green) thread 'ready to run' queue.

When you call await you are essentially asking Dart to put your thread to sleep while Dart runs the called function on a green thread.

Once the called (awaited) function completes, Dart will put your function into the ready to run queue and your code will be run in its turn.

As we are running green threads, our function will never resume until the called function completes, no matter how long the called function takes.

Now if the called function in turn calls another async function with an await, it will in turn be put to sleep and wait for that function to return and so on.

Dart may also place other functions in a 'ready to run' state, such as when a Future.delayed completes or a touch event occurs, so these other functions can also be run whilst your function is (a)waiting.

But at no time will any two pieces of code run at the same time (unless you start additional isolates), because each isolate is backed by a single hardware thread.

OK, so I lied again. Dart actually runs additional threads within the isolate but these don't run your code. They are used for operations such as GC and interacting with the underlying OS. Your code never runs on these additional hardware threads so you can continue to ignore them.
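A quick way to convince yourself that only one piece of code runs at a time is to hog the thread and watch a timer fire late. In this sketch (blockedTimerDelay is a hypothetical helper of mine), a 10 ms timer cannot fire while a busy loop holds the isolate's only thread for ~100 ms:

```dart
import 'dart:async';

Future<int> blockedTimerDelay() async {
  final sw = Stopwatch()..start();
  final fired = Completer<int>();

  // Ask for a callback in 10 ms.
  Timer(const Duration(milliseconds: 10),
      () => fired.complete(sw.elapsedMilliseconds));

  // Hog the isolate's single thread for ~100 ms without ever awaiting.
  while (sw.elapsedMilliseconds < 100) {}

  // The timer could not fire while we were busy; it only runs once
  // we yield here, so it reports ~100 ms rather than 10 ms.
  return fired.future;
}

Future<void> main() async {
  print('timer fired after ${await blockedTimerDelay()} ms');
}
```

The timer callback and the busy loop never overlap; the callback simply waits its turn on the single thread.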

When and how to use isolates

So hopefully, if you got here, you have a good understanding of what an isolate is (from a thread perspective).

When your Flutter app starts, you get a single isolate and all your code AND Flutter's libraries run on a single hardware thread.

This means that to repaint the screen 60 times a second, we need to ensure that no function runs for more than 16 ms without yielding or calling await (any IO will also yield).

If you break this rule, then your users will experience jank.
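Before reaching for an isolate, it is sometimes enough to chunk the work and yield between chunks so a frame can be painted. A rough sketch (chunkedSum is my own helper, and the chunk size is arbitrary):

```dart
import 'dart:async';

/// Sums [values] in chunks, yielding to the event loop between
/// chunks so the UI can repaint between pieces of work.
Future<int> chunkedSum(List<int> values, {int chunkSize = 10000}) async {
  var total = 0;
  for (var i = 0; i < values.length; i += chunkSize) {
    final end =
        (i + chunkSize < values.length) ? i + chunkSize : values.length;
    for (var j = i; j < end; j++) {
      total += values[j];
    }
    // Yield so any pending work (such as a frame) can run.
    await Future<void>.delayed(Duration.zero);
  }
  return total;
}

Future<void> main() async {
  final values = List<int>.generate(100000, (i) => i);
  print(await chunkedSum(values)); // 4999950000
}
```

This only helps when each chunk comfortably fits inside the frame budget; if a single chunk of the work is itself too slow, an isolate is the right tool.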

Now of course, how much work you can do is going to depend on the hardware your app is running on.

At some point, you need to decide what hardware you are going to support (iPhone 5, anyone?) and take that into account when working out how much you can do within 16 ms.

So we have decided that we need to use an isolate because on some devices we expect aLongFunction to take more than 16 ms.

We should note that it's not just a single call to aLongFunction that can cause problems but a collection of calls that together break the 16ms barrier. In these cases, we may need to use multiple isolates or refactor our code so that we can run the group of functions in a single isolate.

There are still some things we need to consider that will dictate how we use an isolate.

The core consideration is how long it takes to start an isolate and shut it down.

When I was still fairly new to Flutter, we had an app that fetched a large amount of json data from the server. We found that parsing the json data was taking way longer than 16 ms.

So being the hardened multi-threaded programmers we are, we decided to spin up an isolate.

Alas, our efforts failed. Even after processing the json data in an isolate the act of retrieving the processed data still took longer than 16ms.

The story here is that isolates are not just going to solve every jank problem, you need to understand how isolates utilise memory to get the full picture and potentially use several ancillary techniques to solve the problem.

Let's dig deeper into isolates to understand why our efforts failed and how isolates have changed since that ill-fated attempt.

First, let's get an idea of the overhead of using an isolate.

You can run the following app on your device of choice to see how long it takes to start an isolate (without copying any data).


import 'dart:isolate';

Future<void> main(List<String> args) async {
  var truth = false;

  // warm up the jit by running an empty isolate first.
  await Isolate.run(() {});

  // warm up the Stopwatch code paths as well.
  var stopwatch = Stopwatch()
    ..start()
    ..elapsed;
  stopwatch = Stopwatch()
    ..start()
    ..elapsed;
  stopwatch = Stopwatch()..start();

  // time a trivial statement as a baseline.
  stopwatch = Stopwatch()..start();
  truth = true;
  final t1 = stopwatch.elapsed;

  // time the run time of an isolate.
  stopwatch = Stopwatch()..start();
  await Isolate.run(() {
    truth = false;
  });

  final t2 = stopwatch.elapsed;

  print('t1: $t1, t2: $t2');
}

On my desktop computer, the output was: t1: 0:00:00.000002, t2: 0:00:00.000589

So on a fairly serious desktop computer, it takes 0.6 ms to start/stop an isolate that doesn't do anything.

My initial tests showed a startup time of about 6 ms. One of my colleagues suggested that I should first warm up the jit compiler by calling Isolate.run() before I started the timing run. The difference was stark: 6 ms vs 0.6 ms. This shouldn't be an issue with a normal (AOT) compiled app.

The start-up time on an old phone is likely to be longer.

I suggest that you run the above code on your target device(s) to get some metrics on isolate start-up time.

But this is only half the story.

Memory management

To deliver on its promise of 'simple' threading, isolates use a policy of 'share nothing'.

This isn't entirely true: as of about Dart 2.15 the Dart team changed isolates so that they share code pages (memory that stores your compiled code) and some VM internals, resulting in a considerable performance increase for isolate start-up.

This means that when an isolate starts it gets its own stack (because it is a hardware thread), heap and garbage collector.

That's right, each isolate gets its own GC. This greatly simplifies the Dart GC. With a Dart isolate, there is only one thread that the GC needs to worry about.

But the isolated heap (so now you can see where the name Isolate came from) is also a cause of performance problems with isolates.

In our little performance test, we passed no data to the isolate and then essentially passed no data back, and we also cheated by using Isolate.run, which uses a special 'exit' mechanism to reduce shutdown overhead.

Passing data to and retrieving it from an isolate adds additional overhead and the more you pass the bigger the overhead.
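You can get a feel for this overhead with a variation of the earlier benchmark. This sketch (timeRun is my own helper) times Isolate.run with ever larger captured payloads; expect the elapsed time to grow roughly with the payload size:

```dart
import 'dart:isolate';

Future<Duration> timeRun(List<int> payload) async {
  final sw = Stopwatch()..start();
  // The closure captures `payload`, so the whole list is copied
  // into the spawned isolate before the body runs.
  await Isolate.run(() => payload.length);
  return sw.elapsed;
}

Future<void> main() async {
  for (final size in [1000, 1000000, 100000000]) {
    final payload = List<int>.filled(size, 0);
    print('$size elements: ${await timeRun(payload)}');
  }
}
```

The absolute numbers will vary wildly between a desktop and an old phone, which is exactly why it's worth running something like this on your target devices.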

Memory

Let's take a closer look at the isolate’s use of memory.

One can expect the base memory overhead of an isolate to be in the order of 30 KB.

An Isolate's 'isolated' heap is both its superpower and its kryptonite.

As a superpower, it completely removes thread contention, which can be a particularly nasty problem to debug.

As kryptonite, an isolated heap means that you can't use caches for items such as a database because you can't keep each isolate's copy of the cache in sync. It also means that when you pass data to an isolate it needs to be copied from the current isolate's heap to the new isolate’s heap. To return any data the results also need to be copied back.

For small chunks of memory, the overhead is fairly small. For larger chunks, it can be a real problem.

I've often toyed with the idea of using FFI to directly allocate memory from the OS to create a cross-isolate cache. This would also require the use of the OS's locking APIs. If you have a few spare hours...

Sending objects

Now here is the little performance gem I hinted at.

As part of my research, I read Martin Kustermann's article on Isolate.run().

At first glance, I thought nothing of it but then I noticed something:

String filename = 'myfile.json';
final jsonData = await Isolate.run(() async {
    final fileData = await File(filename).readAsString();
    final jsonData = jsonDecode(fileData) as Map<String, dynamic>;
    return jsonData;
});

So note line 3 and its use of the variable 'filename':

final fileData = await File(filename).readAsString();

Hang on, filename is declared in the main isolate (in its own heap) but then accessed in the closure passed to 'run'. How is this possible? The closure runs in a separate isolate, which has its own heap. The 'two' filenames cannot be the same object so how does the code work?

Well, Martin helpfully answered my question with:

We send the <closure> that is given to Isolate.run(<closure>) and tell it to run the closure. Sending the <closure>, like anything we send to SendPort.send(<closure>), is transitively copied (isolates cannot share mutable data structures). Closures are just Dart objects that internally refer to heap objects that contain the variables they close over.

Edited for clarity.

Essentially what Martin is telling us, is that any variables referenced in the closure will be copied to the new isolate and we can use the same technique when using SendPort.send directly.
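As a sketch of that same transitive copy happening with the lower-level API (the Message class here is my own invention), note that the object sent over the SendPort, including everything it references, lands in the receiving isolate's heap:

```dart
import 'dart:isolate';

class Message {
  Message(this.payload);
  final List<int> payload;
}

Future<List<int>> fetchMessage() async {
  final port = ReceivePort();

  // Spawn a worker; anything it sends over the SendPort is
  // transitively copied into the receiving isolate's heap,
  // the Message object and its payload list included.
  await Isolate.spawn((SendPort replyTo) {
    replyTo.send(Message([1, 2, 3]));
  }, port.sendPort);

  final reply = await port.first as Message;
  return reply.payload;
}

Future<void> main() async {
  print(await fetchMessage()); // [1, 2, 3]
}
```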

But wait, there is more and here is the kicker:

In this particular example the filename is an immutable String object, so we actually share that string instead of copying it.

Wait, what?

Martin is essentially saying that passing immutable objects to an isolate is done at zero cost!

Let's look at this in practice:

The following code passes a 1 GB list to an isolate that accesses the first and last elements of the list.

Pass 1GB mutable memory


import 'dart:isolate';

void main() async {
  // create a 1GB list.
  final list = List.filled(1024 * 1024 * 1024, 'a');

  final stopwatch = Stopwatch()..start();

  final result = await Isolate.run(() {
    return Tuple(list[0], list[1024 * 1024 * 1024 - 1]);
  });
  print('Elapsed: ${stopwatch.elapsed}');
  print('Tuple: ${result.t1} ${result.t2}');
}

class Tuple<T1, T2> {
  Tuple(this.t1, this.t2);

  T1 t1;
  T2 t2;
}

It takes 5 seconds to run the isolate when we pass a 1 GB mutable list. Most of this time is spent copying the memory. The copy runs in our main isolate, so we have 5 seconds with no screen updates - from our user's perspective, the app has just locked up for 5 seconds.

Pass 1GB immutable memory

Now let's modify the code to pass an immutable list:

import 'dart:isolate';

void main() async {
  final list = List.filled(1024 * 1024 * 1024, 'a');
  // create an unmodifiable copy of the list
  final ulist = List<String>.unmodifiable(list);

  final stopwatch = Stopwatch()..start();

  final result = await Isolate.run(() {
    print('list: ${ulist[0]}');
    return Tuple(ulist[0], ulist[1024 * 1024 * 1024 - 1]);
  });
  print('Elapsed: ${stopwatch.elapsed}');
  print('Tuple: ${result.t1} ${result.t2}');
}

class Tuple<T1, T2> {
  Tuple(this.t1, this.t2);

  T1 t1;
  T2 t2;
}

Now the call to Isolate.run takes 2 ms to run!

We have gone from 5 sec to 2 ms by changing to an immutable list.

I hate to say it but; 'this one simple trick will help you lose belly fat'.

For once it's true!!!

But, it's not :D

The problem is that the data must be in an unmodifiable section of memory.

It's not sufficient to just use a roll-your-own immutable class.

If you look at the code for List.unmodifiable, it's actually an external call into the Dart VM. The call to List.unmodifiable copies the entire list and, I suspect, tags the memory as unmodifiable - something that your own code can't do.

But there is a solution.

Have a look at this code:

import 'dart:isolate';

void main() async {
  var stopwatch = Stopwatch()..start();

  final list = List<int>.generate(1024 * 1024 * 1024, (index) => index);
  print('BuildList: ${stopwatch.elapsed}');

  stopwatch = Stopwatch()..start();
  // break the list into unmodifiable segments
  final segments = await segment(list).toList();

  print('Segment List: ${stopwatch.elapsed}');

  stopwatch = Stopwatch()..start();

  // copy the list of segments to an unmodifiable list
  // the segments are not copied, just the references to them.
  final ulist = List<List<int>>.unmodifiable(segments);

  print('Unmodify segments: ${stopwatch.elapsed}');

  stopwatch = Stopwatch()..start();

  final result = await Isolate.run(() {
    return Tuple(ulist[0], ulist[(1024 * 1024) - 1]);
  });
  print('Isolate.run: ${stopwatch.elapsed}');
  print('Tuple: ${result.t1[0]} ${result.t2[0]}');
}

Stream<List<T>> segment<T>(List<T> list) async* {
  for (var i = 0; i < 1024 * 1024; i++) {
    final ulist = List<T>.unmodifiable(list.sublist(i * 1024, 1024 * (i + 1)));
    yield ulist;
  }
}

class Tuple<T1, T2> {
  Tuple(this.t1, this.t2);

  T1 t1;
  T2 t2;
}
The output on my desktop:

BuildList: 0:00:09.975946
Segment List: 0:00:11.187821
Unmodify segments: 0:00:00.052785
Isolate.run: 0:00:00.005120
Tuple: 0 1073740800

The clue here is the use of the segment function.

Instead of copying the list in one hit we split it into segments each of which is unmodifiable.

We then finally assemble the segments into an unmodifiable list of segments:

final ulist = List<List<int>>.unmodifiable(segments);

This call is quick as it doesn't need to copy the segments, just the list of segments.

Our call to Isolate.run is once again fast (5 ms).

But how does this help? The call to the segment function takes longer than the original call to List.unmodifiable (10 seconds vs 7).

How it helps is that it is closer to how we work in the real world.

If I go back to my original problem (parsing a large set of json) then the segment approach leads to a solution.

When reading a large json response we typically use a the http.listen method:


var segments = <List<int>>[];
// make a call to fetch a large chunk of json
final response = await Client().send(Request('GET',
    Uri.parse('https://my.webserver.com/get/bigjson.json')));
response.stream.listen((value) {
  // build the list of segments as each packet arrives.
  segments.add(List.unmodifiable(value));
  setState(() {});
}).onDone(() async {
  final ulist = List<List<int>>.unmodifiable(segments);
  final results = await Isolate.run(() {
    // parse the json held in ulist here.
  });
  showResults(results);
});

With this code, we are spreading out the calls to unmodifiable as we receive each packet back from the server.

Spreading the load avoids jank. When we are finally ready to process the json, it's already held as immutable data and so we minimise our call overhead to the isolate.

The result is that we have a real-world solution to passing large chunks of data to an isolate for processing.

It doesn't matter if you are processing json, images or assets. If you can load the resource in pieces then we can avoid jank when passing it to an isolate.

Returning objects

Now, as is my habit, I lied. As of about Dart 2.15, Google updated Dart's isolates to allow for a more efficient method of returning results.

The Isolate.exit method allows you to return a chunk of memory without it having to be copied between the isolates. Instead, the exiting isolate is immediately terminated and the ownership of the result is assigned to the parent isolate. This means that there is almost no overhead to return the data.

The downside is that you can't re-use the isolate.
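Isolate.run uses this mechanism under the hood; if you are on the lower-level spawn API you can call Isolate.exit yourself. A minimal sketch (computeInIsolate is my own wrapper, and the list size is arbitrary):

```dart
import 'dart:isolate';

Future<List<int>> computeInIsolate() async {
  final port = ReceivePort();

  await Isolate.spawn((SendPort resultPort) {
    // build a large result inside the worker isolate.
    final result = List<int>.generate(1024 * 1024, (i) => i);

    // Isolate.exit terminates this isolate and transfers ownership
    // of `result` to the receiving isolate - no copy is made.
    Isolate.exit(resultPort, result);
  }, port.sendPort);

  return await port.first as List<int>;
}

Future<void> main() async {
  print((await computeInIsolate()).length); // 1048576
}
```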

I've not tested it (that's work for next week) but I suspect that if we need to re-use an isolate we can also return immutable data for zero cost and keep the isolate alive.

Conclusions

Understanding how Dart Isolates pass memory is critical to using isolates effectively, and in this article we presented some techniques that will hopefully help you get more out of isolates.

  • Any variables referenced by a closure (lambda) that you pass to Isolate.run are copied to the spawned isolate unless they are immutable. Copying large amounts of mutable data is expensive.

  • Declaring a class as @immutable or final does not make it immutable. Only intrinsic types (String, int) and lists created with List.unmodifiable are immutable.

  • Be careful what you reference from the closure, as the copy is transitive and you can end up copying large chunks of data. In some circumstances, you may be better off moving some data into separate variables to limit the scope of what gets copied. In the above (silly) examples, we could have passed just the first and last elements of our gigabyte list and avoided the whole issue of having to make the list immutable.

  • We built our immutable list as we received it which spreads the work across multiple refresh frames. In our example, the HTTP packets are essentially received in the background allowing screen refreshes to continue without jank. Building an immutable list in the background makes it cheap to pass to an isolate when we have the final full data set.

  • When using Isolate.run the cost of returning data is zero whether the data is mutable or immutable.

  • Starting an isolate is pretty quick (faster than I expected).

I originally intended to cover the topic of latency and using pools but this article has already dragged on, so I will write a part 2 next week to finish off our discussion on isolates and why you might need to re-use them.

Isolate.run is a real game changer for Dart and with a little care you can use it to remove jank when processing large data sets.

The Dart Side Blog is supported by OnePub the Dart private repository SaaS.