expLog

[WIP] Exploring asyncio

I've always found Python's asyncio to be simultaneously intuitive, and yet magical. This as an attempt at looking behind the curtain to understand exactly what happens while running asyncio.

Starting with an extremely minimal example:

import asyncio

async def sleepy_hello():
    await asyncio.sleep(1)
    print("Hello, world")

asyncio.run(sleepy_hello())
Hello, world

It should be instructive to inspect the source for both asyncio.sleep and asyncio.run to start:

Inside Asyncio

PEP 3156

Notes from the PEP that added the asyncio module:

The api was designed in 2 parts:

  • An interface for managing the current event loop: set_event_loop
  • An interface for a conforming event loop: AbstractEventLoop
    • The actual methods are specified in the overview.
      • I didn't realize that the network/socket mechanisms were part of the event loop.

Design decisions:

  • The API doesn't explicitly rely on async/await, which is something I hadn't consciously realized while reading the code.
  • Event loop methods returning coroutines can return a coroutine or a future.

There's an implicit scheduler that can work with other event loops: I need to find out more about this: the behavior is implemented through the Task and Future classes relying on the event loop interface.

The Task class (which I must admit to disliking) is to allow for interoperability between code written using coroutines and other async frameworks.

The asyncio module also standardizes the API between the transport and the protocol.

There are references to passing around an event loop explicitly, but that behavior has since been deprecated.

Exploring async def

A function execution is fairly easy to reason about, but that's not true for a coroutine, generator and async generator. Interestingly enough, the disassembly of both the functions and any calls to these functions is identical: a difference is only observable in the flags for the code object for coroutines, and the return value from a coroutine which is the code object and not the function's result.

import dis

def function():
    print("Hello, world")
print("Plain function disassembly")
print(dis.dis(function))

async def coroutine():
    print("Hello, world")
print("Plain coroutine disassembly")
print(dis.dis(coroutine))

def call_function():
    res = function()
print("Call function disassembly")
print(dis.dis(call_function))

def call_coroutine():
    res = coroutine()
print("Call coroutine disassembly")
print(dis.dis(call_coroutine))

print("Function code flags")
print(function.__code__.co_flags)
print("Couroutine code flags")
print(coroutine.__code__.co_flags)

print("Function return value")
print(function())
print("Coroutine return value")
print(coroutine())

asyncio.run

def run(main, *, debug=False):
    """ <... snip almost exactly the code above ...> """
    if events._get_running_loop() is not None:
        raise RuntimeError(
            "asyncio.run() cannot be called from a running event loop")

    if not coroutines.iscoroutine(main):
        raise ValueError("a coroutine was expected, got {!r}".format(main))

    loop = events.new_event_loop()
    try:
        events.set_event_loop(loop)
        loop.set_debug(debug)
        return loop.run_until_complete(main)
    finally:
        try:
            _cancel_all_tasks(loop)
            loop.run_until_complete(loop.shutdown_asyncgens())
            loop.run_until_complete(loop.shutdown_default_executor())
        finally:
            events.set_event_loop(None)
            loop.close()

run creates its own event loop, enabling debugging appropriately and then goes on to run_until_complete. There's a lot to unpack here as well: the underlying event loop implementation, how run_until_complete functions; exception propagation from the loop (can main) break out and trigger an exception outside the event loop? amongst other things.

asyncio.sleep

async def sleep(delay, result=None, *, loop=None):
    """Coroutine that completes after a given time (in seconds)."""
    if delay <= 0:
        await __sleep0()
        return result

    if loop is None:
        loop = events.get_running_loop()
    else:
        warnings.warn("The loop argument is deprecated since Python 3.8, "
                      "and scheduled for removal in Python 3.10.",
                      DeprecationWarning, stacklevel=2)

    future = loop.create_future()
    h = loop.call_later(delay,
                        futures._set_result_unless_cancelled,
                        future, result)
    try:
        return await future
    finally:
        h.cancel()

This is interesting: sleep relies on the underlying event loop's call_later function for the actual implementation.

Navigating through the definitions of events.get_running_loop() – and detouring through _asynciomodule.c to get_event_loop_policy – I can see that asyncio's __init__.py chooses which events file to use. In the interest of my sanity, I'm going to stick to Unix: which brings me to unix_events.py: _UnixDefaultEventLoopPolicy.

The actual call definition of call_later happens in the base event loop (in base_events.py) with a TimerHandle (basically a marker to define when the callback will happen). More interestingly, the timer is pushed into a heap maintained in the class: _scheduled.

The actual magic happens in BaseEventLoop._run_once: the event loop does book-keeping to keep a reasonably sized heap, discarding cancelled events, and then it calls self._selector.select(timeout) and proceeds to trigger callbacks for all scheduled work after the timeout.

The Selector

self._selector looks interesting enough to warrant a section to itself, and where I expect a significant amount of the magic to happen.

It looks like the selector is set up for a "Selector"EventLoop, and unless otherwise specified it's set up as the default from the selectors python module.

Navigating over to selectors shows several different implementations of selectors, of which I'd like to know more about epoll, which relies on select.epoll, defined in selectmodule.c.

Looking briefly at the code under selectmodule.c I basically see a pythonic wrapper for the epoll api; using epoll_create1, and the other syscalls you would expect.

One way to observe what's happening would be to strace this program: I ran

strace -o strace_output -vyyff python3 -X dev minasyncio.py
<..snip.., removed inline getpid() calls as well>
read(3</usr/lib/python3.8/asyncio/selector_events.py>, "\"\"\"Event loop using a selector a"..., 39009) = 39008
read(3</usr/lib/python3.8/asyncio/selector_events.py>, "", 1) = 0
close(3</usr/lib/python3.8/asyncio/selector_events.py>) = 0
epoll_create1(EPOLL_CLOEXEC)            = 3<anon_inode:[eventpoll]>
socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4<UNIX:[43847->43848]>, 5<UNIX:[43848->43847]>]) = 0
getsockname(4<UNIX:[43847->43848]>, {sa_family=AF_UNIX}, [128->2]) = 0
getsockname(5<UNIX:[43848->43847]>, {sa_family=AF_UNIX}, [128->2]) = 0
ioctl(4<UNIX:[43847->43848]>, FIONBIO, [1]) = 0
ioctl(5<UNIX:[43848->43847]>, FIONBIO, [1]) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 4<UNIX:[43847->43848]>, {EPOLLIN, {u32=4, u64=140647294042116}}) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 1000) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
write(1</dev/pts/1<char 136:1>>, "Hello, world!\n", 14) = 14
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 4<UNIX:[43847->43848]>, 0x7fff9591888c) = 0
close(4<UNIX:[43847->43848]>)           = 0
close(5<UNIX:[43848->43847]>)           = 0
close(3<anon_inode:[eventpoll]>)        = 0
<...snip...>

This is fairly fascinating, because I can also see it looking up the python file for selectors right before using epoll.

There's the epoll_wait call, with a timeout of 1000 – or a second, after which we print hello world. Finally, the fd is closed with another call to epoll_ctl.

What's interesting in there is the additional calls to socketpair, looking up their names, and then the ioctl calls.

The ioctl calls with FIONBIO mark the sockets to be nonblocking. Socket 4 is added as a file descriptor to watch for while epoll is waiting, and finally removed.

socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4<UNIX:[43847->43848]>, 5<UNIX:[43848->43847]>]) = 0
getsockname(4<UNIX:[43847->43848]>, {sa_family=AF_UNIX}, [128->2]) = 0
getsockname(5<UNIX:[43848->43847]>, {sa_family=AF_UNIX}, [128->2]) = 0
ioctl(4<UNIX:[43847->43848]>, FIONBIO, [1]) = 0
ioctl(5<UNIX:[43848->43847]>, FIONBIO, [1]) = 0

The answer to this particular mystery is a "self pipe" created in BaseSelectorEventLoop: it goes on to add a self reader, and sets up a mechanism to receive messages on the socket from the other pair. The actual magic tentatively happens in BaseSelectorEventLoop._process_self_data, but the default implementation is empty. Happily enough, it's defined in unix_events.py and looks to be used for signal handling.

It goes on to register a fake signal handler: which will wake up the second socket (fd 5) in this case and wake up socket (fd 4); but we're not quite done yet. The callback for the signal handler is added to the BaseEventLoop._ready deque, and another byte is written to the same self-pipe to wake up the event loop (because epoll is also waiting on this file descriptor). Whew, that's tricky.

(I later found out that this is known as the self-pipe trick, to allow polling without worrying about signal handlers.)

It's good to keep the file hierarchy in mind while reading this code: base_events.py defines BaseEventLoop, selector_events.py defines BaseSelectorEventLoop which allows "notify-when-ready" multiplexing (from the docs). Finally, unix_events.py defines the actual version: SelectorEventLoop = _UnixSelectorEventLoop which has the custom signal handling.

A custom trace

As part of building a deeper understanding of Python, I've started working on a tracer, particularly for asyncio. As a first attempt, I was also able to set up a simpler tracer to explore exactly what code gets executed with asyncio.

I prefer using sys.settrace and chrome://tracing manually because they give me much more flexibility about what's shown and I it's a more intuitive API for me compared to the tracer module which has always felt somewhat clunky, and particularly hard to parse meaningfully.

My life in Android introduced me to Catapult, and the tracer that comes with chrome: it's incredibly powerful, and allows easily zooming around with keyboard shortcuts (almost like a video game).

minasynciotrace.png

If you'd like to actually play with the trace, I've also uploaded it: you can download it, navigate to chrome://tracing (in Chrome), and upload it there. The trace shows every function call, and within a function call the lines of code that were executed.

I'm on Python version 3.8.3, which sets the line numbers accordingly. The source code is the same as the example up top except for the duration of the sleep, which I reduced to 5ms to allow other components of the program to show up.

The fascinating part of seeing this trace is observing more calls to code that I can't really explain yet: there's a lot of book-keeping happening at the end of the loop that I hadn't really expected to see.

I'd like to extend this tracer to also show lines connecting work being scheduled and work actually running, but that might prove to be extremely tricky.

There are 2 caveats here, one of which I find fairly fascinating:

  • The call to the coroutine does not show up in the trace. I would have expected the coroutine's evaluation to show up, but that clearly doesn't happen.
    • This also suggests that debugging coroutines is probably fairly tricky.
  • The call to C functions don't show up in the trace, but that's well documented and fairly expected.

dis.dis

Another way to look at the program is to look at the disassembled output, which is happily terse and explorable:

In [4]: dis.dis(minasyncio)
Disassembly of hello:
  6           0 LOAD_GLOBAL              0 (asyncio)
              2 LOAD_METHOD              1 (sleep)
              4 LOAD_CONST               1 (1)
              6 CALL_METHOD              1
              8 GET_AWAITABLE
             10 LOAD_CONST               0 (None)
             12 YIELD_FROM
             14 POP_TOP

  7          16 LOAD_GLOBAL              2 (print)
             18 LOAD_CONST               2 ('Hello, world!')
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE

The most interesting bytecode for me here is GET_AWAITABLE: which is briefly described in the documentation for dis as something that either returns the object if it has a particular flag (CO_ITERABLE_COROUTINE) set, or resolves object.__await__.

The other most interesting bytecode in here is YIELD_FROM, which is defined as "Pops top of stack and delegates to it as a subiterator from a generator." I also can't particularly explain the bytecode at offset 10: LOAD_CONST 0.

Looking into the definition of coroutines, async, etc. in Python I ran across the initial commit by Yury Selivanov that added support for async to Python. The discussion around this commit is on bugs.python.org, and takes less than a month to land. It's a very impressive change, apparently touching everything significant in Python: the parser, evaluation, grammar, etc. The Pep is also worth a read.

I don't understand what that actually means, so I'll have to look around the python source code a little bit.

Running under GDB

Yet another way to look at what the program is doing is to have it run under gdb: which should also help me find the right places to look into in Python itself.

epoll

I've been aiming to look behind the curtain in this post, but there's a pretty big elephant in the room that I haven't talked about in any detail whatsoever: and that bright, pink elephant is known as epoll.

Understanding how epoll itself works, how the kernel handles threads, etc. and how the epoll_wait syscall work promises to be a somewhat harder challenge.

Consider a very simple program which lets 10 tasks sleep in parallel, and then prints out the results.

import asyncio
import time


async def sleeper(message):
    await asyncio.sleep(1) # seconds
    print(f"{message}", end="…")


async def concurrent_sleepers():
    await asyncio.gather(*[sleeper(label) for label in range(10)])


async def main():
    # TODO Introduce the timer abstraction to reduce noise
    start_time_s = time.perf_counter()
    await concurrent_sleepers()
    end_time_s = time.perf_counter()
    print(f"({end_time_s - start_time_s} seconds)")


asyncio.run(main())
0…1…2…3…4…5…6…7…8…9…(1.0026649279752746 seconds)
view source