#+TITLE: [WIP] Exploring asyncio
#+OPTIONS: ^:nil

I've always found Python's asyncio to be simultaneously intuitive and
yet magical. This is an attempt at looking behind the curtain to
understand exactly what happens while running asyncio.

Starting with an extremely minimal example:
#+begin_src python :results output :exports both
import asyncio

async def sleepy_hello():
    await asyncio.sleep(1)
    print("Hello, world")

asyncio.run(sleepy_hello())
#+end_src

To start, it should be instructive to inspect the source for both
asyncio.sleep and asyncio.run:

** Inside Asyncio
*** PEP 3156
Notes from the PEP that added the asyncio module:

The API was designed in two parts:
- An interface for managing the current event loop: set_event_loop
- An interface for a conforming event loop: AbstractEventLoop
  - The actual methods are specified in the overview.
    - I didn't realize that the network/socket mechanisms were part of
      the event loop.
  
Design decisions:
- The API doesn't explicitly rely on async/await, which is something I
  hadn't consciously realized while reading the code.
- Event loop methods that are documented as returning coroutines may
  actually return either a coroutine or a future.

There's an implicit scheduler that can work with other event loops
(something I need to find out more about): the behavior is implemented
through the Task and Future classes, which rely only on the event loop
interface.

The Task class (which I must admit to disliking) exists to allow
interoperability between code written using coroutines and other async
frameworks.

The asyncio module also standardizes the API between the transport and
the protocol.

There are references to passing around an event loop explicitly, but
that behavior has since been deprecated.




*** Exploring async def
A function execution is fairly easy to reason about, but that's not
true for a coroutine, generator, or async generator. Interestingly
enough, the disassembly of both the functions and any calls to these
functions is identical: a difference is only observable in the flags
on the coroutine's code object, and in the return value from calling
the coroutine, which is a coroutine object rather than the function's
result.

#+begin_src python :results output
import dis

def function():
    print("Hello, world")
print("Plain function disassembly")
print(dis.dis(function))

async def coroutine():
    print("Hello, world")
print("Plain coroutine disassembly")
print(dis.dis(coroutine))

def call_function():
    res = function()
print("Call function disassembly")
print(dis.dis(call_function))

def call_coroutine():
    res = coroutine()
print("Call coroutine disassembly")
print(dis.dis(call_coroutine))

print("Function code flags")
print(function.__code__.co_flags)
print("Coroutine code flags")
print(coroutine.__code__.co_flags)

print("Function return value")
print(function())
print("Coroutine return value")
print(coroutine())

#+end_src

#+RESULTS:
#+begin_example
Plain function disassembly
  4           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello, world')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
None
Plain coroutine disassembly
  9           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Hello, world')
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
None
Call function disassembly
 14           0 LOAD_GLOBAL              0 (function)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (res)
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE
None
Call coroutine disassembly
 19           0 LOAD_GLOBAL              0 (coroutine)
              2 CALL_FUNCTION            0
              4 STORE_FAST               0 (res)
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE
None
Function code flags
67
Coroutine code flags
195
Function return value
Hello, world
None
Coroutine return value
<coroutine object coroutine at 0x7fa3f24283c0>
#+end_example
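
The flag values (67 and 195) can be decoded with the flag names that
dis exposes; a small sketch, where the difference should come down to
the CO_COROUTINE bit:
#+begin_src python :results output
import dis

def function():
    print("Hello, world")

async def coroutine():
    print("Hello, world")

def flag_names(code):
    # Decode co_flags into the names dis knows about.
    return [name for bit, name in dis.COMPILER_FLAG_NAMES.items()
            if code.co_flags & bit]

print(flag_names(function.__code__))   # e.g. ['OPTIMIZED', 'NEWLOCALS', 'NOFREE']
print(flag_names(coroutine.__code__))  # the same list, plus 'COROUTINE'
#+end_src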

    
*** asyncio.run
#+begin_src python
def run(main, *, debug=False):
    """ <... snip almost exactly the code above ...> """
    if events._get_running_loop() is not None:
        raise RuntimeError(
            "asyncio.run() cannot be called from a running event loop")

    if not coroutines.iscoroutine(main):
        raise ValueError("a coroutine was expected, got {!r}".format(main))

    loop = events.new_event_loop()
    try:
        events.set_event_loop(loop)
        loop.set_debug(debug)
        return loop.run_until_complete(main)
    finally:
        try:
            _cancel_all_tasks(loop)
            loop.run_until_complete(loop.shutdown_asyncgens())
            loop.run_until_complete(loop.shutdown_default_executor())
        finally:
            events.set_event_loop(None)
            loop.close()
#+end_src

run creates its own event loop, enables debugging appropriately, and
then goes on to run_until_complete. There's a lot to unpack here as
well: the underlying event loop implementation, how run_until_complete
functions, and exception propagation from the loop (can main break out
and trigger an exception outside the event loop?), amongst other
things.
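
One quick experiment answers that last question: an exception raised
inside main does propagate out of asyncio.run, since run_until_complete
returns (or re-raises) the coroutine's outcome. A small sketch:
#+begin_src python :results output
import asyncio

async def fails():
    raise ValueError("boom")

try:
    asyncio.run(fails())
except ValueError as exc:
    # The exception raised inside the coroutine escapes the event loop
    # and propagates out of asyncio.run.
    print(f"caught outside the loop: {exc}")
#+end_src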

*** asyncio.sleep
#+begin_src python
async def sleep(delay, result=None, *, loop=None):
    """Coroutine that completes after a given time (in seconds)."""
    if delay <= 0:
        await __sleep0()
        return result

    if loop is None:
        loop = events.get_running_loop()
    else:
        warnings.warn("The loop argument is deprecated since Python 3.8, "
                      "and scheduled for removal in Python 3.10.",
                      DeprecationWarning, stacklevel=2)

    future = loop.create_future()
    h = loop.call_later(delay,
                        futures._set_result_unless_cancelled,
                        future, result)
    try:
        return await future
    finally:
        h.cancel()
#+end_src

This is interesting: sleep relies on the underlying event loop's
call_later function for the actual implementation.
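
To make that concrete, here's a sketch that reproduces sleep's core
idea at the user level with loop.create_future and loop.call_later
(skipping the _set_result_unless_cancelled helper the real code uses):
#+begin_src python :results output
import asyncio

async def manual_sleep(delay, result=None):
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # Ask the event loop to resolve the future after `delay` seconds.
    handle = loop.call_later(delay, future.set_result, result)
    try:
        return await future
    finally:
        handle.cancel()

print(asyncio.run(manual_sleep(0.1, "done sleeping")))
#+end_src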

Navigating through the definitions of events.get_running_loop() -- and
detouring through _asynciomodule.c to get_event_loop_policy -- I can
see that asyncio's __init__.py chooses which events file to use. In
the interest of my sanity, I'm going to stick to Unix, which brings me
to unix_events.py and _UnixDefaultEventLoopPolicy.
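
A quick check of which policy is actually in use (this is what I'd
expect to see on Linux):
#+begin_src python :results output
import asyncio

# On Linux the default policy comes from unix_events.py.
print(type(asyncio.get_event_loop_policy()))
print(asyncio.DefaultEventLoopPolicy)
#+end_src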

The actual implementation of call_later lives in the base event loop
(in base_events.py) and creates a TimerHandle (basically a marker for
when the callback should fire). More interestingly, the timer is
pushed onto a heap maintained in the class: _scheduled.

The actual magic happens in BaseEventLoop._run_once: the event loop
does book-keeping to keep the heap reasonably sized, discarding
cancelled events, then calls self._selector.select(timeout), and
proceeds to trigger callbacks for all scheduled work whose deadline
has passed.
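
To summarize the mechanism, here's a toy sketch of the
call_later/_run_once pattern (not asyncio's actual code: it ignores
cancellation, the _ready queue, and registered file descriptors, and
assumes a Unix selector):
#+begin_src python :results output
import heapq
import itertools
import selectors
import time

class TinyLoop:
    """A toy loop: timers live in a heap; select() is used purely as a
    timed wait."""
    def __init__(self):
        self._selector = selectors.DefaultSelector()
        self._scheduled = []  # heap of (deadline, tiebreaker, callback)
        self._counter = itertools.count()

    def call_later(self, delay, callback):
        deadline = time.monotonic() + delay
        heapq.heappush(self._scheduled,
                       (deadline, next(self._counter), callback))

    def run_once(self):
        # Block in select() until the earliest timer is due.
        timeout = None
        if self._scheduled:
            timeout = max(0, self._scheduled[0][0] - time.monotonic())
        self._selector.select(timeout)  # no fds registered: just a timed wait
        now = time.monotonic()
        while self._scheduled and self._scheduled[0][0] <= now:
            *_, callback = heapq.heappop(self._scheduled)
            callback()

loop = TinyLoop()
loop.call_later(1, lambda: print("Hello, world"))
while loop._scheduled:
    loop.run_once()
#+end_src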

*** The Selector
self._selector looks interesting enough to warrant a section to
itself, and it's where I expect a significant amount of the magic to
happen.

It looks like the selector is set up by the SelectorEventLoop (hence
the name), and unless otherwise specified it's the default selector
from Python's selectors module.

Navigating over to selectors shows several different implementations
of selectors; the one I'd like to know more about is epoll, which
relies on select.epoll, defined in selectmodule.c.

Looking briefly at the code under selectmodule.c, I basically see a
Pythonic wrapper for the epoll API, using epoll_create1 and the other
syscalls you would expect.
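
To see that API in isolation, here's a small sketch using select.epoll
directly (Linux only), with a socketpair standing in for something
interesting to wait on:
#+begin_src python :results output
import select
import socket

r, w = socket.socketpair()
ep = select.epoll()
ep.register(r.fileno(), select.EPOLLIN)

print(ep.poll(0))    # [] -- nothing readable yet
w.send(b"x")
print(ep.poll(0))    # [(r's fd, select.EPOLLIN)]

ep.unregister(r.fileno())
ep.close()
r.close()
w.close()
#+end_src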

One way to observe what's happening would be to strace this program: I
ran
#+begin_src sh
strace -o strace_output -vyyff python3 -X dev minasyncio.py
#+end_src

#+begin_src c
<..snip.., removed inline getpid() calls as well>
read(3</usr/lib/python3.8/asyncio/selector_events.py>, "\"\"\"Event loop using a selector a"..., 39009) = 39008
read(3</usr/lib/python3.8/asyncio/selector_events.py>, "", 1) = 0
close(3</usr/lib/python3.8/asyncio/selector_events.py>) = 0
epoll_create1(EPOLL_CLOEXEC)            = 3<anon_inode:[eventpoll]>
socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4<UNIX:[43847->43848]>, 5<UNIX:[43848->43847]>]) = 0
getsockname(4<UNIX:[43847->43848]>, {sa_family=AF_UNIX}, [128->2]) = 0
getsockname(5<UNIX:[43848->43847]>, {sa_family=AF_UNIX}, [128->2]) = 0
ioctl(4<UNIX:[43847->43848]>, FIONBIO, [1]) = 0
ioctl(5<UNIX:[43848->43847]>, FIONBIO, [1]) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 4<UNIX:[43847->43848]>, {EPOLLIN, {u32=4, u64=140647294042116}}) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 1000) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
write(1</dev/pts/1<char 136:1>>, "Hello, world!\n", 14) = 14
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_wait(3<anon_inode:[eventpoll]>, [], 1, 0) = 0
epoll_ctl(3<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 4<UNIX:[43847->43848]>, 0x7fff9591888c) = 0
close(4<UNIX:[43847->43848]>)           = 0
close(5<UNIX:[43848->43847]>)           = 0
close(3<anon_inode:[eventpoll]>)        = 0
<...snip...>
#+end_src

This is fairly fascinating, because I can also see it looking up the
Python file for selectors right before using epoll.

There's the epoll_wait call with a timeout of 1000 ms -- one second --
after which we print hello world. Finally, the fd is removed from the
epoll set with another call to epoll_ctl and then closed.

What's interesting in there are the additional calls to socketpair,
the getsockname calls looking up their names, and then the ioctl calls.

The ioctl calls with FIONBIO mark the sockets as nonblocking. Socket 4
is added as a file descriptor for epoll to watch while it waits, and
is finally removed.

#+begin_src c
socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [4<UNIX:[43847->43848]>, 5<UNIX:[43848->43847]>]) = 0
getsockname(4<UNIX:[43847->43848]>, {sa_family=AF_UNIX}, [128->2]) = 0
getsockname(5<UNIX:[43848->43847]>, {sa_family=AF_UNIX}, [128->2]) = 0
ioctl(4<UNIX:[43847->43848]>, FIONBIO, [1]) = 0
ioctl(5<UNIX:[43848->43847]>, FIONBIO, [1]) = 0
#+end_src

The answer to this particular mystery is a "self pipe" created in
BaseSelectorEventLoop: it goes on to add a self reader, setting up a
mechanism to receive messages on the socket from the other end of the
pair. The actual magic tentatively happens in
BaseSelectorEventLoop._process_self_data, but the default
implementation is empty. Happily enough, it's overridden in
unix_events.py and looks to be used for signal handling.

It goes on to register a fake signal handler: when a signal arrives, a
byte is written to the second socket (fd 5 in this case), which wakes
up epoll waiting on socket fd 4; but we're not quite done yet. The
callback for the signal handler is added to the BaseEventLoop._ready
deque, and another byte is written to the same self-pipe to wake up
the event loop (because epoll is also waiting on this file
descriptor). Whew, that's tricky.

(I later found out that this is known as the self-pipe trick, to allow
polling without worrying about signal handlers.)
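
Here's a toy reconstruction of that trick (Unix only, and not
asyncio's code): the "loop" blocks in select() on one end of a
socketpair, and a signal handler writes a byte to the other end to
wake it up.
#+begin_src python :results output
import selectors
import signal
import socket

rsock, wsock = socket.socketpair()
rsock.setblocking(False)
wsock.setblocking(False)

selector = selectors.DefaultSelector()
selector.register(rsock, selectors.EVENT_READ)

def handler(signum, frame):
    # Runs at some arbitrary point; just poke the other end of the pair.
    wsock.send(b"\0")

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)  # deliver SIGALRM in one second

selector.select()  # blocks until the handler writes to wsock
print("woken up:", rsock.recv(1))
#+end_src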

It's good to keep the file hierarchy in mind while reading this code:
base_events.py defines BaseEventLoop; selector_events.py defines
BaseSelectorEventLoop, which allows "notify-when-ready" multiplexing
(from the docs). Finally, unix_events.py defines the actual version:
SelectorEventLoop = _UnixSelectorEventLoop, which has the custom
signal handling.
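
A quick way to see that hierarchy from a live loop (the output below
is what I'd expect on Linux):
#+begin_src python :results output
import asyncio

loop = asyncio.new_event_loop()
# On Linux: _UnixSelectorEventLoop -> BaseSelectorEventLoop
# -> BaseEventLoop -> AbstractEventLoop -> object
for cls in type(loop).__mro__:
    print(cls)
loop.close()
#+end_src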

*** A custom trace
As part of building a deeper understanding of Python, I've started
working on a tracer, particularly for asyncio. As a first attempt, I
was also able to set up a simpler tracer to explore exactly what code
gets executed with asyncio.

I prefer using sys.settrace and chrome://tracing manually because they
give me much more flexibility about what's shown, and it's a more
intuitive API for me compared to the trace module, which has always
felt somewhat clunky, and particularly hard to parse meaningfully.

My life in Android introduced me to Catapult, and the tracer that
comes with chrome: it's incredibly powerful, and allows easily zooming
around with keyboard shortcuts (almost like a video game).
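
As a rough illustration of the approach (a simplified sketch, not the
tracer I actually used): sys.settrace delivers a callback per
call/return, and chrome://tracing accepts a JSON list of begin/end
("B"/"E") events in the documented Trace Event Format.
#+begin_src python :results output
import asyncio
import json
import os
import sys
import threading
import time

trace_events = []

def tracer(frame, event, arg):
    # Record "B"/"E" (begin/end) slices with timestamps in microseconds,
    # which is what the Trace Event Format expects.
    if event in ("call", "return"):
        trace_events.append({
            "name": frame.f_code.co_name,
            "cat": frame.f_code.co_filename,
            "ph": "B" if event == "call" else "E",
            "ts": time.perf_counter() * 1e6,
            "pid": os.getpid(),
            "tid": threading.get_ident(),
        })
    return tracer

async def sleepy_hello():
    await asyncio.sleep(0.005)
    print("Hello, world")

sys.settrace(tracer)
asyncio.run(sleepy_hello())
sys.settrace(None)

with open("trace.json", "w") as f:
    json.dump(trace_events, f)  # load this file at chrome://tracing
#+end_src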

[[file:../static/images/minasynciotrace.png]]

If you'd like to actually play with the trace, I've also uploaded it:
you can download it, navigate to chrome://tracing (in Chrome), and
upload it there. The trace shows every function call, and within a
function call the lines of code that were executed. 

I'm on Python 3.8.3, so the line numbers in the trace correspond to
that version's source. The traced code is the same as the example up
top except for the duration of the sleep, which I reduced to 5 ms so
that other components of the program show up.

The fascinating part of seeing this trace is observing more calls to
code that I can't really explain yet: there's a lot of book-keeping
happening at the end of the loop that I hadn't really expected to see.

I'd like to extend this tracer to also show lines connecting work
being scheduled and work actually running, but that might prove to be
extremely tricky.

There are two caveats here, one of which I find fairly fascinating:
- The call to the coroutine does not show up in the trace. I would
  have expected the coroutine's evaluation to show up, but that
  clearly doesn't happen.
  - This also suggests that debugging coroutines is probably fairly
    tricky.
- Calls to C functions don't show up in the trace, but that's well
  documented and fairly expected.

*** dis.dis
Another way to look at the program is to look at the disassembled
output, which is happily terse and explorable:
#+begin_src python
In [4]: dis.dis(minasyncio)
Disassembly of hello:
  6           0 LOAD_GLOBAL              0 (asyncio)
              2 LOAD_METHOD              1 (sleep)
              4 LOAD_CONST               1 (1)
              6 CALL_METHOD              1
              8 GET_AWAITABLE
             10 LOAD_CONST               0 (None)
             12 YIELD_FROM
             14 POP_TOP

  7          16 LOAD_GLOBAL              2 (print)
             18 LOAD_CONST               2 ('Hello, world!')
             20 CALL_FUNCTION            1
             22 POP_TOP
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE
#+end_src

The most interesting bytecode for me here is GET_AWAITABLE, which is
briefly described in the documentation for dis as something that
either returns the object if it has a particular flag
(CO_ITERABLE_COROUTINE) set, or resolves object.__await__.

The other bytecode that stands out is YIELD_FROM, which is defined as
"Pops top of stack and delegates to it as a subiterator from a
generator." I also can't particularly explain the bytecode at offset
10: LOAD_CONST 0.
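
My read of that LOAD_CONST is that it pushes the initial None sent
into the subiterator (the same way a fresh generator has to be started
with send(None)), though I haven't verified that against the ceval
code. Either way, the mechanics can be poked at from plain Python: an
object with __await__ is awaitable, and a coroutine can be driven by
hand with send(), just like a generator. A small sketch:
#+begin_src python :results output
class Ready:
    def __init__(self, value):
        self.value = value

    def __await__(self):
        # GET_AWAITABLE resolves this; YIELD_FROM then delegates to the
        # generator it returns.
        yield "something for the event loop to chew on"
        return self.value

async def uses_await():
    return await Ready(42)

coro = uses_await()
print(coro.send(None))   # whatever __await__ yielded
try:
    coro.send(None)      # resume: the await completes
except StopIteration as stop:
    print(stop.value)    # 42, the coroutine's return value
#+end_src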

Looking into the definition of coroutines, async, etc. in Python, I
ran across the initial commit by Yury Selivanov that added support for
async to Python. The discussion around this commit is on
bugs.python.org, and the change took less than a month to land. It's a
very impressive change, apparently touching everything significant in
Python: the parser, evaluation, grammar, etc. The PEP is also worth a
read.

I don't understand what that actually means, so I'll have to look
around the Python source code a little bit.

*** Running under GDB
Yet another way to look at what the program is doing is to run it
under gdb, which should also help me find the right places to look
into in Python itself.


** epoll
I've been aiming to look behind the curtain in this post, but there's
a pretty big elephant in the room that I haven't talked about in any
detail whatsoever; that bright pink elephant is known as epoll.

Understanding how epoll itself works, how the kernel handles threads,
and how the epoll_wait syscall works promises to be a somewhat harder
challenge.


Consider a very simple program that lets 10 tasks sleep concurrently,
and then prints out the results.

#+begin_src python :results output :exports both
import asyncio
import time


async def sleeper(message):
    await asyncio.sleep(1) # seconds
    print(f"{message}", end="…")


async def concurrent_sleepers():
    await asyncio.gather(*[sleeper(label) for label in range(10)])


async def main():
    # TODO Introduce the timer abstraction to reduce noise
    start_time_s = time.perf_counter()
    await concurrent_sleepers()
    end_time_s = time.perf_counter()
    print(f"({end_time_s - start_time_s} seconds)")


asyncio.run(main())
#+end_src

#+RESULTS:
: 0…1…2…3…4…5…6…7…8…9…(1.0022245349828154 seconds)


* Rough Notes                                               :ignore:noexport:
** TODO Figure out what voice to use
** Talk about errors  
 - Task got bad yield

** Background
@@comment: Make a background section in the outline@@
AsyncIO, and event-loops in general, have always been a little magical
to me: it's never been entirely clear to me how they actually
work. I've always had a vague idea that some magical syscalls make all
of this work under the hood, but that's about where it ends.

** References
***  AOSA Book: A Web Crawler With asyncio Coroutines 
This is an excellent article and probably a better place to start
reading about this. It covers asyncio's concepts from the ground up,
explaining why things are the way they are and how the API was built up.
*** https://github.com/universe-proton/universe-topology/issues/12

** The Linux Programming Interface
Has a full chapter on I/O multiplexing
*** http://man7.org/linux/man-pages/man7/epoll.7.html

** Linux System Programming / Robert Love
Has a full chapter on Event Poll
** Epoll internals
*** https://idndx.com/tag/epoll/
** Cindy Sridharan
*** https://medium.com/@copyconstruct/the-method-to-epolls-madness-d9d2d6378642
*** https://medium.com/@copyconstruct/nonblocking-i-o-99948ad7c957

** Julia Evans
*** https://jvns.ca/blog/2017/06/03/async-io-on-linux--select--poll--and-epoll/

** ebpf and python
*** https://lwn.net/Articles/753601/
*** https://www.collabora.com/news-and-blog/blog/2019/05/14/an-ebpf-overview-part-5-tracing-user-processes/

** Tentative outline
*** Explore what PYTHONASYNCIODEBUG gets me 

*** Why Do this?
Talk about how asyncio can be used everywhere; also needs a reference
to the nodejs event loop. Talk about how C works.

** Figure out how Python's asyncio implementations work

** Talk about how the kernel supports select/poll

** Notes on how to actually read the C code for anything