Fix CI to trigger benchmarks on `run-codspeed-benchmarks` label addition
Reduce scope of async benchmark to save time on CI
Waiting to merge this PR until we figure out how to use walltime on
local runners.
The first in a sequence of PRs focusing on improving performance in
core. We're starting with reducing import times for common structures,
hence the benchmarks here.
The benchmark looks a little bit complicated - we have to use a process
so that we don't suffer from Python's import caching system. I tried
doing manual modification of `sys.modules` between runs, but that's
pretty tricky / hacky to get right, hence the subprocess approach.
Motivated by extremely slow baseline for common imports (we're talking
2-5 seconds):
<img width="633" alt="Screenshot 2025-04-09 at 12 48 12 PM"
src="https://github.com/user-attachments/assets/994616fe-1798-404d-bcbe-48ad0eb8a9a0"
/>
Also added a `make benchmark` command to make local runs easy :).
Currently using walltimes so that we can track total time despite using
a manual proces.