# Profiling
This page documents some basic profiling of digwatch that I did in March 2016.
## Methodology
I used the Phoronix Test Suite to run some benchmarks. For a given benchmark, I did four runs (a sketch of the invocations follows the list):
- Baseline: only the test suite running
- Scap-open: the `scap-open` example program (raw event capture, no processing) while the test suite runs
- Sysdig: `sysdig -N proc.pid=0` while the test suite runs (the filter matches essentially nothing, so every event is captured and filtered but none are printed)
- Digwatch: `digwatch` running with a reasonably sized ruleset. Most events do *not* match the ruleset, which means that most of the digwatch-compiled filter is evaluated for most events.
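Concretely, each run paired one of the monitors above with a Phoronix benchmark. A minimal sketch (process management simplified; the benchmark name is one of the profiles in the results below):
```
# baseline: only the test suite running
phoronix-test-suite benchmark pts/nginx

# sysdig run: capture and filter every event, print none
# (proc.pid=0 matches essentially nothing)
sysdig -N proc.pid=0 &
phoronix-test-suite benchmark pts/nginx
kill %1
```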
The full ruleset is [here](https://github.com/draios/digwatch/blob/972c84707fd7a786214cc0fbb4a8b753a0488ebb/rules/base.txt). It comprises 18 rules. Many of these rules have just a few expressions, but some are very long. In particular, one rule uses `system_binaries`, which expands to something like `proc.name in (truncate, sha1sum, numfmt, fmt, fold, uniq,...`, enumerating over one hundred names. The full rule is `fd.sockfamily = ip and system_binaries`, which means that for every network read/write event, the process name is compared against this list of 100+ names.
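To make the structure concrete, here is an illustrative, abbreviated sketch of the macro and the rule; the authoritative format and the full name list are in the linked base.txt:
```
# macro expanding to a list of 100+ process names (abbreviated here)
system_binaries: proc.name in (truncate, sha1sum, numfmt, fmt, fold, uniq, ...)

# the rule: every ip socket event triggers a comparison against the full list
fd.sockfamily = ip and system_binaries
```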
## Summary of results
For three of the benchmarks (apache, nginx, sqlite), digwatch shows little or no performance degradation compared to sysdig. But two of these (apache and nginx) show a significant degradation of both sysdig and digwatch compared to baseline.
For the redis benchmark, digwatch _does_ show a significant degradation compared to sysdig. My guess was that the `system_binaries` rule (which compiles to a chain of 100+ tests like `proc.name = truncate OR proc.name = ls OR ...`) was responsible, but that turned out not to be the case: running the same ruleset with the `fd.sockfamily = ip and system_binaries` rule disabled did not change the result. (As is so often true with performance, the obvious explanation was not the right one.)
So I tried running digwatch with various subsets of the rules, and found a roughly linear improvement as I removed more and more rules; there wasn't any single rule that explained the dropoff. Removing the first two-thirds of the rules made no difference, but each rule removed after that improved performance by on the order of 5-10%, until sysdig-level performance was reached. This corresponds to digwatch's CPU usage going from a 100% flatline down to the 30-50% range.
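A hedged sketch of that bisection (assumptions: rules in base.txt are roughly one per line, and digwatch takes the rules file as shown; the real invocation may differ):
```
# keep only the first 12 lines of the ruleset (illustrative split point;
# in practice whole rules were removed, keeping macro definitions intact)
head -n 12 rules/base.txt > rules/subset.txt

# run digwatch on the reduced ruleset while benchmarking
digwatch rules/subset.txt &
phoronix-test-suite benchmark pts/redis
kill %1
```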
Profile of system calls for nginx (about 1-1.5M events per second per CPU, estimated from the kernel logs produced by loading the driver with `insmod verbose=1`):
```
# Calls Syscall
--------------------------------------------------------------------------------
2935528 gettimeofday
2879033 epoll_ctl
1223131 close
1217542 write
1120296 recvfrom
909506 epoll_wait
906429 read
803527 fcntl
799420 connect
```
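For reference, the event-rate estimate above came from the driver's verbose logging. A hedged way to reproduce it (the module name/path is an assumption based on sysdig's kernel probe):
```
# load sysdig's kernel probe with verbose logging enabled,
# then read the event counts it prints to the kernel log
sudo insmod sysdig-probe.ko verbose=1
dmesg | tail
```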
## Environment
I ran all these tests on an m3.large EC2 instance (2 vCPU, 6.5 ECU, 7.5 GB RAM).
## Test results
### pts/nginx
Description: This is a test of `ab`, the Apache benchmark program. This test profile measures how many requests per second a given system can sustain when carrying out 500,000 requests, with 100 requests being carried out concurrently.
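That corresponds roughly to an `ab` invocation like the following (the URL is a placeholder; the actual target is whatever nginx instance the test profile starts):
```
# 500,000 total requests, 100 in flight at a time
ab -n 500000 -c 100 http://localhost/
```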
#### baseline
```
NGINX Benchmark 1.0.11:
pts/nginx-1.1.0
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 5 Minutes
Running Pre-Test Script @ 19:27:18
Started Run 1 @ 19:27:23
Started Run 2 @ 19:28:00
Started Run 3 @ 19:28:37 [Std. Dev: 0.40%]
Running Post-Test Script @ 19:29:11
Test Results:
14543.39
14579.73
14658.75
Average: 14593.96 Requests Per Second
```
#### scap-open
```
NGINX Benchmark 1.0.11:
pts/nginx-1.1.0
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 5 Minutes
Running Pre-Test Script @ 19:05:52
Started Run 1 @ 19:05:57
Started Run 2 @ 19:06:43
Started Run 3 @ 19:07:30 [Std. Dev: 0.38%]
Running Post-Test Script @ 19:08:14
Test Results:
11429.92
11405.57
11490.17
Average: 11441.89 Requests Per Second
```
#### sysdig -N
```
NGINX Benchmark 1.0.11:
pts/nginx-1.1.0
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 5 Minutes
Running Pre-Test Script @ 18:43:42
Started Run 1 @ 18:43:47
Started Run 2 @ 18:44:38
Started Run 3 @ 18:45:30 [Std. Dev: 0.12%]
Running Post-Test Script @ 18:46:19
Test Results:
10155.7
10180.53
10170.71
Average: 10168.98 Requests Per Second
```
#### digwatch
```
NGINX Benchmark 1.0.11:
pts/nginx-1.1.0
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 4 Minutes
Running Pre-Test Script @ 03:35:39
Started Run 1 @ 03:35:44
Started Run 2 @ 03:36:58
Started Run 3 @ 03:38:12 [Std. Dev: 0.64%]
Running Post-Test Script @ 03:39:25
Test Results:
6977.67
6986.85
6905.94
Average: 6956.82 Requests Per Second
```