SitePerformanceMonitoringTools
What (summary)
Instrumentation that provides a history of performance statistics for each part of the page load pipeline.
Why this is important
The responsiveness and performance of the site make a big difference in how many pages visitors view and how often they come back. A poorly performing site will also wear out our active members, causing some of them to leave.
DoneDone
- A dashboard showing Red, Yellow or Green for each pipeline item.
- The history for each pipeline item stored in the database.
- Definitions of acceptable benchmarks for each pipeline item.
- Definitions of acceptable freshness for each pipeline item.
- All monitoring meets the definitions for minimum freshness.
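The dashboard rule implied by the list above (a color per pipeline item, driven by benchmark and freshness definitions) could be sketched roughly as follows. The function name, the two-threshold scheme, and the choice to show a stale measurement as Red are all assumptions made for illustration; the page does not define them.

```shell
# Classify one pipeline item as Red, Yellow or Green.
# All names and thresholds here are hypothetical, integers only:
#   $1  latest measurement in milliseconds
#   $2  benchmark above which the item turns Yellow
#   $3  benchmark above which the item turns Red
#   $4  age of the measurement in seconds
#   $5  acceptable freshness in seconds
pipeline_status() {
    value_ms=$1
    warn_ms=$2
    crit_ms=$3
    age_s=$4
    max_age_s=$5
    # Assumption: a measurement older than the freshness limit is shown as Red.
    if [ "$age_s" -gt "$max_age_s" ]; then
        echo Red
    elif [ "$value_ms" -gt "$crit_ms" ]; then
        echo Red
    elif [ "$value_ms" -gt "$warn_ms" ]; then
        echo Yellow
    else
        echo Green
    fi
}
```

For example, `pipeline_status 250 200 500 30 300` classifies a 250 ms measurement taken 30 seconds ago against a 200 ms warning and 500 ms critical benchmark.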
Performance Priorities
- View normal page
- View random page
- Edit click until available
- Save click until rendered
- Render invalidated frontpage
Instrumentation Steps
- End to end on each
  - Deploy instrumentation boxes in various locations
- Determine and instrument the pieces
  - MediaWiki profiling
  - Raw database queries
Steps to DoneDone
- Articulate the request pipeline
  - Identify each request in the pipeline
  - How to perform / retrieve data for each request
- Aggregate pipeline benchmarks
  - Tuesday ... Implement the probes
  - Wednesday portland devs ... Articulate database schema
  - Wednesday portland devs ... Push XML results to central HTTPS server
  - Wednesday portland devs ... Remote location / benchmark results stored in database
    - Project revised for local monitoring only, using Zabbix. A service on nimbus communicates with agents installed on each server, and the results are stored in MySQL.
- Integrate into monitoring
  - Dashboard to identify overall health
  - Notifications via email/paging when a critical problem arises
    - Available at https://admin.aboutus.org/zabbix/
- Analyze pipeline benchmarks
  - XML output for monitoring integration
  - Graph performance for each location
  - Detailed graph view on each request
    - Information available at http://www.aboutus.org/AboutUsPerformanceMonitoring/
- Define acceptable benchmarks for each request
Repositories
- cd into the directory you want to check the client out into
- git clone nimbus:/opt/git/geophone-client
- make a few changes in that directory
- git status ... to see what is different
- git diff ... to go through the change and make sure you aren't including something accidentally
- git add YYYY ... to include modified or new file YYYY in the commit
- git commit -m 'Here is why I made the changes I did for this commit'
- git push ... to make sure that the remote repository has your changes
- gitk ... from within the directory shows the tree of revisions
Pipeline
DNS request - www.aboutus.org & images.aboutus.org
- Local resolver / cache
Queries against the resolvers at the remote location provide little insight into the health of the www.aboutus.org site. If the record is not in the local resolver cache (or its TTL has expired), the resolver works down from the DNS root servers to the authoritative servers. If the record is already cached, the resolver responds immediately. If the local resolver does not reply as expected, the issue likely lies with the remote location or possibly the authoritative name servers (or somewhere in between).
- Authoritative name server
ns1.dnscloud.com ns2.dnscloud.com
Response time of the authoritative servers is critical. It can be measured from any location, though network latency and connectivity will affect the result.
- dig www.aboutus.org @ns1.dnscloud.com
- dig www.aboutus.org @ns2.dnscloud.com
- dig images.aboutus.org @ns1.dnscloud.com
- dig images.aboutus.org @ns2.dnscloud.com
Results
- www.aboutus.org
  - ns1.dnscloud.com
    - IP Address
    - Query time (ms)
  - ns2.dnscloud.com
    - IP Address
    - Query time (ms)
- images.aboutus.org
  - ns1.dnscloud.com
    - IP Address
    - Query time (ms)
  - ns2.dnscloud.com
    - IP Address
    - Query time (ms)
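The two result fields listed above can be pulled out of ordinary dig output with a small filter. This is only a sketch: the function name parse_dig is made up for illustration, and it assumes the usual dig layout, where answer lines read "name TTL IN A address" and the statistics line reads ";; Query time: N msec". It takes the first A record it sees, so a response with no answer but with A records in another section would be misread.

```shell
# Read full dig output on stdin and print "<ip> <query_ms>".
# parse_dig is a hypothetical helper name, not part of any existing tool.
parse_dig() {
    awk '
        # First A record: name TTL IN A address
        $3 == "IN" && $4 == "A" && ip == "" { ip = $5 }
        # Statistics line: ;; Query time: 23 msec
        /^;; Query time:/ { ms = $4 }
        END { print ip, ms }
    '
}
```

A probe would then run something like `dig www.aboutus.org @ns1.dnscloud.com | parse_dig` once per name and name server and store the two values.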
IP Connectivity
Network connectivity and latency can be measured using the ping and traceroute utilities. Most connectivity issues will be caused by network problems between the two locations, over which we have no control. In some cases the cause could be a router, switch, or load-balancer problem on the AboutUs side, but such problems will affect all remote locations.
- ping -c 5 -i 0.2 -q www.aboutus.org
- traceroute -n www.aboutus.org
Results - Five ICMP packets
- packet loss (%)
- average response time (ms)
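The packet loss and average response time can be read from the summary that ping -q prints. The sketch below assumes the common summary format ("N% packet loss" on one line and "min/avg/max/... = a/b/c/d ms" on the next); parse_ping is a hypothetical name, and "NA" is an arbitrary placeholder chosen here for a value that is missing, for instance when every packet is lost and no timing line is printed.

```shell
# Read ping -q output on stdin and print "<loss_percent> <avg_ms>".
# parse_ping is a hypothetical helper name.
parse_ping() {
    awk '
        /packet loss/ {
            # The loss figure is the field ending in a percent sign.
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /min\/avg\/max/ {
            # Values follow the equals sign, separated by slashes.
            split($0, halves, "=")
            split(halves[2], rtt, "/")
            avg = rtt[2]
        }
        END {
            if (loss == "") loss = "NA"
            if (avg == "") avg = "NA"
            print loss, avg
        }
    '
}
```

Used with the command above, this would be `ping -c 5 -i 0.2 -q www.aboutus.org | parse_ping`.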
HTTP Frontpage
/index.php and requisite pages
Response time of the frontpage request is critical. Performance relies on a number of factors.
- Physical server load
  - CPU
  - Disk I/O
  - Available memory: too little memory causes swapping, which degrades disk I/O performance
  - Network throughput
- Apache process performance
  - CPU usage
  - Available threads
- Memcached
  - Down cache / timeout
- Database query
  - ?Query for each /index.php request?
  - physical DB (slave) server loads
  - replication
  - MySQL performance
- DNS request - images.aboutus.org
- Image GET request
  - image size
  - number of images per page
  - NFS server load (disk I/O)
  - network throughput
- curl --silent --output /dev/null --write-out '%{time_total}\n' http://www.aboutus.org/index.php
- Useful --write-out variables: http_code time_total time_namelookup size_download speed_download
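To record all of the variables listed above in one request, the format string can name each of them. The sketch below is one possible arrangement, not an established probe: the names CURL_FORMAT, probe_url and to_request_ms are invented for illustration, and the field order is an arbitrary choice. curl reports times in seconds, so the second helper converts the total to milliseconds to match the "Request time (ms)" results recorded elsewhere on this page.

```shell
# Field order is an assumption made for this sketch:
# http_code, DNS lookup time, total time, bytes downloaded, download speed.
CURL_FORMAT='%{http_code} %{time_namelookup} %{time_total} %{size_download} %{speed_download}\n'

# Fetch a URL, discard the body, and print one line of measurements.
probe_url() {
    curl --silent --output /dev/null --write-out "$CURL_FORMAT" "$1"
}

# Reduce one probe line to "<http_code> <request_ms>".
# Adding 0.5 before the integer conversion rounds to the nearest millisecond.
to_request_ms() {
    awk '{ printf "%s %d\n", $1, $3 * 1000 + 0.5 }'
}
```

For example, `probe_url http://www.aboutus.org/index.php | to_request_ms` would print the status code and the request time in milliseconds.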
Acceptable Benchmarks
Max cold-request to fully rendered time for front page
- 0
- curl --location --form wpSave=Save\ page --form wpTextbox1=replacement\ text --form wpEditToken=\\ --form wpEdittime=$edittime http://www.aboutus.org/index.php?title=ObsidiansAnd.com\&action=submit
- Warning: this replaces the entire article body with the text in wpTextbox1
Results
- Request time (ms)
HTTP Render Invalidated
- curl http://www.aboutus.org/Wiki -d action=purge
Results
- Request time (ms)
Potential Hurdles
- False positives
- Caching
Questions
Does the pipeline include all of these? Record a history of how long it takes to:
- Look up DNS for www.aboutus.org, images.aboutus.org, ... from different parts of the world
- Set up a port 80 TCP connection with each of the squal boxes from different parts of the world
- Load the frontpage without any client caching
- Retrieve a memcache item from each combination of two squal boxes (one client, one memcached server)
- Load the core css files
- Load the core js files
...