Desktop version

Home arrow Computer Science arrow Designing Data-Intensive Applications. The Big Ideas Behind Reliable, Scalable and Maintainable Systems

Simple Log Analysis

Various tools can take these log files and produce pretty reports about your website traffic, but for the sake of exercise, let’s build our own, using basic Unix tools. For example, say you want to find the five most popular pages on your website. You can do this in a Unix shell as follows:[1]

cat /var/log/nginx/access.log | О awk '{print $7}' | sort |

uniq -c |

sort -r -n |

head -n 5 О

О Read the log file.

О Split each line into fields by whitespace, and output only the seventh such field from each line, which happens to be the requested URL. In our example line, this request URL is /css/typography.css.

© Alphabetically sort the list of requested URLs. If some URL has been requested n times, then after sorting, the file contains the same URL repeated n times in a row.

О The uniq command filters out repeated lines in its input by checking whether two adjacent lines are the same. The -c option tells it to also output a counter: for every distinct URL, it reports how many times that URL appeared in the input.

© The second sort sorts by the number (-n) at the start of each line, which is the number of times the URL was requested. It then returns the results in reverse (-r) order, i.e. with the largest number first.

© Finally, head outputs just the first five lines (-n 5) of input, and discards the rest. The output of that series of commands looks something like this:

  • 4189 /favicon.ico
  • 3631 /2013/05/24/improving-security-of-ssh-private-keys.htmt
  • 2124 /2012/12/05/schema-evotution-in-avro-protocot-buffers-thrift.html
  • 1369 /
  • 915 /css/typography.css

Although the preceding command line likely looks a bit obscure if you’re unfamiliar with Unix tools, it is incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you can easily modify the analysis to suit your needs. For example, if you want to omit CSS files from the report, change the awk argument to '$7 !~ /.css$/ {print $7}'. If you want to count top client IP addresses instead of top pages, change the awk argument to '{print $1}'. And so on.

We don’t have space in this book to explore Unix tools in detail, but they are very much worth learning about. Surprisingly many data analyses can be done in a few minutes using some combination of awk, sed, grep, sort, uniq, and xargs, and they perform surprisingly well [8].

  • [1] Some people love to point out that cat is unnecessary here, as the input file could be given directly as anargument to awk. However, the linear pipeline is more apparent when written like this.
< Prev   CONTENTS   Source   Next >

Related topics