Apache 304

  

I'm running an Apache web server, serving up large static images off the file system. I've configured the server to return a valid cache header, like so:

Header merge Cache-Control "public, max-age=31536000, immutable"

That's all well and good, but when the user reloads the image in the browser, Apache never returns a 304 Not Modified response:

HTTP/1.1 200 OK
Date: Sun, 23 Jun 2013 17:39:47 GMT
Server: Apache/2.5.0-dev (Unix) OpenSSL/1.0.1c
Last-Modified: Wed, 23 Nov 2011 20:04:58 GMT
ETag: "58c5-4b26c6f28ce80-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 206
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html

If a 304 response indicates an entity not currently cached, then the cache MUST disregard the response and repeat the request without the conditional. If a cache uses a received 304 response to update a cache entry, the cache MUST update the entry to reflect any new field values given in the response.

Nice to see 304s getting an airing. But I am curious to know what you want Apache to do with a 410 that it can't already do. Rather an irritating comment system, btw - won't let me use id or class attributes. Posted by Danny at 5:17AM

AFAIK, Last-Modified is an HTTP 1.0 header and ETag is an HTTP 1.1 one. The 304 response is the result of a matching If-None-Match or If-Modified-Since header in the client request. What's happening here is that your server is sending out one or both of those headers (Last-Modified and ETag) with its original response.

There are many different packages that allow you to generate reports on who's visiting your site and what they're doing. The most popular at this time appear to be 'Analog', 'The Webalizer' and 'AWStats', which are installed by default on many shared servers.

While such programs generate attractive reports, they only scratch the surface of what the log files can tell you. In this section we look at ways you can delve more deeply - focussing on the use of simple command line tools, particularly grep, awk and sed.

Combined log format

The following assumes an Apache HTTP Server combined log format where each entry in the log file contains the following information:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

where:

%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the server finished processing the request
%r = Request line from the client in double quotes
%>s = Status code that the server sends back to the client
%b = Size of the object returned to the client

The final two items, Referer and User-agent, give details on where the request originated and what type of agent made the request.

Sample log entries:

66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"

Note: The robots.txt file gives instructions to robots as to which parts of your site they are allowed to index. A request for / is a request for the default index page, normally index.html.

Using awk

The principal use of awk is to break up each line of a file into 'fields' or 'columns' using a pre-defined separator. Because each line of the log file is based on the standard format we can do many things quite easily.

Using the default separator, which is any white-space (spaces or tabs), we get the following:

awk '{print $1}' combined_log    # ip address (%h)
awk '{print $2}' combined_log    # RFC 1413 identity (%l)
awk '{print $3}' combined_log    # userid (%u)
awk '{print $4,$5}' combined_log # date/time (%t)
awk '{print $9}' combined_log    # status code (%>s)
awk '{print $10}' combined_log   # size (%b)

You might notice that we've missed out some items. To get to them we need to set the delimiter to the " character, which changes the way the lines are 'exploded' and allows the following:

awk -F\" '{print $2}' combined_log # request line (%r)
awk -F\" '{print $4}' combined_log # referer
awk -F\" '{print $6}' combined_log # user agent

Now that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples.

Examples

You want to list all user agents ordered by the number of times they appear (descending order):

awk -F\" '{print $6}' combined_log | sort | uniq -c | sort -fr

All we're doing here is extracting the user agent field from the log file and 'piping' it through some other commands. The first sort is to enable uniq to properly identify and count unique user agents. The final sort orders the result by number and name (both descending).

The result will look similar to a user agents report generated by one of the above-mentioned packages. The difference is that you can generate this ANY time from ANY log file or files.
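To see what the sort | uniq -c | sort stages actually do, you can run the same pipeline on a few throwaway lines (the letters here simply stand in for user agent strings):

```shell
# Three 'a' entries, two 'b', one 'c': sort groups identical lines,
# uniq -c counts each group, and the final sort orders by count.
printf 'a\nb\na\na\nb\nc\n' | sort | uniq -c | sort -fr
```

This prints the counts 3 a, 2 b, 1 c in descending order, exactly the shape of the user agent report above.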

If you're not particularly interested in which operating system the visitor is using, or what browser extensions they have, then you can use something like the following:

awk -F\" '{print $6}' combined_log | \
  sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/' | sort | uniq -c | sort -fr

Note: The \ at the end of a line simply indicates that the command will continue on the next line.

This will strip out the third and subsequent values in the 'bracketed' component of the user agent string, leaving just the first two.
The next step is to start filtering the output so you can narrow down on a certain page or referer. Would you like to know which pages Google has been requesting from your site?

awk -F\" '($6 ~ /Googlebot/){print $2}' combined_log | awk '{print $2}'

Or who's been looking at your guestbook?

awk -F\" '($2 ~ /guestbook\.html/){print $6}' combined_log

It's just too easy, isn't it!

Using just the examples above you can already generate your own reports to back up any kind of automated reporting your ISP provides. You could even write your own log analysis program.

Using log files to identify problems with your site

The steps outlined below will let you identify problems with your site by identifying the different server responses and the requests that caused them:

awk '{print $9}' combined_log | sort | uniq -c | sort

The output shows how many of each type of request your site is getting. A 'normal' request results in a 200 code, which means a page or file has been requested and delivered, but there are many other possibilities.
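Using the two sample log entries shown earlier, the pipeline behaves like this (writing them to a scratch file first):

```shell
# Recreate the two sample combined-log entries from above
cat > sample_log <<'EOF'
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
EOF

# Count each status code - both requests returned 200
awk '{print $9}' sample_log | sort | uniq -c | sort
```

which prints a single line counting two 200 responses.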

The most common responses are:

200 - OK
206 - Partial Content
301 - Moved Permanently
302 - Found
304 - Not Modified
401 - Unauthorised (password required)
403 - Forbidden
404 - Not Found

Note: For more on Status Codes you can read the article HTTP Server Status Codes.

A 301 or 302 code means that the request has been redirected. What you'd like to see, if you're concerned about bandwidth usage, is a lot of 304 responses - meaning that the file didn't have to be delivered because the requester already had a cached version.

A 404 code may indicate that you have a problem - a broken internal link or someone linking to a page that no longer exists. You might need to fix the link, contact the site with the broken link, or set up a PURL so that the link can work again.

The next step is to identify which pages/files are generating the different codes. The following command will summarise the 404 ('Not Found') requests:

# list all 404 requests
awk '($9 ~ /404/)' combined_log

# summarise 404 requests
awk '($9 ~ /404/)' combined_log | awk '{print $9,$7}' | sort

Or, you can use an inverted regular expression to summarise the requests that didn't return 200 ('OK'):

awk '($9 !~ /200/)' combined_log | awk '{print $9,$7}' | sort | uniq

Or, you can include (or exclude in this case) a range of responses - in this case requests that returned 200 ('OK') or 304 ('Not Modified'):

awk '($9 !~ /200|304/)' combined_log | awk '{print $9,$7}' | sort | uniq

Suppose you've identified a link that's generating a lot of 404 errors. Let's see where the requests are coming from:

awk -F\" '($2 ~ "^GET /path/to/brokenlink\.html"){print $4,$6}' combined_log

Now you can see not just the referer, but the user agent making the request. You should be able to identify whether there is a broken link within your site, on an external site, or if a search engine or similar agent has an invalid address.

If you can't fix the link, you should look at using Apache mod_rewrite or a similar scheme to redirect (301) the requests to the most appropriate page on your site. By using a 301 instead of a normal (302) redirect you are indicating to search engines and other intelligent agents that they need to update their link as the content has 'Moved Permanently'.
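A minimal mod_rewrite sketch of such a 301 redirect might look like the following (the file names here are hypothetical, not from the article):

```apacheconf
RewriteEngine On
# Send requests for the retired page to its replacement, permanently
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]
```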

Who's 'hotlinking' my images?

Something that really annoys some people is when other websites link directly to their images, using up their bandwidth.

Here's how you can see who's doing this to your site. Just change www.example.net to your domain, and combined_log to your combined log file.

awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.example\.net/){print $4}' combined_log \
  | sort | uniq -c | sort

Translation:

  • explode each row using the " separator;
  • the request line (%r) must contain '.jpg' or '.gif';
  • the referer must not start with your website address (www.example.net in this example);
  • display the referer and summarise.

You can block hot-linking using mod_rewrite, but that can also result in blocking various search engine result pages, caches and online translation software. To see if this is happening, we look for 403 ('Forbidden') errors in the image requests:

# list image requests that returned 403 Forbidden
awk '($9 ~ /403/)' combined_log | awk -F\" '($2 ~ /\.(jpg|gif)/){print $4}' | sort | uniq -c | sort

Translation:

  • the status code (%>s) is 403 Forbidden;
  • the request line (%r) contains '.jpg' or '.gif';
  • display the referer and summarise.

You might notice that the above command is simply a combination of the previous one and one presented earlier. It is necessary to call awk more than once because the 'referer' field is only available after the separator is set to ", whereas the 'status code' is available directly.
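For reference, the hot-link blocking itself can be sketched with mod_rewrite along these lines (www.example.net again stands in for your own domain; this is an illustration, not the article's own ruleset):

```apacheconf
RewriteEngine On
# Allow empty referers (direct requests, some proxies and caches)
RewriteCond %{HTTP_REFERER} !^$
# Forbid image requests whose referer is not this site
RewriteCond %{HTTP_REFERER} !^http://www\.example\.net/ [NC]
RewriteRule \.(jpg|gif)$ - [F]
```

This is exactly the kind of rule whose side effects the 403 report above will reveal.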

Blank User Agents

A 'blank' user agent is typically an indication that the request is from an automated script or someone who really values their privacy. The following command will give you a list of IP addresses for those user agents, so you can decide if any need to be blocked:

awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq

A further pipe through logresolve will give you the hostnames of those addresses.


Learn how to set up custom error pages in Apache. The Apache web server provides a default set of generic error pages for 404, 500, and other common Apache errors.

However, creating custom error pages allows you to:


  • Continue your branding on these pages
  • Integrate their design into the look and feel of your website
  • Direct lost visitors to their intended destinations
  • Provide error pages in languages other than English

Requirements

  • Cloud Server running Linux (CentOS 7 or Ubuntu 14.04)
  • Apache installed and running


Create the Custom Error Page

First, you will need to create the custom error page. For testing purposes, we will create an example error page to handle 404 errors.

Use SSH to connect to your server and go to your website's document root. Create a new page named my-404.html with the command:
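The command itself did not survive extraction; one way to create the page, assuming a simple placeholder (the markup here is illustrative):

```shell
# Create a minimal my-404.html in the document root
cat > my-404.html <<'EOF'
<html>
  <head><title>Page not found</title></head>
  <body><h1>404 - Page not found</h1></body>
</html>
EOF
```

You could equally open the file in nano or any other editor and paste your own markup.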

Save and exit the file.


You can view the file by going to http://example.com/my-404.html to make sure it is displaying correctly.

Configure Apache to Use the Custom Error Page

To tell Apache to use a custom error page, you will need to add an ErrorDocument directive. The syntax for this directive is:
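The syntax block was lost in extraction; per the Apache documentation, the directive takes an error code and a document (here a local URL-path):

```apacheconf
ErrorDocument <error-code> <document>
```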

For this example, since the my-404.html file is in the site's document root, we will be adding the directive:
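The directive shown in the original was lost; given the file name created above, it would be:

```apacheconf
ErrorDocument 404 /my-404.html
```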

This directive needs to go inside the VirtualHost command block in the site's main Apache configuration file.

By common convention, this Apache configuration file is usually:

  • CentOS 7: /etc/httpd/conf.d/example.com.conf
  • Ubuntu 14.04: /etc/apache2/sites-available/example.com.conf

The location and filename of a site's Apache configuration file can vary based on how you or your server administrator has set up hosting.

Edit this file with your editor of choice, for example with the command:

  • CentOS 7: sudo nano /etc/httpd/conf.d/example.com.conf
  • Ubuntu 14.04: sudo nano /etc/apache2/sites-available/example.com.conf

Scroll through the file until you find the VirtualHost command block, which will look like:
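The example block did not survive extraction; a typical VirtualHost for this site would look something like this (the domain and paths are illustrative):

```apacheconf
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example.com
    ...
</VirtualHost>
```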

Add the ErrorDocument to the VirtualHost command block, but be sure to put it outside any Directory command blocks. For example:
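The article's example was also lost; with the directive added, the block would look something like this (again with illustrative paths):

```apacheconf
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example.com

    # Custom 404 page - outside any <Directory> block
    ErrorDocument 404 /my-404.html

    <Directory /var/www/example.com>
        AllowOverride All
    </Directory>
</VirtualHost>
```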

Save and exit the file, then restart Apache for the changes to take effect:

  • CentOS 7: sudo systemctl restart httpd
  • Ubuntu 14.04: sudo service apache2 restart


Finally, test your error document by going to an invalid URL on your website. You should see your new custom 404 page instead of the generic one.

Other HTTP Error Codes

The most common custom error page is for a 404 error. However, you may want to create custom error pages for other Apache errors as well.

These pages can be configured for any 4xx or 5xx error code. A full list of these HTTP error codes can be found on Wikipedia.
