Alexander Hanff writes:
So today I was working on some code for a new web site I am about to launch for one of my privacy projects. I wanted a way to log some statistics about my site visitors without retaining any information which might be considered private or identifying, or which could be used to track them; these statistics are important for attracting sponsors. As a rule, I always disable logging of everything apart from the date/time, requested page and result (whether or not the page was retrieved successfully) in Apache's access log - but this is a little cumbersome to navigate and turn into meaningful information. So I decided to save some statistical data to a database which I can then access and display in a number of useful ways, such as tables and charts. I also wanted to know where my users are coming from without retaining their IP address - so I installed a module for Apache called GeoIP, which allows me to see which country a visitor is coming from based on their IP address, without actually having to store the IP address.
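A minimal sketch of this kind of privacy-preserving logging, written in Python with SQLite. It is not the author's actual script: the table name `visits`, the column names, and the function names are all assumptions made for illustration, and it assumes the country code has already been resolved upstream (for example by the GeoIP module), so the IP address never reaches the storage code.

```python
import sqlite3
from datetime import datetime, timezone


def open_stats_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the statistics database.

    The schema is hypothetical: it has no column for an IP address
    or a User Agent string, so neither can be stored by accident.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts TEXT NOT NULL, "
        "page TEXT NOT NULL, "
        "country TEXT NOT NULL)"
    )
    return conn


def record_visit(conn: sqlite3.Connection, page: str, country: str) -> None:
    """Store one visit: timestamp, requested page and country code only."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO visits (ts, page, country) VALUES (?, ?, ?)",
        (ts, page, country),
    )
    conn.commit()


if __name__ == "__main__":
    db = open_stats_db()
    record_visit(db, "stats.php?ref=twitter", "PL")
    for row in db.execute("SELECT id, ts, page, country FROM visits"):
        print(" » ".join(str(field) for field in row))
```

Keeping the IP address out of the function signature altogether, rather than accepting it and discarding it, is a deliberate choice in this sketch: the storage layer cannot leak data it never receives.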
I have my test server set up on my local network which is not addressable from the outside world and therefore GeoIP doesn't work correctly - so I uploaded my script to my production web server and set up a temporary web site in Apache to check that the correct data was being saved to the database. As this was just a temporary web site for my own testing purposes with no links publicly known, I didn't bother changing the default Apache log configuration, which means the log was capturing all the usual data including User Agent string, IP Address and much more.
I then sent myself the following text link via a DM in TweetDeck (I have multiple Twitter accounts):
The purpose of the ref string at the end of the URL was to test that it was recorded in the database because I need to store some log data relating to my sponsors. The output of my script looks like this:
8 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
9 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
10 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
11 » 2013-08-25 19:44:14 » stats.php?ref=twitter » PL
12 » 2013-08-25 19:44:14 » stats.php?ref=twitter » US
13 » 2013-08-25 19:45:06 » stats.php?ref=twitter » PL
As you can see there is nothing within the results which could be considered as harmful to privacy - which is exactly the point. The following is a description of the data:
RowID » TimeStamp » Requested Page » Country Code
The results above are the actual results from my database. My country code is PL, so rows 11 and 13 were visits from my computer; rows 8, 9, 10 and 12, however, all have the country code US. Rows 8 - 10 were created immediately after I sent the DM to my Twitter account from TweetDeck.
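The rows above can be split on the "»" separator and grouped by country code to make the pattern easier to see. The following is a small Python sketch that uses the six rows exactly as shown; the function and variable names are illustrative assumptions, not part of the original script.

```python
from collections import Counter

# The six rows of script output, copied from the listing above.
ROWS = """\
8 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
9 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
10 » 2013-08-25 19:44:07 » stats.php?ref=twitter » US
11 » 2013-08-25 19:44:14 » stats.php?ref=twitter » PL
12 » 2013-08-25 19:44:14 » stats.php?ref=twitter » US
13 » 2013-08-25 19:45:06 » stats.php?ref=twitter » PL
"""


def parse_row(line: str) -> tuple[int, str, str, str]:
    """Split 'RowID » TimeStamp » Requested Page » Country Code' into fields."""
    row_id, timestamp, page, country = (part.strip() for part in line.split("»"))
    return int(row_id), timestamp, page, country


visits = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Number of visits per country code.
by_country = Counter(country for _, _, _, country in visits)

# Row IDs of the visits that did not come from the author's country (PL).
unexpected_ids = [row_id for row_id, _, _, country in visits if country != "PL"]

if __name__ == "__main__":
    print(by_country)
    print(unexpected_ids)
```

On this data the tally gives four US visits and two PL visits, and the unexpected rows are 8, 9, 10 and 12, matching the description above.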
I don't yet have the Twitter account I sent the DM to set up in TweetDeck, but I did have the account open in one of my web browsers, so I went to the page and clicked on the URL in the DM (row 11 in the list above). When I saw the results I was confused as to why there were US entries in the database, since the URL was private and had never been made public. So I checked my Apache access logs, and there I could see that for each of the US rows, one of Twitter's servers using IP 220.127.116.11 and identifying itself as Twitterbot/1.0 had sent a GET request to the URL (including the extra URL parameter "?ref=twitter").
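A check like the one described above can be sketched in Python by parsing access log lines and filtering on the User Agent field. This assumes the log uses Apache's "combined" format, which the post does not state explicitly. The two sample lines are invented for illustration and use IP addresses reserved for documentation; they are not taken from the author's logs.

```python
import re

# Apache "combined" log format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Invented sample lines, NOT real log data.
SAMPLE_LOG = [
    '203.0.113.5 - - [25/Aug/2013:19:44:07 +0000] '
    '"GET /stats.php?ref=twitter HTTP/1.1" 200 512 "-" "Twitterbot/1.0"',
    '198.51.100.7 - - [25/Aug/2013:19:44:14 +0000] '
    '"GET /stats.php?ref=twitter HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]


def requests_by_agent(lines, agent_substring):
    """Return (host, request) pairs whose User Agent contains the substring."""
    hits = []
    for line in lines:
        match = COMBINED.match(line)
        if match and agent_substring in match.group("agent"):
            hits.append((match.group("host"), match.group("request")))
    return hits


if __name__ == "__main__":
    for host, request in requests_by_agent(SAMPLE_LOG, "Twitterbot"):
        print(host, request)
```

Lines that do not match the assumed format are skipped silently here; a stricter version might count or report them instead.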
Many people might think, "So what's the problem? They just visited your web site," right?