Standard Website Data Collection: How, Why, Mistakes, Encouragement

In 2018 we’re in the era of privacy, or at least we believe we are. Facebook is the latest to be questioned by the US government. Let that sink in: a website, a social website, questioned by the government. That’s quite insane. However, this is our current world.

More importantly, mainstream media outlets love to bend the truth and make things sound much worse than they are. The data that was “scraped” by third-party companies is very similar, if not identical, to what websites, advertisers, and publishers have been collecting, trading, and selling for years, basically since the Internet existed. I have no sources to cite for this, but if it exists now and is so easily profitable, I can only assume it has been a natural part of the Internet and the web all along. I’m part of the advertisement and publisher cycle myself (through iSnick), but I disclose this on any site with ads, and I must say: I disable “interest-based” advertising. As far as I’m aware, no ads are directly targeted at you (according to the ad service I use). With that said, none of this stops webmasters from obtaining critical, identifying, human-driven data. I want to break down the data that is so easily collected, and how it’s collected. Don’t worry too much about the details; I’ll keep it as simple as possible.

The website’s code

https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default

First, there lie several lines of code generating a webpage for you to view, but almost no HTML page is innocent these days. Pages pull in JavaScript files, style sheets, and much more to proactively counter errors, smooth out human input, draw fluid designs, and even rapidly build new ones. The web browser has evolved from a simple document viewer into a universal application platform, putting knowledge in many fields within reach of millions. Applications are built using web languages, deployed throughout entire enterprises, and updated with a few simple tasks. Employees and clients no longer have to worry about tedious files to download, and possibly configure.
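To make that concrete, here’s a tiny sketch of my own (TypeScript, run from the browser console, not code from any site mentioned here) that lists every external script and stylesheet the page you’re currently viewing pulls in. On an average news site, the list is surprisingly long.

// List every external resource the current page pulls in; run from the browser console.
// Uses only standard DOM APIs (document.querySelectorAll).
function listExternalResources(): void {
  const scripts = Array.from(document.querySelectorAll<HTMLScriptElement>("script[src]"))
    .map((s) => s.src);
  const styles = Array.from(document.querySelectorAll<HTMLLinkElement>('link[rel="stylesheet"]'))
    .map((l) => l.href);

  console.log(`External scripts (${scripts.length}):`, scripts);
  console.log(`External stylesheets (${styles.length}):`, styles);
}

listExternalResources();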

Several mechanisms are used on websites to compile and then display interfaces to you. There are several reasons why webmasters and developers may want to implement tracking, many of which are not well understood by the general web surfer. Certain data helps drive the application’s fundamental protocols and goals, such as your IP address (your computer’s telephone number), web browser type, operating system, and screen resolution. Applications require at least your IP, and some applications require your web browser type. I haven’t run across too many websites making the web browser a hard requirement; however, you’ll find websites hosting large download databases deflect anything but desktop browsers to help counter robots. For example, wget is a terminal-based application for retrieving files. Some websites recognize that a request came from wget and offer up a direct download when it’s used. Others deflect it entirely. So this is one reason: robots. You may be human, but the next ten IPs may not be, and comparing web browser information helps the webmaster determine robotic activity.
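As a rough sketch of how a server can tell wget apart from a desktop browser, the User-Agent header is all it takes. This is my own Node.js/TypeScript illustration, not code from any site discussed here, and real bot detection is far more involved.

// Minimal Node.js/TypeScript sketch: branch on the User-Agent header.
import { createServer } from "http";

const server = createServer((req, res) => {
  const userAgent = (req.headers["user-agent"] ?? "").toLowerCase();

  if (userAgent.includes("wget") || userAgent.includes("curl")) {
    // A command-line client: this site chooses to deflect it entirely.
    res.writeHead(403, { "Content-Type": "text/plain" });
    res.end("Command-line clients are not allowed here.\n");
    return;
  }

  // Looks like a regular browser; serve the normal page.
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<html><body>Welcome, human (probably).</body></html>\n");
});

server.listen(8080);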

Cloudflare Threat Detection

Determining robotic activity sometimes proves difficult when running a high-traffic website. You may find viewers complaining about being banned or having to repeat robot checks several times. Services such as Cloudflare help ensure a website’s safety by offering a layer of bot and denial-of-service protection. Even though these security practices are readily available, you won’t find them on every general website, such as blogs, news sites, or even YouTube (unless you’re dealing with payments). Domain registrars, banks, and similar high-risk websites are more likely to enact these measures. So far we have the following data collected:

  • IP address (general location)
  • Web browser (Chrome, Firefox, etc)
  • Screen resolution (1080p?)

Without diving into too much detail about what your web browser can offer up, here’s a screenshot:

https://www.whatsmybrowser.org/
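If you’d rather see it in code than in a screenshot, here’s a small TypeScript sketch of my own (not whatsmybrowser.org’s actual code) showing the kind of details any page’s JavaScript can read without asking you first:

// A sketch of the browser details any page's JavaScript can read without permission.
// All property names below are standard browser APIs (navigator, screen, Date).
const browserInfo = {
  userAgent: navigator.userAgent,              // browser and operating system hints
  language: navigator.language,                // preferred language
  cookiesEnabled: navigator.cookieEnabled,     // whether cookies can be saved
  screenResolution: `${screen.width}x${screen.height}`,
  colorDepth: screen.colorDepth,
  timezoneOffsetMinutes: new Date().getTimezoneOffset(),
};

// A tracker would POST this to its own server alongside your IP address;
// here it only prints to the console.
console.log(JSON.stringify(browserInfo, null, 2));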

All of the above information is standard on the web among websites. Why screen resolution? Maybe the webmaster needs to understand how many columns a homepage should have, or whether you have cookies enabled to save particular session information. JavaScript is a standard language on the web used to fancy up data display, handle input and output, track visitor stats (including the above information), serve dynamic advertisements, and even to exploit vulnerabilities in websites or directly in web browsers. CVE Details is a great place to keep track of vulnerabilities in various software. Here’s a view of Chrome’s vulnerability stats:

https://www.cvedetails.com/product/15031/Google-Chrome.html?vendor_id=1224

Some of the information in this picture may make no sense to you, but as a quick crash course: some of these discovered exploits are used to steal data, inject malicious code into your browser, deny access to a website, or even help attack one. Computer code changes every day, and sometimes that leaves gigantic weak points open for those with malicious intent to use against random Internet citizens. It would be completely understandable, in strict circumstances, to disable JavaScript. That, of course, comes with a great sacrifice.

Security from websites

With web browser vulnerabilities constantly being discovered and your web browser information so easily transmitted to websites, an attacker has the ability to craft dynamic malicious scripts. When intended for malicious purposes, these scripts can determine all of the above (browser version, etc.) and execute an attack tailored to those variables. It’s difficult to say which exploit would be used, and why. Financial websites (including gambling) are at high risk for attacks; malicious individuals understand these are among the best places with the greatest data.
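To make the “dynamic” part concrete, here’s a harmless TypeScript sketch of my own showing how a script can branch on the browser it finds itself in. A malicious version would simply swap the log statements for an exploit matching that version.

// Harmless sketch: branch on the detected browser, the same way a
// malicious script would choose which exploit to attempt.
const ua = navigator.userAgent;
const chromeMatch = ua.match(/Chrome\/(\d+)/);
const firefoxMatch = ua.match(/Firefox\/(\d+)/);

if (chromeMatch) {
  console.log(`Chrome ${chromeMatch[1]} detected; a matching exploit would be chosen here.`);
} else if (firefoxMatch) {
  console.log(`Firefox ${firefoxMatch[1]} detected; a different payload would be chosen.`);
} else {
  console.log("Unknown browser; an attacker might give up or try something generic.");
}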

It used to be that obtaining even relatively decent security for your website was difficult. Webmasters who chose to deploy self-signed certificates were met with uneasy viewers facing browser certificate errors. This made visitors distrust the website because it did not fall into the accepted, secure category, the “green.” Today, services such as Let’s Encrypt give Internet citizens the ability to secure (SSL/HTTPS) their services at no cost. Data is now freely encrypted from the website visitor to the webmaster’s home, or at least that is the idea. Webmasters everywhere now have the ability to conduct safe online commerce, mail delivery, and more. Before this solution, costs started at $10 per year for a basic security certificate, and more appealing certificates started at $100 and up. Those prices do not appear to be declining.

Of course, even with secure websites in place, there are other factors to consider. Depending on whether they’re using a content distribution network, the webmaster has origins to maintain, web servers to configure, directory and file permissions to audit, and most importantly, security certificates to keep active. iSnick is currently routed through Cloudflare to ensure safety and additional bandwidth where needed.

iSnick.net requests over Cloudflare’s network

This means you’re not only sharing data with iSnick.net, but also with Cloudflare’s services. Whichever server Cloudflare decides to place you on is where your data is sent first. From Cloudflare, the data is then sent on to iSnick.net, securely, of course. It’s critical to consider that the connection from Cloudflare to iSnick.net must be enabled to serve over a secure connection; otherwise Cloudflare can or may send it over plain HTTP, and the data will be freely exposed, regardless of having first been securely sent to Cloudflare. It’s also critical to consider that iSnick.net’s web service must be configured to force any non-HTTPS connection to HTTPS; this is not automatic. Making the mistake of not configuring the web service this way may lead to user input (usernames, passwords) being stolen by an attacker in the middle, or the webmaster may simply forget to write a link as HTTPS. Such small things, but they amount to huge failures when the “right” circumstances fall into place. (Also, iSnick.net is not exactly configured this way because it’s just a blog; dex.isnick.net is, though.)

# Added by Certbot: redirect any plain-HTTP request for dex.isnick.net to its HTTPS equivalent
RewriteEngine on
RewriteCond %{SERVER_NAME} =dex.isnick.net
RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [END,QSA,R=permanent]

The above code in your Apache configuration file may save you a migraine, and it’s added by Certbot, found at https://certbot.eff.org/. Provided the initial non-HTTPS Apache configuration is settled and a web directory exists and is visible to the world, this bot will automatically configure an SSL version of the website’s configuration. You may always do this manually, but with that said, there is little excuse for webmasters not to take advantage of it. Viewer security does matter. There are also a few things in the wild that may give away some of your browsing history if you’re not too careful.

Referrer

This innocent gem delivers to the webmaster the site you navigated from. For example, if you searched for iSnick.net on Google, clicked its link, and ended up here, your referrer would be google.com. As harmless as it sounds, do you really want iSnick.net to know you just finished watching Justin Bieber? Perhaps you just finished viewing toiletries. Hey, I’m only trying to warn you. This is a simple string of data that can, may, or will leak sensitive information, even if it sounds harmless right now. These are not necessarily weak points of the web, but they are something webmasters and Internet citizens may want to take into consideration when attempting to remain anonymous or to lower security risks.
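Reading it takes one line on the page side. Here’s a quick TypeScript sketch of my own showing how any script on the landing page picks it up:

// The referring page is exposed to any script on the page you landed on.
const referrer = document.referrer; // e.g. "https://www.google.com/" or "" if none

if (referrer) {
  console.log(`You arrived here from: ${referrer}`);
  // A tracker would typically send this along with the rest of the visit data.
} else {
  console.log("No referrer: a direct visit, a bookmark, or a browser that withheld it.");
}

Webmasters who care about this can dial it back with the Referrer-Policy header, and privacy-minded visitors can configure their browsers to withhold the referrer entirely.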

User input

Critical to the web itself, web browsers rely on specific user input, namely mouse clicks, typing, usernames, and passwords. This data is transmitted, hopefully over a secure connection, and the results are then posted back to the user. These are basic tasks on any website; however, given a piece of malicious JavaScript and bad web development, attackers are able to grab user input directly from the webpage, or while the data is being transmitted.
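To show how little code that takes, here’s a deliberately harmless TypeScript sketch of my own. A real attacker would ship the captured value to their own server instead of printing it.

// Harmless demonstration: capture what's typed into a password field.
// A malicious script would send this to an attacker's server rather than log it.
const passwordField = document.querySelector<HTMLInputElement>('input[type="password"]');

passwordField?.addEventListener("input", (event) => {
  const value = (event.target as HTMLInputElement).value;
  console.log(`Captured so far: ${value}`);
});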

All of it together

With all of this initial data and talk presented, most of the data that is transmitted and considered private around the web depends on webmasters. It depends even more on the Internet citizen’s eye: their ability to spot oddness on a webpage and to understand when not to submit data. Internet citizens need to retract access to their data by simply not giving it. Use NoScript and HTTPS Everywhere, support Let’s Encrypt, and encourage everyone else to do the same. Avoid using universal login systems such as Facebook Connect, Twitter, Google+ and so on. By using universal logins, you’re enabling data harvesting, and you’re also exposed to their vulnerabilities on top of the service you’re using.

Webmasters do collect the above data for good purposes. We need to understand where robotic activity is happening and how to counter it, how to design our pages for visitors, how to debug errors on pages, how to help keep the general web safe, and how to provide content at low cost. At least, I’m looking for it to be safe and mostly free!

$1 Donation

Do you like what iSnick does? Consider a $1 donation. Used to buy coffee and server essentials.

