As SEOs, a big part of our job is to create compelling content that is indexed by Google and easy for searchers to find. That’s simple enough, but there are also plenty of things on a web server that you don’t want Google to find.
Log files, configuration files, personal data, customer databases and administration documents are just a few examples of files that shouldn’t be crawled by search engines. Should Google or another engine crawl and index sensitive data, that data becomes available to anyone who knows how to search for it.
How does this work? Hackers or curious searchers can use advanced query operators in search engines to specify the type of file and data they are looking for. Typically, they rely on some sort of footprint that will be present on a large number of sites. This footprint can come from text on the page or URL/site structure. The best way to understand this is to execute a few Google hacks yourself.
Below are five examples of advanced queries that utilize footprints left by files and folders that people typically do not want available for public consumption. None of these are particularly sinister and all have been widely known for a few years. Use these at your own risk, for research purposes only.
1. View ‘Confidential’ Government Documents
site:.gov filetype:pdf “This document is CONFIDENTIAL”
Google’s search operators let you narrow results by a number of attributes, including TLD (site:) and file type (filetype:). This query returns a list of PDF files on government sites that are ‘confidential’ to everybody but Google.
2. Take Control of Panasonic Webcams
This query takes advantage of a footprint left by Panasonic webcams that still use their default settings. Typically, no password is required and you can pan, zoom and tilt the camera. You can also find and control many Axis webcams with this query: inurl:indexFrame.shtml Axis. Fun stuff.
3. Find Plain Text Passwords
“your password is” filetype:log
These log files contain plain text passwords. Most of the time, the password will have been changed, but if not…
4. Gain Administrator Access To Print Servers
intitle:“Network Print Server” filetype:shtm
Toner isn’t cheap … if your printing network is unsecured, somebody far away can run wild printing test pages … and that is possibly the ‘nicest’ thing they could do.
5. View College Grades & Personal Info
Professors oftentimes post final grades online so all of the students have easy access. Unfortunately, so does Google. Professors often use personal information such as student IDs to tell the students apart.
These are just a few examples of the thousands of vulnerabilities that search engines can find. There are far more advanced queries that have been formulated to hack sites, steal credit card information, gain MySQL access and do all sorts of nasty things.
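One way to put these footprints to defensive use is to point them at your own domain. The snippet below is a minimal sketch of that idea, not an official tool: the audit_queries helper and the example.com domain are made up for illustration, and the footprints are simply the queries from this post. It only builds query strings that you can paste into Google by hand.

```python
# Hypothetical helper: scope the footprint queries from this post to a
# single domain you own, so you can check what has been indexed.
FOOTPRINTS = [
    'filetype:pdf "This document is CONFIDENTIAL"',
    '"your password is" filetype:log',
    'intitle:"Network Print Server" filetype:shtm',
]


def audit_queries(domain, footprints=FOOTPRINTS):
    """Return one site-restricted query string per footprint."""
    return [f"site:{domain} {footprint}" for footprint in footprints]


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own domain.
    for query in audit_queries("example.com"):
        print(query)
```

If any of these return results for your site, you have something to clean up.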
Here are a few tips on how you can prevent information leakage like this on your own site.
1. Always hire IT people who are knowledgeable about security. Failing to make security a priority from the start can lead to major problems down the road.
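2. Ask Google to remove a URL from its index.
3. Put robots.txt to use. This configuration file tells search engine bots which pages and folders on your site they aren’t allowed to crawl. Keep in mind that the file itself is public and only well-behaved bots honor it, so it is no substitute for tip 4.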
4. Password protect sensitive information. Duh.
5. Set up Google Alerts to signal you when potential information leakage occurs.
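To make the robots.txt tip concrete, here is a small sketch that checks a set of rules using Python’s standard urllib.robotparser module. The /logs/ and /admin/ paths and the is_crawlable helper are assumptions made up for this example; your own folders will differ. Remember that these rules only ask crawlers to stay away, and anyone can read the file, so sensitive folders still need a password.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /logs/ and /admin/ are placeholder paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs/
Disallow: /admin/
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the rules allow the given bot to fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/blog/post.html", "/logs/access.log", "/admin/"):
        print(path, is_crawlable(path))
```

Running a check like this before you upload the file is a cheap way to confirm that the folders you meant to block are actually covered.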