In this chapter, we will cover the following topics:
Gathering information using the Shodan API
Scripting a Google+ API search
Downloading profile pictures using the Google+ API
Harvesting additional results using the Google+ API pagination
Getting screenshots of websites using QtWebKit
Screenshots based on port lists
Spidering websites
Open Source Intelligence (OSINT) is the process of gathering information from Open (overt) sources. When it comes to testing a web application, that might seem a strange thing to do. However, a great deal of information can be learned about a particular website before even touching it. You might be able to find out what server-side language the website is written in, the underpinning framework, or even its credentials. Learning to use APIs and scripting these tasks can make the bulk of the gathering phase a lot easier.
In this chapter, we will look at a few of the ways we can use Python to leverage the power of APIs to gain insight into our target.
Shodan is essentially a vulnerability search engine. By providing it with a name, an IP address, or even a port, it returns all the systems in its databases that match. This makes it one of the most effective sources for intelligence when it comes to infrastructure. It's like Google for internet-connected devices. Shodan constantly scans the Internet and saves the results into a public database. Whilst this database is searchable from the Shodan website (https://www.shodan.io), the results and services reported on are limited, unless you access it through the Application Programming Interface (API).
Our task for this section will be to gain information about the Packt Publishing website by using the Shodan API.
At the time of writing this, Shodan membership is $49, and this is needed to get an API key. If you're serious about security, access to Shodan is invaluable.
If you don't already have an API key for Shodan, visit www.shodan.io/store/member and sign up for it. Shodan has a really nice Python library, which is also well documented at https://shodan.readthedocs.org/en/latest/.
To get your Python environment set up to work with Shodan, all you need to do is install the library from the Cheese Shop (the Python Package Index):
$ easy_install shodan
Here's the script that we are going to use for this task:
import shodan
import requests

SHODAN_API_KEY = "{Insert your Shodan API key}"
api = shodan.Shodan(SHODAN_API_KEY)

target = 'www.packtpub.com'

dnsResolve = 'https://api.shodan.io/dns/resolve?hostnames=' + target + '&key=' + SHODAN_API_KEY

try:
    # First we need to resolve our target's domain to an IP
    resolved = requests.get(dnsResolve)
    hostIP = resolved.json()[target]

    # Then we need to do a Shodan search on that IP
    host = api.host(hostIP)
    print "IP: %s" % host['ip_str']
    print "Organization: %s" % host.get('org', 'n/a')
    print "Operating System: %s" % host.get('os', 'n/a')

    # Print all banners
    for item in host['data']:
        print "Port: %s" % item['port']
        print "Banner: %s" % item['data']

    # Print vuln information
    for item in host['vulns']:
        CVE = item.replace('!','')
        print 'Vulns: %s' % item
        exploits = api.exploits.search(CVE)
        for item in exploits['matches']:
            if item.get('cve')[0] == CVE:
                print item.get('description')
except Exception as e:
    # Report the failure instead of silently discarding it
    print 'An error occurred: %s' % e
The preceding script should produce an output similar to the following:
IP: 83.166.169.231
Organization: Node4 Limited
Operating System: None
Port: 443
Banner: HTTP/1.0 200 OK
Server: nginx/1.4.5
Date: Thu, 05 Feb 2015 15:29:35 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: public, s-maxage=172800
Age: 1765
Via: 1.1 varnish
X-Country-Code: US
Port: 80
Banner: HTTP/1.0 301 https://www.packtpub.com/
Location: https://www.packtpub.com/
Accept-Ranges: bytes
Date: Fri, 09 Jan 2015 12:08:05 GMT
Age: 0
Via: 1.1 varnish
Connection: close
X-Country-Code: US
Server: packt
Vulns: !CVE-2014-0160
The (1) TLS and (2) DTLS implementations in OpenSSL 1.0.1 before 1.0.1g do not properly handle Heartbeat Extension packets, which allows remote attackers to obtain sensitive information from process memory via crafted packets that trigger a buffer over-read, as demonstrated by reading private keys, related to d1_both.c and t1_lib.c, aka the Heartbleed bug.
I've just chosen a few of the available data items that Shodan returns, but you can see that we get a fair bit of information back. In this particular instance, we can see that there is a potential vulnerability identified. We also see that this server is listening on ports 80 and 443 and that, according to the banner information, it appears to be running nginx as the HTTP server.
Firstly, we set up our static strings within the code; this includes our API key:
SHODAN_API_KEY = "{Insert your Shodan API key}"
target = 'www.packtpub.com'
dnsResolve = 'https://api.shodan.io/dns/resolve?hostnames=' + target + '&key=' + SHODAN_API_KEY
The next step is to create our API object:
api = shodan.Shodan(SHODAN_API_KEY)
In order to search for information on a host using the API, we need to know the host's IP address. Shodan has a DNS resolver but it's not included in the Python library. To use Shodan's DNS resolver, we simply have to make a GET request to the Shodan DNS Resolver URL and pass it the domain (or domains) we are interested in:
resolved = requests.get(dnsResolve)
hostIP = resolved.json()[target]
The returned JSON data will be a dictionary of domains to IP addresses; as we only have one target in our case, we can simply pull out the IP address of our host using the target string as the key for the dictionary. If you were searching on multiple domains, you would probably want to iterate over this dictionary to obtain all the IP addresses.
Now that we have the host's IP address, we can use the Shodan library's host function to obtain information on our host. The returned JSON data contains a wealth of information about the host, though in our case we will just pull out the IP address, organization, and if possible the operating system that is running. Then we will loop over all of the ports that were found to be open and their respective banners:
host = api.host(hostIP)
print "IP: %s" % host['ip_str']
print "Organization: %s" % host.get('org', 'n/a')
print "Operating System: %s" % host.get('os', 'n/a')

# Print all banners
for item in host['data']:
    print "Port: %s" % item['port']
    print "Banner: %s" % item['data']
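For the multiple-domain case mentioned above, a minimal sketch might look like the following. It is written for Python 3 (unlike the Python 2 scripts in this chapter) and assumes the resolver accepts a comma-separated hostnames parameter and returns a dictionary that maps each hostname to an IP address, or to null when it does not resolve. The helper names build_resolve_url and resolved_ips are made up for illustration and are not part of the Shodan library.

```python
from urllib.parse import urlencode

RESOLVE_ENDPOINT = "https://api.shodan.io/dns/resolve"


def build_resolve_url(hostnames, api_key):
    """Build one resolver URL for several hostnames (comma-separated)."""
    query = urlencode({"hostnames": ",".join(hostnames), "key": api_key}, safe=",")
    return RESOLVE_ENDPOINT + "?" + query


def resolved_ips(resolved):
    """Keep only the hostnames that actually resolved to an address."""
    return {host: ip for host, ip in resolved.items() if ip}
```

The dictionary returned by resolved.json() could be passed to resolved_ips, and each remaining address could then be handed to api.host in turn.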
The returned data may also contain potential Common Vulnerabilities and Exposures (CVE) numbers for vulnerabilities that Shodan thinks the server may be susceptible to. This could be really beneficial to us, so we will iterate over the list of these (if there are any) and use another function from the Shodan library to get information on the exploit:
for item in host['vulns']:
    CVE = item.replace('!','')
    print 'Vulns: %s' % item
    exploits = api.exploits.search(CVE)
    for item in exploits['matches']:
        if item.get('cve')[0] == CVE:
            print item.get('description')
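The filtering step in this loop can be pulled out into small functions that are easy to test without calling the API. The following Python 3 sketch assumes the exploit search result is a dictionary with a matches list, as the loop above implies; the function names and the sample data are invented for illustration. Unlike the script, it checks whether the CVE appears anywhere in a match's cve list rather than only in the first position, and it tolerates matches that have no cve field.

```python
def normalise_cve(identifier):
    """Strip the '!' marker that can prefix a CVE identifier in the host data."""
    return identifier.replace("!", "")


def matching_descriptions(exploits, cve):
    """Return descriptions of exploit matches whose CVE list contains cve."""
    descriptions = []
    for match in exploits.get("matches", []):
        cves = match.get("cve") or []   # tolerate a missing or empty 'cve' field
        if cve in cves and match.get("description"):
            descriptions.append(match["description"])
    return descriptions


# Made-up sample shaped like the structure the script iterates over
sample = {"matches": [
    {"cve": ["CVE-2014-0160"], "description": "Heartbeat buffer over-read"},
    {"description": "entry without a CVE list"},
]}
print(matching_descriptions(sample, normalise_cve("!CVE-2014-0160")))
```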
That's it for our script. Try running it against your own server.
We've only really scratched the surface of the Shodan Python library with our script. It is well worth reading through the Shodan API reference documentation and playing around with the other search options. You can filter results based on "facets" to narrow down your searches. You can even use searches that other users have saved using the "tags" search.
Social media is a great way to gather information on a target company or person. Here, we will be showing you how to script a Google+ API search to find contact information for a company within the Google+ social sites.
Some Google APIs require authorization to access them, but if you have a Google account, getting the API key is easy. Just go to https://console.developers.google.com and create a new project. Click on API & auth | Credentials. Click on Create new key and Server key. Optionally enter your IP or just click on Create. Your API key will be displayed and ready to copy and paste into the following recipe.
Here's a simple script to query the Google+ API:
import urllib2

GOOGLE_API_KEY = "{Insert your Google API key}"
target = "packtpub.com"

api_response = urllib2.urlopen("https://www.googleapis.com/plus/v1/people?query="+target+"&key="+GOOGLE_API_KEY).read()
api_response = api_response.split("\n")

for line in api_response:
    if "displayName" in line:
        print line
The preceding code makes a request to the Google+ search API (authenticated with your API key) and searches for accounts matching the target, packtpub.com. Similarly to the preceding Shodan script, we set up our static strings including the API key and target:
GOOGLE_API_KEY = "{Insert your Google API key}"
target = "packtpub.com"
The next step does two things: first, it sends the HTTP GET request to the API server, then it reads in the response and stores the output into an api_response variable:
api_response = urllib2.urlopen("https://www.googleapis.com/plus/v1/people?query="+target+"&key="+GOOGLE_API_KEY).read()
This request returns a JSON formatted response; an example snippet of the results is shown here:

In our script, we convert the response into a list so it's easier to parse:
api_response = api_response.split("\n")
The final part of the code loops through the list and prints only the lines that contain displayName, as shown here:

In the next recipe, Downloading profile pictures using the Google+ API, we will look at improving the formatting of these results.
By starting with a simple script to query the Google+ API, we can extend it to be more efficient and make use of more of the data returned. Another key aspect of the Google+ platform is that users may also have a matching account on another of Google's services, which means you can cross-reference accounts. Most Google products have an API available to developers, so a good place to start is https://developers.google.com/products/. Grab an API key and plug the output from the previous script into it.
Now that we have established how to use the Google+ API, we can design a script to pull down pictures. The aim here is to put faces to names taken from web pages. We will send a request to the API through a URL, handle the response through JSON, and create picture files in the working directory of the script.
Here's a simple script to download profile pictures using the Google+ API:
import urllib2
import json

GOOGLE_API_KEY = "{Insert your Google API key}"
target = "packtpub.com"

api_response = urllib2.urlopen("https://www.googleapis.com/plus/v1/people?query="+target+"&key="+GOOGLE_API_KEY).read()
json_response = json.loads(api_response)

for result in json_response['items']:
    name = result['displayName']
    print name
    image = result['image']['url'].split('?')[0]
    f = open(name+'.jpg','wb+')
    f.write(urllib2.urlopen(image).read())
    f.close()
The first change is to store the display name into a variable, as this is then reused later on:
name = result['displayName'] print name
Next, we grab the image URL from the JSON response:
image = result['image']['url'].split('?')[0]
The final part of the code does a number of things in three simple lines. Firstly, it opens a file on the local disk, with the filename set to the name variable. The wb+ flag here indicates to the OS that it should create the file if it doesn't exist and write the data in a raw binary format. The second line makes an HTTP GET request to the image URL (stored in the image variable) and writes the response into the file. Finally, the file is closed, which flushes any buffered data to disk and releases the file handle:
f = open(name+'.jpg','wb+')
f.write(urllib2.urlopen(image).read())
f.close()
After the script is run, the console output will be the same as before, with the display names shown. However, your local directory will now also contain all the profile images, saved as JPEG files.
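Display names come from user-controlled profiles, so they may contain characters such as slashes that are awkward in a filename. The following Python 3 sketch shows one possible way to guard against that and to let a with block close the file automatically. The helper names are hypothetical, and the character whitelist is a simplification that would collapse names written entirely in non-ASCII scripts to the same fallback name.

```python
import os
import re
import tempfile


def safe_filename(display_name, extension=".jpg"):
    """Replace characters that are awkward in filenames with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", display_name).strip("_")
    return (cleaned or "unnamed") + extension


def save_image(directory, display_name, data):
    """Write raw image bytes to disk; 'with' closes the file even on error."""
    path = os.path.join(directory, safe_filename(display_name))
    with open(path, "wb") as handle:
        handle.write(data)
    return path


# Demonstration with placeholder bytes instead of a real download
demo_dir = tempfile.mkdtemp()
print(save_image(demo_dir, "Packt/Publishing Ltd", b"\xff\xd8\xff"))
```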
By default, the Google+ APIs return a maximum of 25 results, but we can extend the previous scripts by increasing the maximum value and harvesting more results through pagination. As before, we will communicate with the Google+ API through a URL and the urllib2 library. We will use a loop counter that increases with each request, together with the page token returned by the API, so we can move across pages and gather more results.
The following script shows how you can harvest additional results from the Google+ API:
import urllib2
import json

GOOGLE_API_KEY = "{Insert your Google API key}"
target = "packtpub.com"
token = ""
loops = 0

while loops < 10:
    api_response = urllib2.urlopen("https://www.googleapis.com/plus/v1/people?query="+target+"&key="+GOOGLE_API_KEY+"&maxResults=50&pageToken="+token).read()
    json_response = json.loads(api_response)
    token = json_response['nextPageToken']
    if len(json_response['items']) == 0:
        break
    for result in json_response['items']:
        name = result['displayName']
        print name
        image = result['image']['url'].split('?')[0]
        f = open(name+'.jpg','wb+')
        f.write(urllib2.urlopen(image).read())
        f.close()
    loops+=1
The first big change in this script is that the main code has been moved into a while loop:
token = ""
loops = 0
while loops < 10:
Here, the number of loops is set to a maximum of 10 to avoid sending too many requests to the API servers. This value can of course be changed to any positive integer. The next change is to the request URL itself; it now contains two additional trailing parameters, maxResults and pageToken. Each response from the Google+ API contains a nextPageToken value, which is a pointer to the next set of results. Note that if there are no more results, a nextPageToken value is still returned. The maxResults parameter is self-explanatory, but can only be increased to a maximum of 50:
api_response = urllib2.urlopen("https://www.googleapis.com/plus/v1/people?query="+target+"&key="+GOOGLE_API_KEY+"&maxResults=50&pageToken="+token).read()
The next part reads in the JSON response in the same way as before, but this time it also extracts the nextPageToken value:
json_response = json.loads(api_response)
token = json_response['nextPageToken']
The main while loop stops once the loops variable reaches 10, but sometimes you may only get one page of results. The next part of the code checks how many results were returned; if there were none, it exits the loop early:
if len(json_response['items']) == 0:
    break
Finally, we ensure that we increase the value of the loops integer each time. A common coding mistake is to leave this out, meaning the loop will continue forever:
loops+=1
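The pagination logic described above can be separated from the HTTP request so that it can be exercised without an API key. The following Python 3 sketch assumes each decoded response is a dictionary with an items list and an optional nextPageToken; fetch_page is a hypothetical caller-supplied function, and the fake_pages data is invented. As an extra safeguard that the script does not have, the loop also stops when no token is returned.

```python
def harvest(fetch_page, max_pages=10):
    """Collect items page by page until the token or the results run out.

    fetch_page is a caller-supplied function (hypothetical here) that takes a
    page token string and returns the decoded JSON response as a dict.
    """
    items = []
    token = ""
    for _ in range(max_pages):          # hard cap on the number of requests
        response = fetch_page(token)
        page_items = response.get("items", [])
        if not page_items:
            break                       # empty page: nothing more to harvest
        items.extend(page_items)
        token = response.get("nextPageToken", "")
        if not token:
            break                       # no pointer to a further page
    return items


# Stand-in for the API: two pages of made-up results keyed by token
fake_pages = {
    "": {"items": [{"displayName": "A"}], "nextPageToken": "t1"},
    "t1": {"items": [{"displayName": "B"}]},
}
print([item["displayName"] for item in harvest(fake_pages.get)])
```

Using a for loop over range(max_pages) keeps the request cap without a separate counter that has to be incremented by hand.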
They say a picture is worth a thousand words. Sometimes, it's good to get screenshots of websites during the intelligence gathering phase. We may want to scan an IP range and get an idea of which IPs are serving up web pages, and more importantly what they look like. This could assist us in picking out interesting sites to focus on, and we also might want to quickly scan ports on a particular IP address for the same reason. We will take a look at how we can accomplish this using the QtWebKit Python library.
QtWebKit is a bit of a pain to install. The easiest way is to get the binaries from http://www.riverbankcomputing.com/software/pyqt/download. For Windows users, make sure you pick the binaries that match your Python version and architecture. For example, I will use the PyQt4-4.11.3-gpl-Py2.7-Qt4.8.6-x32.exe binary to install Qt4 on my Windows 32-bit virtual machine that has Python version 2.7 installed. If you are planning on compiling Qt4 from the source files, make sure you have already installed SIP.
Once you've got PyQt4 installed, you're pretty much ready to go. The following script is what we will use as the base for our screenshot class:
import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

class Screenshot(QWebView):
    def __init__(self):
        self.app = QApplication(sys.argv)
        QWebView.__init__(self)
        self._loaded = False
        self.loadFinished.connect(self._loadFinished)

    def wait_load(self, delay=0):
        while not self._loaded:
            self.app.processEvents()
            time.sleep(delay)
        self._loaded = False

    def _loadFinished(self, result):
        self._loaded = True

    def get_image(self, url):
        self.load(QUrl(url))
        self.wait_load()
        frame = self.page().mainFrame()
        self.page().setViewportSize(frame.contentsSize())
        image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
        painter = QPainter(image)
        frame.render(painter)
        painter.end()
        return image
Create the preceding script and save it in the Python Lib folder. We can then reference it as an import in our scripts.
The script makes use of QWebView to load the URL and then creates an image using QPainter. The get_image function takes a single parameter: our target. Knowing this, we can simply import it into another script and expand the functionality.
Let's break down the script and see how it works.
Firstly, we set up our imports:
import sys
import time
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
Then, we create our class definition; the class we are creating extends QWebView by inheritance:
class Screenshot(QWebView):
Next, we create our initialization method:
def __init__(self):
    self.app = QApplication(sys.argv)
    QWebView.__init__(self)
    self._loaded = False
    self.loadFinished.connect(self._loadFinished)

def wait_load(self, delay=0):
    while not self._loaded:
        self.app.processEvents()
        time.sleep(delay)
    self._loaded = False

def _loadFinished(self, result):
    self._loaded = True
The initialization method sets the self._loaded property. This is used along with the _loadFinished and wait_load functions to check the state of the application as it runs. It waits until the site has loaded before taking a screenshot. The actual screenshot code is contained in the get_image function:
def get_image(self, url):
    self.load(QUrl(url))
    self.wait_load()
    frame = self.page().mainFrame()
    self.page().setViewportSize(frame.contentsSize())
    image = QImage(self.page().viewportSize(), QImage.Format_ARGB32)
    painter = QPainter(image)
    frame.render(painter)
    painter.end()
    return image
Within this get_image function, we set the size of the viewport to the size of the contents within the main frame. We then set the image format, assign the image to a painter object, and then render the frame using the painter. Finally, we return the processed image.
To use the class we've just made, we just import it into another script. For example, if we wanted to just save the image we get back, we could do something like the following:
import screenshot
s = screenshot.Screenshot()
image = s.get_image('http://www.packtpub.com')
image.save('website.png')
That's all there is to it. In the next script, we will create something a little more useful.
In the previous script, we created our base function to return an image for a URL. We will now expand on that to loop over a list of ports that are commonly associated with web-based administration portals. This will allow us to point the script at an IP and automatically run through the possible ports that could be associated with a web server. This is to be used in cases when we don't know which ports are open on a server, rather than when we are specifying the port and domain.
In order for this script to work, we'll need the script created in the Getting screenshots of websites using QtWebKit recipe. This should be saved in the Pythonxx/Lib folder and named something clear and memorable. Here, we've named that script screenshot.py. The naming of your script is particularly important, as we reference it in an import declaration.
This is the script that we will be using:
import screenshot
import requests

portList = [80,443,2082,2083,2086,2087,2095,2096,8080,8880,8443,9998,4643,9001,4489]

IP = '127.0.0.1'

http = 'http://'
https = 'https://'

def testAndSave(protocol, portNumber):
    url = protocol + IP + ':' + str(portNumber)
    try:
        r = requests.get(url,timeout=1)

        if r.status_code == 200:
            print 'Found site on ' + url
            s = screenshot.Screenshot()
            image = s.get_image(url)
            image.save(str(portNumber) + '.png')
    except:
        pass

for port in portList:
    testAndSave(http, port)
    testAndSave(https, port)
We first create our import declarations. In this script, we use the screenshot script we created before and also the requests library. The requests library is used so that we can check the status of a request before trying to convert it to an image. We don't want to waste time trying to convert sites that don't exist.
Next, we import our libraries:
import screenshot
import requests
The next step sets up the array of common port numbers that we will be iterating over. We also set up a string with the IP address we will be using:
portList = [80,443,2082,2083,2086,2087,2095,2096,8080,8880,8443,9998,4643,9001,4489]
IP = '127.0.0.1'
Next, we create strings to hold the protocol part of the URL that we will be building later; this just makes the code later on a little bit neater:
http = 'http://'
https = 'https://'
Next, we create our method, which will do the work of building the URL string. After we've created the URL, we check whether we get a 200 response code back for our get request. If the request is successful, we convert the web page returned to an image and save it with the filename being the successful port number. The code is wrapped in a try block because if the site doesn't exist when we make the request, it will throw an error:
def testAndSave(protocol, portNumber):
    url = protocol + IP + ':' + str(portNumber)
    try:
        r = requests.get(url,timeout=1)

        if r.status_code == 200:
            print 'Found site on ' + url
            s = screenshot.Screenshot()
            image = s.get_image(url)
            image.save(str(portNumber) + '.png')
    except:
        pass
Now that our method is ready, we simply iterate over each port in the port list and call our method. We do this once for the HTTP protocol and then with HTTPS:
for port in portList:
    testAndSave(http, port)
    testAndSave(https, port)
And that's it. Simply run the script and it will save the images to the same location as the script.
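One way to keep the URL-building step separate from the request logic is to generate the full list of candidate URLs first. The Python 3 sketch below uses itertools.product for this; the function names are hypothetical. Because the script names each image after the port alone, a site that answers on both HTTP and HTTPS on the same port would write to the same file, so the sketch also shows a filename that includes the scheme.

```python
from itertools import product


def candidate_urls(ip, ports, schemes=("http", "https")):
    """Return every scheme/port combination for one address, in port order."""
    return ["%s://%s:%d" % (scheme, ip, port)
            for port, scheme in product(ports, schemes)]


def image_name(scheme, port):
    """Name the screenshot after both scheme and port to avoid overwrites."""
    return "%s_%d.png" % (scheme, port)


print(candidate_urls("127.0.0.1", [80, 8080]))
```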
You might notice that the script takes a while to run. This is because it has to check each port in turn. In practice, you would probably want to make this a multithreaded script so that it can check multiple URLs at the same time. Let's take a quick look at how we can modify the code to achieve this.
First, we'll need a couple more import declarations:
import Queue
import threading
Next, we need to create a new function that we will call threader. This new function will handle putting the results of our testAndSave calls into the queue:
def threader(q, port):
    q.put(testAndSave(http, port))
    q.put(testAndSave(https, port))
Now that we have our new function, we just need to set up a new Queue object and make a few threading calls. We will take out the testAndSave calls from our for loop over the portList variable and replace them with this code:
q = Queue.Queue()

for port in portList:
    t = threading.Thread(target=threader, args=(q, port))
    t.daemon = True
    t.start()

# Each thread puts two results on the queue; wait for all of them
for _ in range(len(portList) * 2):
    s = q.get()
So, our new script in total now looks like this:
import Queue
import threading
import screenshot
import requests

portList = [80,443,2082,2083,2086,2087,2095,2096,8080,8880,8443,9998,4643,9001,4489]

IP = '127.0.0.1'

http = 'http://'
https = 'https://'

def testAndSave(protocol, portNumber):
    url = protocol + IP + ':' + str(portNumber)
    try:
        r = requests.get(url,timeout=1)

        if r.status_code == 200:
            print 'Found site on ' + url
            s = screenshot.Screenshot()
            image = s.get_image(url)
            image.save(str(portNumber) + '.png')
    except:
        pass

def threader(q, port):
    q.put(testAndSave(http, port))
    q.put(testAndSave(https, port))

q = Queue.Queue()

for port in portList:
    t = threading.Thread(target=threader, args=(q, port))
    t.daemon = True
    t.start()

# Each thread puts two results on the queue; wait for all of them
for _ in range(len(portList) * 2):
    s = q.get()
If we run this now, we will get a much quicker execution of our code as the web requests are now being executed in parallel with each other.
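As an alternative to managing threads and a queue by hand, the same idea can be sketched with a thread pool. The example below is Python 3 and uses concurrent.futures from the standard library rather than the Queue and threading modules used in this recipe. It only parallelises a check function, and looks_alive is a placeholder that makes no network request. Rendering screenshots is left out deliberately, on the assumption that Qt GUI objects are best created and used on the main thread.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, check, workers=8):
    """Run check(url) on a pool of threads and return the URLs that passed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, urls))   # results keep the input order
    return [url for url, ok in zip(urls, results) if ok]


# Placeholder check; a real one might wrap requests.get and test status_code
def looks_alive(url):
    return url.endswith(":80") or url.endswith(":443")


print(check_all(["http://127.0.0.1:80", "http://127.0.0.1:8080"], looks_alive))
```

The list returned by check_all could then be passed, one URL at a time, to the screenshot class from the main thread.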
You could try to further expand the script to work on a range of IP addresses too; this can be handy when you're testing an internal network range.
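For the IP range idea, Python 3's standard ipaddress module can expand a CIDR block into individual addresses. This is a minimal sketch; hosts_in_range is a made-up helper name, and the addresses are examples only.

```python
import ipaddress


def hosts_in_range(cidr):
    """Expand a CIDR block into a list of usable host addresses as strings."""
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(address) for address in network.hosts()]


print(hosts_in_range("192.168.1.0/30"))
```

Each address in the returned list could take the place of the fixed IP string used in the script.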
Many tools provide the ability to map out websites, but often you are limited to style of output or the location in which the results are provided. This base plate for a spidering script allows you to map out websites in short order with the ability to alter them as you please.
In order for this script to work, you'll need the BeautifulSoup library, which is installable from the apt command with apt-get install python-bs4 or alternatively pip install beautifulsoup4. It's as easy as that.
This is the script that we will be using:
import urllib2
from bs4 import BeautifulSoup
import sys

urls = []
urls2 = []

tarurl = sys.argv[1]

url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
    newline = line.get('href')
    try:
        if newline[:4] == "http":
            if tarurl in newline:
                urls.append(str(newline))
        elif newline[:1] == "/":
            combline = tarurl+newline
            urls.append(str(combline))
    except:
        pass

for uurl in urls:
    url = urllib2.urlopen(uurl).read()
    soup = BeautifulSoup(url)
    for line in soup.find_all('a'):
        newline = line.get('href')
        try:
            if newline[:4] == "http":
                if tarurl in newline:
                    urls2.append(str(newline))
            elif newline[:1] == "/":
                combline = tarurl+newline
                urls2.append(str(combline))
        except:
            pass

urls3 = set(urls2)
for value in urls3:
    print value
We first import the necessary libraries and create two empty lists called urls and urls2. These will allow us to run through the spidering process twice. Next, we read the target URL from the first command-line argument. The script will be run like this:
$ python spider.py http://www.packtpub.com
We then open the URL held in the tarurl variable and pass the response to the BeautifulSoup tool:
url = urllib2.urlopen(tarurl).read()
soup = BeautifulSoup(url)
The BeautifulSoup tool splits the content into parts and allows us to pull out only the parts that we want:
for line in soup.find_all('a'):
    newline = line.get('href')
Here, we pull all of the content that is marked with an a (anchor) tag in the HTML and grab the attribute within the tag specified as href. This allows us to grab all the URLs listed in the page.
The next section handles relative and absolute links. If a link is relative, it starts with a slash to indicate that it is a page hosted locally to the web server. If a link is absolute, it contains the full address including the domain. What we do with the following code is ensure that we can, as external users, open all the links we find and list them as absolute links:
if newline[:4] == "http":
    if tarurl in newline:
        urls.append(str(newline))
elif newline[:1] == "/":
    combline = tarurl+newline
    urls.append(str(combline))
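The same normalisation can be sketched with the standard library's URL helpers, which also cope with relative links that do not start with a slash and avoid a doubled slash when the target URL already ends with one. This is a Python 3 example, and absolute_in_scope is a hypothetical name. Note that it compares host names rather than using the substring test from the script, so it is a related technique rather than a line-for-line equivalent.

```python
from urllib.parse import urljoin, urlparse


def absolute_in_scope(base_url, href):
    """Resolve href against base_url; return it only if it stays on the same host."""
    if not href:
        return None
    absolute = urljoin(base_url, href)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None                      # skip mailto:, javascript: and similar
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None                      # link leaves the target site
    return absolute
```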
We then repeat the process once more with the URLs that we identified from that page, by iterating through each element in the urls list:
for uurl in urls:
Other than a change in the referenced lists and variables, the code remains the same.
Finally, for ease of output, we take the urls2 list and turn it into a set. This removes duplicates from the list and allows us to output it neatly. We iterate through the values in the set and output them one by one.
This tool can be tied in with any of the functionality shown earlier and later in this book. It can be tied to Getting screenshots of websites using QtWebKit to allow you to take screenshots of every page. You can tie it to the email address finder in Chapter 2, Enumeration, to gain email addresses from every page, or you can find another use for this simple technique to map web pages.
The script can be easily changed to add in levels of depth to go from the current level of 2 links deep to any value set by system argument. The output can be changed to add in URLs present on each page, or to turn it into a CSV to allow you to map vulnerabilities to pages for easy notation.
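A configurable depth could be sketched as a breadth-first crawl, as in the Python 3 example below. To stay self-contained it uses the standard library's html.parser instead of BeautifulSoup, and it takes a fetch function as a parameter so that it can be tried against made-up pages; LinkParser, crawl, and the pages dictionary are all hypothetical. Unlike the script above, it returns every URL seen, including the starting page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; fetch(url) is a caller-supplied function returning HTML."""
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            parser = LinkParser()
            parser.feed(fetch(url))
            for href in parser.links:
                link = urljoin(url, href)
                # Stay on the target site and never queue a page twice
                if link.startswith(start_url) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)


# Made-up two-page site standing in for real HTTP responses
pages = {
    "http://site.test/": '<a href="/a">A</a>',
    "http://site.test/a": '<a href="/b">B</a><a href="http://other.test/">x</a>',
}
print(crawl("http://site.test/", lambda url: pages.get(url, "")))
```

The max_depth value could be read from a command-line argument in the same way that the script reads the target URL.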