Scanning for confidential information on external web servers

One of my clients wanted us to scan their web servers for confidential information. This was going to be done both from the Internet, and from an internal intranet location (between cooperative but separate organizations). In particular they were concerned about social security numbers and credit cards being exposed, and wanted us to double-check their servers. These were large Class B network.

I wanted to do something like the Unix “grep”, and search for regular expressions on their web pages. It would be easier if I could log onto the server and get direct access to the file system. But that’s not what the customer wanted.

I looked at a lot of utilities that I could run on my Kali machine. I looked at several tools. It didn’t look hopeful at first. This is what I came up with, using Kali and shell scripts.  I hope it helps others. And if someone finds a better way, please let me know,


Start with Nmap

As I had an entire network to scan, I started with nmap to discover hosts.

NMAP-GREP to the rescue

By chance nmap 7.0 was released that day, and I was using it to map out the network I was testing. I downloaded the new version, and noticed it had the http-grep script. This looked perfect, as it had social security numbers and credit card numbers built in! When I first tried it there was a bug. I tweeted about it and in hours Daniel “bonsaiviking” Miller  fixed it. He’s just an awesome guy.

Anyhow, here is the command I used to check the web servers:

nmap -vv -p T:80,443  $NETWORK --script \
http-grep --script-args \
'http-grep.builtins, http-grep.maxpagecount=-1, http-grep.maxdepth=-1 '

By using ‘http-grep.builtins’ – I could search fo all of the types of confidential information http-grep understood. And by setting maxpagecount and maxdepth to -1, I turned off the limits. It outputs something like:

Nmap scan report for (
Host is up, received syn-ack ttl 45 (0.047s latency).
Scanned at 2015-10-25 10:21:56 EST for 741s
80/tcp open http syn-ack ttl 45
| http-grep:
| (1)
|   (1) email:
|     +
|   (2) phone:
|     + 555-1212

Excellent! Just what I need. A simple grep of the output for ‘ssn:’ would show me any social security numbers (I had tested it on another web server to make sure it worked.) It’ always a good idea to not put too much faith in your tools.

I first used nmap to identify the hosts, and then I iterated through each host, and did a separate scan for each host, storing the outputs in separate files. So my script was  little different. I ended up with a file that contained the URL’s of the top web page of the servers (e.g.,, etc.) So the basic loop would be something like

while IFS= read url
    nmap [arguments....] "$url"
done <list_of_urls.txt

Later on, I used wget instead of nmap, but I’m getting ahead of myself.

Problem #1:  limiting scanning to a specific time of day

We had to perform all actions during a specific time window, so I wanted to be able to break this into smaller steps, allowing me to quit and restart.  I first identified the hosts, and scanned each one separately, in a loop. I also added a double-check to ensure that I didn’t scan past 3PM (as per our client’s request, and that I didn’t fill up the disk. So I added this check in the middle of my loop

LIMIT=5 # always keep 5% of the disk free
HOUR=$(date "+%H") # Get hour in 0..24 format 
AVAIL=$(df . | awk '/dev/ {print $5}'|tr -d '%') # get the available disk space 
if [ "$AVAIL" -lt "$LIMIT" ] 
        echo "Out of space. I have $AVAIL and I need $LIMIT" 
if [ "$HOUR" -ge 15 ] # 3PM or 12 + 3 == 15 
        echo "After 3 PM - Abort"

Problem #2:  Scanning non-text files.

The second problem I had is that a lot of the files on the server were PDF files, Excel spreadsheets, etc. using the http-grep would not help me, as it doesn’t know how to examine non-ASCII files. I therefore needed to mirror the servers.

Creating a mirror of a web site

I needed to find and download all of the files on a list of web servers. After searching for some tools to use, I decided to use wget. To be honest – I wasn’t happy with the choice, but it seemed to be the best choice.

I used wget’s  mirror (-m) option. I also disabled certificate checking (Some servers were using internal certificate an internal network. I also used the –continue command in case I had to redo the scan. I disabled the normal spider behavior of ignoring directories specified the the robots.txt file, and I also changed my user agent to be “Mozilla”

wget -m –no-check-certificate  –continue –convert-links   -p –no-clobber -e robots=off -U mozilla “$URL”

Some servers may not like this fast and furious download. You can slow it down by using these options: “–limit-rate=200k  –random-wait –wait=2 ”

I sent the output to a log file. Let’s call it wget.out. I was watching the output, using

tail -f wget.out

I watched the output for errors.  I did notice that there was a noticeable delay  in a host name lookup. I did a name service lookup, and added the hostname/ip address to my machine’s /etc/hosts file. This made the mirroring faster. I also was counting the number of fies being created, using

find . -type f | wc

Problem #3:  Self-referential links cause slow site mirroring.

I noticed that an hour had passed, and only  10 new files we being downloaded. This was a problem. I also noticed that some of the files being downloaded had several consecutive “/” in the path name. That’s not good.

I first grepped for the string ‘///’ and then I spotted the problem. To make sure, I typed

grep /dir1/dir2/webpage.php wgrep.log | awk '{print $3}' | sort | uniq -c | sort -nr 
         15 `webserver/dir1/dir2/webpage.php' 
          2 http://webserver/dir1/dir2/webpage.php 
          2 http://webserver//dir1/dir2/webpage.php 
          2 http://webserver///dir1/dir2/webpage.php 
          2 http://webserver////dir1/dir2/webpage.php 
          2 http://webserver/////dir1/dir2/webpage.php 
          2 http://webserver//////dir1/dir2/webpage.php 
          2 http://webserver///////dir1/dir2/webpage.php 
          2 http://webserver////////dir1/dir2/webpage.php 
          2 http://webserver/////////dir1/dir2/webpage.php 
          2 http://webserver//////////dir1/dir2/webpage.php 

Not a good thing to see. Time for plan B.

Mirroring a web site with wget –spider

I use a method I had tried before – the wget –spider function. This does not download the files. It just gets their name. As it turns out, this is better in many ways. It doesn’t go “recursive” on you, and it also allows you to scan the results, and obtain a list of URL’s. You can edit this list and not download certain files.

Method 2 was done using the following command:

wget --spider --no-check-certificate --continue --convert-links -r -p --no-clobber -e robots=off -U mozilla "$URL"

I sent the output to a file. But it contains filenames, error messages, and a lot of other information. To get the URL’s from this file, I then extracted all of the URLS using

cat wget.out | grep '^--' | \ grep -v '(try:' | awk '{ print $3 }' | \ grep -v '\.\(png\|gif\|jpg\)$' | sed 's:?.*$::' | grep -v '/$' | sort | uniq >urls.out

This parses the wget output file. It removes all *.png *.gif and *.jpg files. It also strips out any parameters on a URL (i.e. index.html?parm=1&parm=2&parm3=3 becomes index.html). It also removes any URL that ends with a “/”. I then eliminate any duplicate URL’s using sort and uniq.

Now I have a list of URLS. Wget has a way for you to download multiple files using the -i option:

wget -i urls.out --no-check-certificate --continue \
--convert-links -p --no-clobber -e robots=off -U Mozilla

Problem #4:   Using a customer’s search engine

A scan of the network revealed a search engine that searched files in its domain. I wanted to make sure that I had included these files in the audit.

I tried to search for meta-characters like ‘.’ , but the web server complained. Instead, I searched for ‘e’ – the most common letter, and it gave me the largest number of hits – 20 pages long.  I examined the URL for page 1, page 2, etc. and noticed that they were identical except for the value “jump=10”, “jump=20”, etc. I wrote a script that would extract all of the URL’s the search engine reported:


for i in $(seq 0 10 200)
    wget --force-html -r -l2 "$URL" 2>&1  |  grep '^--' | \
    grep -v '(try:' | awk '{ print $3 }'  | \
    grep -v '\.\(png\|gif\|jpg\)$' | sed 's:?.*$::'

It’s ugly, and calls extra processes. I could  write a sed or awk script that replaces five processes with one, but the script would be more complicated and harder to understand to my readers. Also – this was a “throw-away” script. It took me 30 seconds to write it, and the limited factor was network bandwidth. There is always a proper balance between readability, maintainability, time to develop, and time to execute. Is this code consuming excessive CPU cycles? No. Did it allow me to get it working quickly so I can spend time doing something else more productive? Yes.

Problem #5:  wget isn’t consistent

Before I mentioned that I wasn’t happy with wget. That’s because I was not getting consistent results. I ended up repeating the scan of the same server from a different network, and I got different URL’s. I checked, and the second scan found URL’s that the first one missed. I did the best I could to get as many files as possible. I ended up writing some scripts to keep track of the files I scanned before. But that’s another post.

Scanning PDF’s, Word and Excel files.

Now that I had a clone of several websites, I had to scan them for sensitive information. But I have to convert some binary files into ASCII.


Scanning Excel files

I installed gnumeric, and used the program ssconvert to convert the Excel file into text files. I used:

find . -name '*.xls' -o -name '*.xlsx' | \
while IFS= read file; do ssconvert -S "$file" "$file.%s.csv";done

Converting Microsoft Word files into ASCII

I used the following script to convert word files into ASCII

find . -name '*.do[ct]x' -o -name '*. | \
while IFS= read file; do unzip -p "$file" word/document.xml | \
sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' >"$file.txt";done

Potential Problems with converting PDF files

Here are some of the potential problems I expected to face

  1. I didn’t really trust any of the tools. If I knew they were perfect, and I had a lot of experience, I could just pick the best one. But I wasn’t confident, so I did not rely on a single tool.
  2. Some of the tools crashed when I used them. See #1 above.
  3. The PDF to text tools generated different results. Also see #1 above.
  4. PDF files are large. Some were more than 1000 pages long.
  5. It takes a lot of time to convert some of the PDF’s into text files. I really needed a server-class machine, and I was limited to a laptop. If the conversion program crashed when it was 90% through, people would notice my vocabulary in the office.
  6. Some of the PDF files were created by scanning paper documents. A PDF-to-text file would not see patterns unless it had some sort of OCR built-in.

Having said that, this is what I did.

How to Convert Acrobat/PDF files into ASCII

This process is not something that can be automated easily. Some of the times when I converted PDF files into text files, the process either aborted, or went into a CPU frenzy, and I had to abort the file conversion.

Also – there are several different ways to convert a PDF file into text. Because I wanted to minimize the risk of missing some information, I used multiple programs to convert PDF files. If one program broke, the other one might cach something.

The tools I used included

  • pdftotext – part of poppler-utils
  • pdf2txt – part of python-pdfminer

Other useful programs were exiftool and peepdf and Didier Steven’s pdf-tools. I also used pdfgrep, but I had to download the latest source, and then compile it with the perl PCRE library.


ConvertPDF – a script to convert PDF into text

I wrote a script that takes each of the PDF files and converts them into text. I decided to use the following convention:

  • *.pdf..txt – output of the pdf2txt file
  • *.pdf.text – output of the pdftotext file

As the conversion of each file takes time, I used a mechanism to see if the output file exists. If it does, I can skip this step.

I also created some additional files naming conventions

  • *.pdf.txt.err – errors from the pdf2txt program
  • *.pdf.txt.time – output of time(1) when running the pdf2txt program
  • *.pdf.text.err – errors from the pdftotext program
  • *.pdf.text.time – output of time(1) when running the pdftotext program

This is useful because if any of the files generate an error, I can use ‘ls -s *.err|sort -nr’ to identify both the program and the input file that had the problem.

The *.time files could be used to see how long it took to run the conversion. The first time I tried this, my script ran all night, and did not complete. I didn’t know if one of the programs  was stuck in an infinite loop or not. This file allows me to keep track of this information.

I used three helper functions in this script. The “X” function lets me easily change the script to show me what it would do, without doing anything. Also – it made it easier to capture STDERR and the timing information. I called it ConvertPDF

# Usage
#    ConvertPDF filename
FNAME="${1?'Missing filename'}"

# Debug command - do I echo it, execute it, or both?
X() {
# echo "$@" >&2
 /usr/bin/time -o "$OUT.time" "$@" 2> "$OUT.err"

 if [ ! -f "$OUT" ]
     X pdf2txt -o "$OUT" "$IN"

 if [ ! -f "$OUT" ]
     X pdftotext "$IN" "$OUT"
if [ ! -f "$FNAME" ]
 echo missing input file "$FNAME"
 exit 1
echo "$FNAME" >&2 # Output filename to STDERR

Once this script is created, I called it using

find . -name '*.[pP][dD][fF]' | while IFS= read file; do ConvertPDF "$file"; done

Please note that this script  can be repeated. If the conversion previously occurred, it would not repeat it. That is, if the output files already existed, it would skip that conversion.

As I’ve done it often in the past, I used a handy function above called “X” for eXecute. It just executes a command, but it captures any error message, and it also captures the elapsed time. If I move/add/replace the “#” character at the beginning of the line, I can make it just echo, and not execute anything. This makes it easy to debug without it executing anything.   This is Very Useful.


Some of the file conversion process took hours. I could kill these processes. Because I captured the error messages, I could also search them to identify bad conversions, and delete the output files, and try again. And again.

Optimizing the process

Because some of the PDF files are so large, and the process wasn’t refined, I wanted to be more productive, and work on the smallest files first, where I defined smallest by “fewest number of pages”. Finding scripting bugs quickly was desirable.

I used exiftool to examine the PDF metadata.  A snippet of the  output of “exiftool file.pdf” might contain:

ExifTool Version Number : 9.74
File Name : file.pdf
Producer : Adobe PDF Library 9.0
Page Layout : OneColumn
Page Count : 84

As you can see, the page count is available in the meta-data. We can extract this and use it.

Sorting PDF files by page count

I sorted the PDF files by page count using

for i in *.pdf
  NumPages=$(exiftool "$i" | sed -n '/Page Count/ s/Page Count *: *//p')
  printf "%d %s\n" "$NumPages" "$i"
done | sort -n | awk '{print $2}' >pdfSmallestFirst

I used sed to search for ‘Page Count’ and then only print the number after the colon. I then output two columns of information: page count and filename. I sorted by the first column (number of pages) and then printed out the filenames only. I could use that file as input to the next steps.

Searching for credit card numbers, social security numbers, and bank accounts.

If you have been following me, at this point I have directories that contain

  • ASCII based files (.htm, .html, *css, *js, etc.)
  • Excel files converted into ASCII
  • Microsoft Word files converted into ASCII
  • PDF files converted into ASCII.

So it’s a simple matter of using grap to find files.  My tutorial on Regular Expressions is here if you have some questions    Here is what I used to search the files

find dir1 dir2...  -type f -print0| \
xargs -0 grep -i -P '\b\d\d\d-\d\d-\d\d\d\d\b|\b\d\d\d\d-\d\d\d\d-\d\d\d\d-\d\d\d\d\b|\b\d\d\d\d-\d\d\d\d\d\d-\d\d\d\d\d\b|account number|account #'

The regular expressions I used are perl-compatible. See pcre(3) and PCREPATTERN(3) manual pages. The special characters are
\d – a digit
\b – a boundary – either a character, end of line, beginning of line, etc. – This prevents 1111-11-1111 from matching a SSN.

This matches the following patterns
\d\d\d-\d\d-\d\d\d\d – SSN
\d\d\d\d-\d\d\d\d-\d\d\d\d-\d\d\d\d – Credit card number
\d\d\d\d-\d\d\d\d\d\d-\d\d\d\d\d – AMEX credit card

There were some more things I did, but this is a summary
It should be enough to allow someone to replicate the task

Lessons learned

  • pdf2txt is sloooow
  • Your tools aren’t perfect. You can’t assume a single tool will find everything. Plan for failures and backup plans.
  • Look for ways to make your work more productive, e.g. find errors faster. You don’t want to wait 30 minutes to discover a coding error that will cause you to redo the operation. If you can find the error in 5 minutes you have saved 25 minutes.
  • Keep your shell scripts out of the directory containing the files. I downloaded more than 20000 files, and it became difficult to keep track of the names and jobs of the small scripts I was using, and the temporary files they created.
  • Consider using a Makefile to keep track of your actions. It’s a great way to document and reuse various scripts. I’ll write a blog on that later.
  • Watch out for duplicate names/URLs.
  • You have to remember that when you find a match in a file, you have to find the URL that corresponds to it. So consider your naming conventions.
  • Be careful of assumptions. Not all credit cards use the xxxx-xxxx-xxxx-xxxx format. Amex uses xxxx-xxxxxx-xxxxx


Have fun


This entry was posted in Linux, Security, Shell Scripting and tagged , , , , , , , , , , , , , , , , , . Bookmark the permalink.

1 Response to Scanning for confidential information on external web servers

  1. N M says:

    Great article and I appreciate how difficult getting a system like this running. I’d really like to see how you use *nix Make to organize the code for a project like this. I have run into similar code management issues, and just struggled thru. Thanks

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.