Writing a script to check URLs for a 404 Status Code

This idea was given to me by a friend to see how I would check any number of URLs for Status Codes.  The requirements were pretty straightforward:
  1.    Must send an email with results that is nicely formatted
  2.    Must print out how many URLs it will be checking
  3.    Must be able to comment out specific URLs you don’t want to check
  4.    Must be able to handle extra spaces in the URL file
  5.    Must print out the number of URLs that return an HTTP 404 Status
  6.    Must have basic error handling

I am going to break this down so you can follow my thought process.  I try to get the most basic thing working first, which in this case was getting the email to send.  The link below shows how to send internet mail using SMTP, and my test code follows:

  require 'net/smtp'

  # <<~ strips the leading indentation, so the headers start in column 0
  email_message = <<~MESSAGE_END
    From: Your Name <your@mail.address>
    To: Destination Address <someone@example.com>
    Subject: TEST

    Body of email goes here -- Hello World!
  MESSAGE_END

  smtp = Net::SMTP.new('your.smtp.server', 25)
  smtp.enable_starttls
  # start takes the HELO domain, then the account, password and auth type
  smtp.start('your.smtp.server', 'your@mail.address', 'Your Password', :login)
  smtp.send_message email_message, 'your@mail.address', 'someone@example.com'
  smtp.finish

You will need to fill in your own SMTP server, email addresses and password.  Now you can run the Ruby file in the terminal and get an automated test email.  That was pretty easy thanks to the wonderful Ruby documentation.

For the next part you will need to create a text file with some test URLs.  For now, all you need to have are some that are working and some that are commented out.  An example would be:
  www.google.com
  http://www.yahoo.com  
  #www.espn.com
  #http://nfl.com
As you can see, we have two valid ones and two commented ones.  Notice that both valid ones work, but only one has the full URL starting with “http://”.  There is a reason for this and I will explain that a little later.

Now that we have our text file created, we can see how many URLs from the text file will be tested.  While writing this code, we also have to handle the commented URLs and the blank lines to get an accurate number.  I open the file, read through every line, and sort each URL into one of two arrays: urls_found or urls_ignored.  The ignored ones are the ones commented out.  Here is that code with a printout showing how many URLs will be tested:
  urls_found = []
  urls_ignored = []

  file = File.open("/Path/to/txt/file", "r")
  file.each do |line|
    line = line.strip                  # drop surrounding whitespace and the newline
    next if line.empty?                # skip blank lines
    if line.match(/^#/)
      urls_ignored.push(line)          # commented-out URL
    else
      # prepend http:// when the URL has no scheme yet
      line = "http://#{line}" unless line.match(/^https?:\/\//)
      urls_found.push(line)
    end
  end
  file.close
  puts "URLs to be tested: #{urls_found.length}"

This code is clean and easy to read.  It opens the file, loops over every line, strips out the whitespace and skips blank lines.  It adds http:// to any uncommented URL that doesn’t have it (this is why our text file has some URLs with http:// and some without).  Commented lines go into the urls_ignored array and all others go into the urls_found array.  Then we simply print out how many were in the urls_found array to see how many will be tested.
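
The same sorting logic can be tried without a file on disk.  The sketch below is only an illustration: sort_urls is a hypothetical helper name, not part of the original script, and it assumes URLs that already start with https:// should be left alone as well.

```ruby
# Hypothetical helper: splits raw lines into URLs to test and URLs to ignore.
def sort_urls(lines)
  urls_found = []
  urls_ignored = []

  lines.each do |raw|
    line = raw.strip
    next if line.empty?                       # blank lines are not counted

    if line.start_with?("#")
      urls_ignored.push(line)                 # commented-out URL
    else
      # assumption: both http:// and https:// count as "already has a scheme"
      line = "http://#{line}" unless line.match?(%r{\Ahttps?://})
      urls_found.push(line)
    end
  end

  [urls_found, urls_ignored]
end

# Sample lines mirroring the example text file, plus one blank line
sample = ["www.google.com\n", "  http://www.yahoo.com  \n", "\n",
          "#www.espn.com\n", "#http://nfl.com\n"]
found, ignored = sort_urls(sample)
puts "URLs to be tested: #{found.length}"
puts "URLs ignored: #{ignored.length}"
```

Passing in an array of lines instead of a file makes it easy to check the blank-line and comment handling by hand before pointing the script at a real list.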

The last bit of code separates out the 404 Status Codes and the invalid URLs, and records the result for every URL.  You will need to add “require 'net/http'” to test URLs.  The link below shows how to use Net::HTTP.  Let’s take a look at the code:


  require 'net/http'

  status_code_404 = []
  result = []
  invalid_urls = []

  urls_found.each do |url|
    begin
      res = Net::HTTP.get_response(URI(url))
      # res.code is a String, so compare against "404" rather than 404
      if res.code == "404"
        status_code_404.push(res)
      end
      result.push("#{url} returns: #{res.code}, #{res.message}")
    rescue StandardError
      # malformed URLs, DNS failures and timeouts all end up here
      result.push("#{url} returns: Error occurred - please check your URL.")
      invalid_urls.push(url)
    end
    print "* "   # progress marker, one per URL
  end

  puts "\nTotal # of 404s: #{status_code_404.length}"
  puts "Total # of Ignored URLs: #{urls_ignored.length}"
  puts "Total # of Invalid URLs: #{invalid_urls.length}"
Let’s break down this block of code.  First we loop over the urls_found array from earlier in the code.  For each URL, if the Status Code returned is “404”, the response is added to the status_code_404 array.  Every response, 404 or not, is also recorded in the result array.  If the request raises an exception, an “Error occurred” line goes into the result array and the URL goes into the invalid_urls array.  The begin/rescue block handles the exception to make sure invalid URLs are handled properly.  Then we simply print the totals for 404s, ignored URLs and invalid URLs.
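
To try the tallying logic without hitting the network, the loop can be wrapped in a method that accepts the lookup as a parameter.  This is a sketch under my own assumptions, not the finished script: check_urls, the fetch keyword and the stand-in response struct are all hypothetical names introduced for illustration.

```ruby
require 'net/http'

# Hypothetical helper: tallies results for a list of URLs.
# `fetch` is any callable that takes a URL string and returns an object
# responding to #code and #message; by default it performs a real request.
def check_urls(urls, fetch: ->(url) { Net::HTTP.get_response(URI(url)) })
  summary = { results: [], not_found: [], invalid: [] }

  urls.each do |url|
    begin
      res = fetch.call(url)
      summary[:not_found].push(url) if res.code == "404"
      summary[:results].push("#{url} returns: #{res.code}, #{res.message}")
    rescue StandardError
      summary[:results].push("#{url} returns: Error occurred - please check your URL.")
      summary[:invalid].push(url)
    end
  end

  summary
end

# Stand-in responses so the example runs without network access
fake_response = Struct.new(:code, :message)
fake_fetch = lambda do |url|
  raise SocketError, "unknown host" if url.include?("invalid")
  if url.end_with?("/missing")
    fake_response.new("404", "Not Found")
  else
    fake_response.new("200", "OK")
  end
end

summary = check_urls(
  ["http://example.com", "http://example.com/missing", "http://invalid.example"],
  fetch: fake_fetch
)
puts "Total # of 404s: #{summary[:not_found].length}"
puts "Total # of Invalid URLs: #{summary[:invalid].length}"
```

Leaving out the fetch argument makes the method perform real requests with Net::HTTP.get_response, the same call used in the code above.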

The code is working properly and will handle any number of URLs in the text file.  The last step to follow up on is to format the email with all the results.
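
One possible way to do that formatting is sketched below.  It is not the finished product: build_email and its keyword arguments are hypothetical names, the addresses are placeholders, and the layout is just one guess at “nicely formatted”.  The returned string has the same shape as the test message from the first step, so it could be handed to smtp.send_message.

```ruby
# Hypothetical helper: builds the full message string (headers, blank line, body).
def build_email(from:, to:, subject:, results:, not_found_count:)
  rows = results.map { |r| "  - #{r}" }.join("\n")
  <<~MESSAGE_END
    From: #{from}
    To: #{to}
    Subject: #{subject}

    URLs tested: #{results.length}
    URLs returning 404: #{not_found_count}

    Results:
    #{rows}
  MESSAGE_END
end

# Placeholder data standing in for the result array built earlier
sample_results = [
  "http://www.google.com returns: 200, OK",
  "http://example.com/missing returns: 404, Not Found"
]
puts build_email(
  from: "your@mail.address",
  to: "someone@example.com",
  subject: "URL check results",
  results: sample_results,
  not_found_count: 1
)
```

The blank line after the Subject header matters: it is what separates the headers from the body of the message.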

If you would like to see the finished product, you can click the link to my GitHub account.
