Use PHP Webbot to Complete GET-Method Forms

A simple webbot can submit GET-based web forms with just a few lines of code, which makes it possible to automate some frequently repeated development or business tasks.

Hypertext Transfer Protocol, or HTTP, is the web’s basic method of data communication, and it supports several request methods, including GET. The GET method is often used with web forms and passes form parameters in the open as part of the page URL.

At the time of writing, the main search form on Amazon’s website was an example of a GET-based form. Looking at the site’s HTML markup, the associated form element clearly states its method.

<form action='/s/ref=nb_sb_noss' method='get' name='site-search' class='nav-searchbar-inner' >

The popular Amazon online store uses the GET method for its site search form.

Dissect the Form

When trying to automate form submission, this sort of markup is just the place to start. Not only does it explicitly state that the server is expecting information via GET, it can provide other insight about the key and value pairs the server requires.

In his book, _Webbots, Spiders, and Screen Scrapers, a Guide to Developing Internet Agents with PHP/cURL_, author and web developer Michael Schrenk said that “it’s important to submit forms exactly as the webserver expects them to be submitted, or the server may generate errors in its log files.”

These errors can cause two sorts of problems. First, the webbot may not return the desired information or achieve the expected goal. Second, malformed requests may interfere with the site’s normal operation, which a webbot designer wants to avoid.

Looking carefully at a form’s structure in the markup makes it possible to understand what sort of information the server is expecting from the form and how that information should be formatted.

Here is the complete markup for the site search form from Amazon’s home page. The code sample was taken on September 4, 2012.

<div>
	<form action='/s/ref=nb_sb_noss' method='get' name='site-search' class='nav-searchbar-inner'>   
            	<span id='nav-search-in' class='nav-sprite'>
              		<span id='nav-search-in-content' data-value="search-alias=aps">
All</span>
              		<span class='nav-down-arrow nav-sprite'></span>
             		 <select name="url" id="searchDropdownBox" class="searchSelect" title="Search in"   >
				<option value="search-alias=aps" selected="selected">All Departments</option>
				<option value="search-alias=instant-video">Amazon Instant Video</option>
				<option value="search-alias=appliances">Appliances</option>
				<option value="search-alias=mobile-apps">Apps for Android
</option>
				<option value="search-alias=arts-crafts">Arts, Crafts & Sewing</option>
				<option value="search-alias=automotive">Automotive</option>
				<option value="search-alias=baby-products">Baby</option>
				<option value="search-alias=beauty">Beauty</option>
				<option value="search-alias=stripbooks">Books</option>
				<option value="search-alias=mobile">Cell Phones & Accessories</option>
				<option value="search-alias=apparel">Clothing & Accessories</option>
				<option value="search-alias=collectibles">Collectibles</option>
				<option value="search-alias=computers">Computers</option>
				<option value="search-alias=electronics">Electronics</option>
				<option value="search-alias=gift-cards">Gift Cards Store</option>
				<option value="search-alias=grocery">Grocery & Gourmet Food</option>
				<option value="search-alias=hpc">Health & Personal Care</option>
				<option value="search-alias=garden">Home & Kitchen</option>
				<option value="search-alias=industrial">Industrial & Scientific</option>
				<option value="search-alias=jewelry">Jewelry</option>
				<option value="search-alias=digital-text">Kindle Store</option>
				<option value="search-alias=magazines">Magazine Subscriptions</option>
				<option value="search-alias=movies-tv">Movies & TV</option>
				<option value="search-alias=digital-music">MP3 Downloads</option>
				<option value="search-alias=popular">Music</option>
				<option value="search-alias=mi">Musical Instruments</option>
				<option value="search-alias=office-products">Office Products</option>
				<option value="search-alias=lawngarden">Patio, Lawn & Garden</option>
				<option value="search-alias=pets">Pet Supplies</option>
				<option value="search-alias=shoes">Shoes</option>
				<option value="search-alias=software">Software</option>
				<option value="search-alias=sporting">Sports & Outdoors</option>
				<option value="search-alias=tools">Tools & Home Improvement</option>
				<option value="search-alias=toys-and-games">Toys & Games</option>
				<option value="search-alias=videogames">Video Games</option>
				<option value="search-alias=watches">Watches</option></select>
          		</span>

            <div class='nav-searchfield-outer nav-sprite'>
              <div class='nav-searchfield-inner nav-sprite'>
                <div class='nav-searchfield-width'>
                  <div id='nav-iss-attach'>
                    <input type='text' id='twotabsearchtextbox' title='Search For' value='' name='field-keywords' autocomplete='off'>
                  </div>
                </div>
                <!--[if IE ]><div class='nav-ie-min-width' style='width: 360px'></div><![endif]-->
              </div>
            </div>

            <div class='nav-submit-button nav-sprite'>
              <input type='submit' value='Go' class='nav-submit-input' title='Go'>
            </div>

          </form>
        </div>

This markup acts like a specification for what the server is looking for when this form is submitted.
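
As a rough illustration of reading that specification programmatically, the sketch below uses PHP’s DOM extension to pull the action, method, and field names out of a form. The markup is a trimmed, hypothetical stand-in for the Amazon form above, and the describe_form() function is an illustrative name rather than part of any library. It assumes the DOM extension is installed.

```php
<?php
// Trimmed, hypothetical stand-in for the search form markup shown above.
$html = <<<'HTML'
<form action='/s/ref=nb_sb_noss' method='get' name='site-search'>
  <select name="url">
    <option value="search-alias=aps" selected="selected">All Departments</option>
    <option value="search-alias=stripbooks">Books</option>
  </select>
  <input type='text' name='field-keywords' value=''>
  <input type='submit' value='Go'>
</form>
HTML;

// Illustrative helper: returns the action, method, and named fields
// of the first form found in the given HTML.
function describe_form(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return [];
    }

    $fields = [];
    foreach (['select', 'input'] as $tag) {
        foreach ($form->getElementsByTagName($tag) as $element) {
            $name = $element->getAttribute('name');
            // Controls without a name (such as the submit button here)
            // are not sent with the request.
            if ($name !== '') {
                $fields[] = $name;
            }
        }
    }

    return [
        'action' => $form->getAttribute('action'),
        'method' => strtolower($form->getAttribute('method')),
        'fields' => $fields,
    ];
}

$form = describe_form($html);
print_r($form);
```

For this sample, the fields found are “url” and “field-keywords,” which are the two keys used in the walkthrough that follows.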

As an example, the form’s action describes the path to the script meant to process the form. In this case that script is located at /s/ref=nb_sb_noss.

<form action='/s/ref=nb_sb_noss' method='get' name='site-search' class='nav-searchbar-inner'> 

The method tells us that the server is expecting GET, so the request URL is built from the host, “www.amazon.com,” followed by the path to the script. Note that the URLs in this article use the path “/s/ref=nb_sb_noss_1,” which differs slightly from the “/s/ref=nb_sb_noss” shown in the form’s action attribute.

http://www.amazon.com/s/ref=nb_sb_noss_1

The first form input is a select with the name “url.” This will be the first key in a series of key and value pairs.

<select name="url" id="searchDropdownBox" class="searchSelect" title="Search in"   >

In a GET request, the query string begins with a “?” so the request would now look something like the following.

http://www.amazon.com/s/ref=nb_sb_noss_1?url=

The value for the “url” key should match the value set for one of the options.

For example, if one wanted to search all of Amazon, this value would be “search-alias=aps.” Notice that this matches the value field for the option shown below.

<option value="search-alias=aps" selected="selected">All Departments</option>

If the search were to be limited to books, the desired value would be “search-alias=stripbooks.”

<option value="search-alias=stripbooks">Books</option>

In GET requests, values from options are URL encoded (percent-encoded). In this example, the equals sign (=) would be replaced with the code “%3D” so that the resulting value would be, using the All Departments example, “search-alias%3Daps”.

As a result, a GET request would look like the following.

http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps
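
The encoding step can be checked with PHP’s built-in urlencode() and urldecode() functions. This is only a small sketch; the variable names are chosen for illustration.

```php
<?php
// The raw option value, exactly as it appears in the form markup.
$value = 'search-alias=aps';

// urlencode() percent-encodes the equals sign and leaves the hyphen alone.
$encoded = urlencode($value);

// Append the encoded value to the request built so far.
$url = 'http://www.amazon.com/s/ref=nb_sb_noss_1?url=' . $encoded;
echo $url, PHP_EOL;

// urldecode() reverses the encoding, which is handy for checking the result.
$decoded = urldecode($encoded);
```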

With GET, additional key and value pairs are joined to the previous ones with an ampersand (&). The Amazon search form has another input field with the name “field-keywords.”

<input type='text' id='twotabsearchtextbox' title='Search For' value='' name='field-keywords' autocomplete='off'>

This name is the next key to be added to the GET request.

http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=

The final value is a text string as the input field’s type indicates. For the example, the search could be for “webbot.”

http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=webbot
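
One way to double-check a hand-built request is to take it apart again. The following sketch uses PHP’s parse_url() to split the URL above into its parts and then decodes each key and value pair. The splitting loop is a simple illustration and assumes every pair contains an equals sign.

```php
<?php
$url = 'http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=webbot';

// parse_url() separates the host, the path, and the query string.
$parts = parse_url($url);

// Split the query string on "&", then split each pair on the first "=".
$pairs = [];
foreach (explode('&', $parts['query']) as $pair) {
    list($key, $value) = explode('=', $pair, 2);
    $pairs[urldecode($key)] = urldecode($value);
}

print_r($pairs);
```

The decoded pairs should match the names and values taken from the form markup.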

In typical GET fashion, a multi-word query would use plus signs (+) rather than spaces, so a search for “making webbots” would result in the following GET request. The PHP function urlencode() performs this substitution programmatically.

http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=making+webbots
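
As an alternative to encoding each value by hand, PHP’s http_build_query() can assemble the whole query string from an array of key and value pairs. This is a sketch of that approach, not something the original form requires; the $base and $params names are illustrative.

```php
<?php
$base = 'http://www.amazon.com/s/ref=nb_sb_noss_1';

// Keys and raw values, exactly as they appear in the form markup.
$params = [
    'url'            => 'search-alias=aps',
    'field-keywords' => 'making webbots',
];

// http_build_query() encodes each key and value (spaces become plus signs
// by default) and joins the pairs with the separator given as the third
// argument.
$url = $base . '?' . http_build_query($params, '', '&');

echo $url, PHP_EOL;
```

Passing the separator explicitly avoids depending on the server’s arg_separator.output setting.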

As a test, one could copy the GET request URL and paste it into a browser’s address bar. Submitting it should return an Amazon search results page.

The GET request successfully generates results in a browser.

Automating the Process

If one stopped with just the GET request URL, the entire process would be something of a waste, since it would have been a lot simpler to just use the actual Amazon web form.

With the form dissected and a proper request built, the process can now be automated with a webbot.

As an example, imagine that Amazon were a competitor, and a merchant wanted to monitor which books Amazon carried for certain topics, such as webbots, process automation, and web development.

A webbot (script) could be built to search Amazon for three, thirty, or three hundred terms each day, parse the resulting web pages, and store the data, so that trends in available products or prices could be monitored over time.

Similarly, a merchant could monitor dozens of competitors in this way.

Without going into the details of how to actually parse the web page content, let’s describe what a very basic, search-form-completing bot might look like.

First, there would be a source of search queries. For this example, an array is used.

<?php
$keywords = array('webbots', 'process automation', 'web development');

A function would be written to retrieve the search results page. This function could use cURL.

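function get_search($url){
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	// Return the response as a string instead of printing it.
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$html = curl_exec($ch);
	curl_close($ch);

	// curl_exec() returns false when the request fails.
	if ($html === false) {
		return false;
	}

	// code to parse and store the results page goes here.

	return $html;
}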

Finally, a foreach statement would call the get_search function for every search query.

foreach( $keywords as $keyword){
	$keyword = urlencode($keyword);
	$url = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=" . $keyword;
	get_search($url);
}
	
?>
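
The loop above could also be arranged so that the URL building and the fetching are separate pieces, which makes the bot easier to test without touching the network. The sketch below is one possible arrangement. The function names build_search_url() and run_searches(), the $fetch callable, and the optional pause between requests are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: builds the search URL for one keyword.
function build_search_url(string $keyword, string $alias = 'search-alias=aps'): string
{
    $query = http_build_query(
        ['url' => $alias, 'field-keywords' => $keyword],
        '',
        '&'
    );
    return 'http://www.amazon.com/s/ref=nb_sb_noss_1?' . $query;
}

// Hypothetical runner: $fetch is any callable that accepts a URL and
// returns the page content for it.
function run_searches(array $keywords, callable $fetch, int $delaySeconds = 0): array
{
    $pages = [];
    $first = true;
    foreach ($keywords as $keyword) {
        // Optionally pause between requests to avoid hammering the server.
        if (!$first && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
        $first = false;
        $pages[$keyword] = $fetch(build_search_url($keyword));
    }
    return $pages;
}
```

A call such as run_searches($keywords, 'get_search', 2) would use the cURL function from this article with a two-second pause between requests, provided get_search() returns the fetched HTML.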

Apart from the code for parsing the HTML page, this webbot is complete: it will automatically search for each query on the Amazon site.

Summing Up

With an understanding of what kind of request a server is looking for and relatively few lines of code, it is possible to build a PHP-based webbot that will automatically complete GET-method web forms.

The same principles used here could be applied to automating all sorts of online form activity.

