PHP: Extract Outgoing URLs from a Web Page

In PHP, you can download a web page using file_get_contents or curl. Once you have downloaded a web page, you can process it.

A hyperlink has the following tag structure

<a href="http://www.example.com">Example</a>

Keeping this in mind, we write the following program

<?php

function extractElementsFromWebPage($webPage, $tagName) {
  // Creating a DOMDocument object.
  $dom = new DOMDocument;

  // Parsing the HTML from the web page. Real-world pages are rarely
  // valid HTML, so the parser warnings are suppressed with @.
  if (@$dom->loadHTML($webPage)) {
    // Extracting the specified elements from the web page
    return $dom->getElementsByTagName($tagName);
  }
  return FALSE;
}

function downloadURL($URL) {
  $webPage = file_get_contents ($URL);
  return $webPage;
}

$webPage = downloadURL("http://www.mozilla.org/");
if ($webPage ) {
  $URLs = extractElementsFromWebPage($webPage, 'a');
  if ($URLs) {
    foreach ($URLs as $URL){
      // Extracting the URLs
      echo $URL->getAttribute('href'), "\n";
    }
  }
  else {
    echo "Error in parsing the webPage\n";
  }
}
else {
  echo "Error in downloading the webPage\n";
}
?>

A few points about the program:

First, file_get_contents downloads the web page; then the DOMDocument class parses the HTML. The work is split between two functions

  1. downloadURL
  2. extractElementsFromWebPage

downloadURL uses file_get_contents to download the web page, and extractElementsFromWebPage uses the DOMDocument class. The loadHTML method parses the HTML page, and getElementsByTagName extracts the specified elements. In our case, we want to extract the a (anchor) elements.

Executing the program prints the href of each link

$ php extractURLs.php 
#main
/
/about/
/community/
/projects/
/contribute/
/about/mission.html
http://www.mozilla.com/firefox/
http://www.mozilla.com/mobile/download/
...
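
The output above mixes fragment links (#main), relative paths (/about/) and absolute URLs. If only links with an explicit http or https scheme are wanted, the list can be filtered with parse_url. The following is a minimal sketch under that assumption; the function name extractAbsoluteLinks and the sample markup are hypothetical, and libxml_use_internal_errors is used here to keep parser warnings off the console.

```php
<?php

// Returns only the links that carry an explicit http or https scheme.
function extractAbsoluteLinks(string $html): array {
    $dom = new DOMDocument;

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    if (!$loaded) {
        return $links;
    }

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // parse_url returns null when the URL has no scheme component.
        $scheme = parse_url($href, PHP_URL_SCHEME);
        if (is_string($scheme)
            && in_array(strtolower($scheme), ['http', 'https'], true)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical sample markup, modelled on the kinds of links shown above.
$sample = '<html><body>'
        . '<a href="#main">Skip to content</a>'
        . '<a href="/about/">About</a>'
        . '<a href="http://www.mozilla.com/firefox/">Firefox</a>'
        . '</body></html>';

foreach (extractAbsoluteLinks($sample) as $link) {
    echo $link, "\n";
}
```

This sketch does not compare host names, so an absolute link back to the same site is still reported.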

PHP: Extract Image URLs from a Web Page

In PHP, you can download a web page using file_get_contents or curl. Once you have downloaded a web page, you can process it. We want to extract the image URLs from a web page.

An image tag has the following structure

<img src="image.gif" alt="Image Description" />

Keeping this in mind, we write the following program

<?php

function extractElementsFromWebPage($webPage, $tagName) {
  // Creating a DOMDocument object.
  $dom = new DOMDocument;

  // Parsing the HTML from the web page. Real-world pages are rarely
  // valid HTML, so the parser warnings are suppressed with @.
  if (@$dom->loadHTML($webPage)) {
    // Extracting the specified elements from the web page
    return $dom->getElementsByTagName($tagName);
  }
  return FALSE;
}

function downloadURL($URL) {
  $webPage = file_get_contents ($URL);
  return $webPage;
}

$webPage = downloadURL("http://www.mozilla.org/");
if ($webPage ) {
  $images = extractElementsFromWebPage($webPage, 'img');
  if ($images) {
    foreach ($images as $image){
      // Extracting the image URLs
      echo $image->getAttribute('src'), "\n";
    }
  }
  else {
    echo "Error in parsing the webPage\n";
  }
}
else {
  echo "Error in downloading the webPage\n";
}
?>

A few points about the program:

First, file_get_contents downloads the web page; then the DOMDocument class parses the HTML. The work is split between two functions

  1. downloadURL
  2. extractElementsFromWebPage

downloadURL uses file_get_contents to download the web page, and extractElementsFromWebPage uses the DOMDocument class. The loadHTML method parses the HTML page, and getElementsByTagName extracts the specified elements. In our case, we want to extract the img elements.

Executing the program prints the src of each image

$ php extractImageURLs.php
/images/promos/join_promo_a.png
/images/template/screen/logo_footer.png
https://statse.webtrendslive.com/dcsis0ifv10000gg3ag82u4rf_7b1e/njs.gif?dcsuri=/nojavascript&WT.js=No&WT.tv=8.6.2
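
An img element usually carries an alt description next to its src, and some images have no src at all. As a rough sketch of how both attributes could be collected in one pass, the hypothetical function extractImages below skips elements without a src; the sample markup is made up for illustration.

```php
<?php

// Returns one ['src' => ..., 'alt' => ...] entry per <img> that has a src.
function extractImages(string $html): array {
    $dom = new DOMDocument;

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    if (!$loaded) {
        return $images;
    }

    foreach ($dom->getElementsByTagName('img') as $img) {
        if (!$img->hasAttribute('src')) {
            continue; // nothing to report for an image without a source
        }
        $images[] = [
            'src' => $img->getAttribute('src'),
            // getAttribute returns an empty string when alt is missing.
            'alt' => $img->getAttribute('alt'),
        ];
    }
    return $images;
}

// Hypothetical sample markup.
$sample = '<html><body>'
        . '<img src="image.gif" alt="Image Description" />'
        . '<img alt="image without a source" />'
        . '</body></html>';

foreach (extractImages($sample) as $image) {
    echo $image['src'], " (", $image['alt'], ")\n";
}
```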

PHP: Extract HTML Tags/Element from a Web Page

In PHP, you can download a web page using file_get_contents or curl. Once you have downloaded a web page, you can process it. For example, suppose we want to extract the image URLs from a web page.

An image tag has the following structure

<img src="image.gif" alt="Image Description" />

Keeping this in mind, we write the following program

<?php

function extractElementsFromWebPage($webPage, $tagName) {
  // Creating a DOMDocument object.
  $dom = new DOMDocument;

  // Parsing the HTML from the web page. Real-world pages are rarely
  // valid HTML, so the parser warnings are suppressed with @.
  if (@$dom->loadHTML($webPage)) {
    // Extracting the specified elements from the web page
    return $dom->getElementsByTagName($tagName);
  }
  return FALSE;
}

function downloadURL($URL) {
  $webPage = file_get_contents ($URL);
  return $webPage;
}

$webPage = downloadURL("http://www.mozilla.org/");
if ($webPage ) {
  $images = extractElementsFromWebPage($webPage, 'img');
  if ($images) {
    foreach ($images as $image){
      // Extracting the image URLs
      echo $image->getAttribute('src'), "\n";
    }
  }
  else {
    echo "Error in parsing the webPage\n";
  }
}
else {
  echo "Error in downloading the webPage\n";
}
?>

A few points about the program:

First, file_get_contents downloads the web page; then the DOMDocument class parses the HTML. The work is split between two functions

  1. downloadURL
  2. extractElementsFromWebPage

downloadURL uses file_get_contents to download the web page, and extractElementsFromWebPage uses the DOMDocument class. The loadHTML method parses the HTML page, and getElementsByTagName extracts the specified elements. Because the tag name is a parameter, the same function works for any element; in our case, we want to extract the img elements.

Executing the program prints the src of each image

$ php extractElements.php
/images/promos/join_promo_a.png
/images/template/screen/logo_footer.png
https://statse.webtrendslive.com/dcsis0ifv10000gg3ag82u4rf_7b1e/njs.gif?dcsuri=/nojavascript&WT.js=No&WT.tv=8.6.2
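
Since only the tag name and the attribute change between extracting links and extracting images, both can be passed as parameters. The sketch below shows one possible way to do that, using DOMXPath instead of getElementsByTagName so that elements lacking the attribute are skipped by the query itself. The function name extractAttribute and the sample markup are hypothetical.

```php
<?php

// Returns the value of $attribute for every <$tagName> element that has it.
function extractAttribute(string $html, string $tagName, string $attribute): array {
    // The HTML parser lower-cases element and attribute names.
    $tagName = strtolower($tagName);
    $attribute = strtolower($attribute);

    // Only plain names are accepted, because they are placed into an XPath query.
    $plainName = '/^[a-z][a-z0-9-]*$/';
    if (!preg_match($plainName, $tagName) || !preg_match($plainName, $attribute)) {
        return [];
    }

    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    // Select every matching element that actually carries the attribute.
    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query("//{$tagName}[@{$attribute}]") as $element) {
        $values[] = $element->getAttribute($attribute);
    }
    return $values;
}

// Hypothetical sample markup combining the two tag structures shown earlier.
$sample = '<html><body>'
        . '<a href="http://www.example.com">Example</a>'
        . '<img src="image.gif" alt="Image Description" />'
        . '<img alt="image without a source" />'
        . '</body></html>';

echo implode("\n", extractAttribute($sample, 'a', 'href')), "\n";
echo implode("\n", extractAttribute($sample, 'img', 'src')), "\n";
```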

PHP: Download Web Page using file_get_contents

You can use curl to download a webpage in PHP. It is also possible to download a web page using file_get_contents().

<?php

function downloadURL($URL) {
  $webpage = file_get_contents ($URL);
  return $webpage;
}

$webpage = downloadURL("http://www.mozilla.org/");
if ($webpage){
  echo $webpage;
}
else {
  echo "Error in downloading the webpage\n";
}
?>

$ php download.php
<html>
....
</body>
</html>

In the above example, we download the Mozilla home page. Let’s try to download a nonexistent web page

<?php

function downloadURL($URL) {
  $webpage = file_get_contents ($URL);
  return $webpage;
}

$webpage = downloadURL("http://www.mozilla.org/1");
if ($webpage){
  echo $webpage;
}
else {
  echo "Error in downloading the webpage\n";
}
?>

file_get_contents emits a warning and returns FALSE, so the program prints our error message

$ php download.php

Warning: file_get_contents(http://www.mozilla.org/1): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found
 in /home/user/Documents/Dropbox/Personal/Programs/downloadWebpage.php on line 4
Error in downloading the webpage
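
The warning in the output above comes from file_get_contents itself. One way to keep it off the console is to suppress it with @ and rely on the FALSE return value, reading the reason from error_get_last. The following is only a sketch of that idea: it returns null on failure (a choice made here, not part of the original program) and uses a hypothetical local path that does not exist in place of a live URL.

```php
<?php

// Returns the contents of $URL, or null when the download fails.
function downloadURL(string $URL): ?string {
    // @ suppresses the PHP warning; failure is detected via the return value.
    $webpage = @file_get_contents($URL);
    if ($webpage === false) {
        // error_get_last still reports the suppressed warning.
        $error = error_get_last();
        $reason = $error['message'] ?? 'unknown error';
        fwrite(STDERR, "Download failed: $reason\n");
        return null;
    }
    return $webpage;
}

// Hypothetical path that does not exist, standing in for a missing page.
$webpage = downloadURL(__DIR__ . '/no-such-page.html');
if ($webpage !== null) {
    echo $webpage;
}
else {
    echo "Error in downloading the webpage\n";
}
```

Comparing against false explicitly also avoids treating an empty but successfully downloaded page as an error.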

PHP: Google URL Shortener: expand

Now that you have successfully shortened a URL, it’s time to see how to expand a shortened URL. The following program shows how to expand a shortened URL without using a key.

<?php

/*Base URL of the Service*/
$BASEURL = "https://www.googleapis.com/urlshortener/";

/*Version of the service*/
$VERSION = "v1";

/*Service*/
$SERVICE = "url";

/*Type of the content, e.g. text/xml, application/json*/
$CONTENT_TYPE = "Content-Type: application/json";

/*API name*/
$REQUEST = "shortUrl";

function printDetails($hash) {
    foreach($hash as $key => $value){
        if (is_array($value)) {
            print "$key:\n";
            foreach($value as $val) {
                if(is_object($val) || is_array($val)){
                     printDetails($val);
                }
                else{
                     print "$val,";
                }
            }
            print "\n";
        }
        else if (!is_object($value)) {
            print "$key: $value\n";
        }
        else {
            print "$key:\n";
            printDetails($value);
        }
    }
}

function get_response($URL, $type) {

    if(!function_exists('curl_init')) {
        die ("Curl PHP package not installed\n");
    }

    /*Initializing CURL*/
    $curlHandle = curl_init();

    /*The URL to be downloaded is set*/
    curl_setopt($curlHandle, CURLOPT_URL, $URL);
    curl_setopt($curlHandle, CURLOPT_HEADER, false);
    curl_setopt($curlHandle, CURLOPT_HTTPHEADER, array($type));
    curl_setopt($curlHandle, CURLOPT_RETURNTRANSFER, 1);

    /*Now execute the CURL, download the URL specified*/
    $response = curl_exec($curlHandle);

    /*Return the response as it is, let the application process it accordingly*/
    return $response;
}

/*Specify the url to be expanded*/

if (sizeof($argv) != 2) {
  echo "Usage: $argv[0] url\n";
  exit;
}

$url = "$BASEURL$VERSION/$SERVICE";

print "URL: $url\n";
print "URL To be expanded: $argv[1]\n";

/*URL-encode the short URL so that it is safe to pass as a query parameter*/
$response = get_response($url."?$REQUEST=".urlencode($argv[1]), $CONTENT_TYPE);

/*Printing the response on to the console*/
/*curl_exec returns false on failure, so isset() alone would always succeed*/
if ($response !== false) {
   $dresponse = json_decode($response);
   if (json_last_error() == JSON_ERROR_NONE){
       if (isset($dresponse->{'longUrl'})) {
           print "The Expanded URL is: ". $dresponse->{'longUrl'}."\n";
           print "======================================================\n";
           print "More Details:\n";
           print "======================================================\n";
           printDetails($dresponse);
           print "======================================================\n";
       }
   }
}
else {
   print "Failed to get a response\n";
   exit(1);
}
?>

Let’s execute the above program with a shortened URL

$ php expand.php http://goo.gl/nhfUT

URL: https://www.googleapis.com/urlshortener/v1/url
URL To be expanded: http://goo.gl/nhfUT
The Expanded URL is: http://example.com/
======================================================
More Details:
======================================================
kind: urlshortener#url
id: http://goo.gl/nhfUT
longUrl: http://example.com/
status: OK
======================================================
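
The two steps that matter most in the program are building the request URL and reading longUrl out of the JSON reply. They can be tried without a network connection, as in the sketch below. The helper names buildExpandURL and extractLongUrl are hypothetical, and the sample JSON is reconstructed from the fields printed above rather than captured from the service.

```php
<?php

// Builds the request URL, URL-encoding the short URL as a query parameter.
function buildExpandURL(string $shortUrl): string {
    $base = "https://www.googleapis.com/urlshortener/v1/url";
    return $base . '?' . http_build_query(['shortUrl' => $shortUrl]);
}

// Returns the longUrl field of a JSON response, or null if it is missing.
function extractLongUrl(string $json): ?string {
    $decoded = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($decoded)) {
        return null;
    }
    $longUrl = $decoded['longUrl'] ?? null;
    return is_string($longUrl) ? $longUrl : null;
}

// Assumed response body, rebuilt from the fields shown in the output above.
$sampleResponse = '{"kind": "urlshortener#url", "id": "http://goo.gl/nhfUT", '
                . '"longUrl": "http://example.com/", "status": "OK"}';

echo buildExpandURL("http://goo.gl/nhfUT"), "\n";
echo extractLongUrl($sampleResponse), "\n";
```

Returning null for a malformed or incomplete reply lets the caller report the failure instead of printing nothing.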

Working with GoogleCL

GoogleCL provides access to a number of services from the command line.

The available services are

  1. picasa
  2. blogger
  3. youtube
  4. docs
  5. contacts
  6. calendar

To work with any of these, let’s first see how GoogleCL works.

After installing GoogleCL, enter

$ google

>

You can now type commands at the > prompt. Let’s start with help

$ google
> help
Welcome to the Google CL tool!
 Commands are broken into several parts:
 service, task, options, and arguments.
 For example, in the command
 "> picasa post --title "My Cat Photos" photos/cats/*"
 the service is "picasa", the task is "post", the single
 option is a title of "My Cat Photos", and the argument is the
 path to the photos.

 The available services are
'picasa', 'blogger', 'youtube', 'docs', 'contacts', 'calendar'
 Enter "> help <service>" for more information on a service.
 Or, just "quit" to quit.
>

To see what a service can do, type help followed by the service name

> help picasa
Available tasks for service picasa: 'get', 'create', 'list', 'list-albums', 'tag', 'post', 'delete'
 get: Download photos
 Requires: none Optional: title, query Arguments: LOCATION

 create: Create an album
 Requires: title Optional: date, summary, tags Arguments: PATH_TO_PHOTOS

 list: List photos
 Requires: delimiter Optional: title, query

 list-albums: List albums
 Requires: delimiter Optional: title

 tag: Tag photos
 Requires: tags AND (title OR query)

 post: Post photos to an album
 Requires: title Optional: tags Arguments: PATH_TO_PHOTOS

 delete: Delete photos or albums
 Requires: (title OR query)

As you can see, picasa offers a number of tasks, such as create, list and delete

> picasa list

The above command will display all the URLs of your photographs


> picasa list-albums

This will display only your album URLs

Similarly, you can list the tasks of any other service

> help service-name

GoogleCL: Available Services

GoogleCL provides access to various services from the command line. The services provided by GoogleCL are

  1. picasa
  2. blogger
  3. youtube
  4. docs
  5. contacts
  6. calendar

At any time, to find out which services are available, simply type help at the GoogleCL prompt

$ google
> help
Welcome to the Google CL tool!
  Commands are broken into several parts: 
    service, task, options, and arguments.
  For example, in the command
      "> picasa post --title "My Cat Photos" photos/cats/*"
  the service is "picasa", the task is "post", the single
  option is a title of "My Cat Photos", and the argument is the 
  path to the photos.

  The available services are 
'picasa', 'blogger', 'youtube', 'docs', 'contacts', 'calendar'
  Enter "> help <service>" for more information on a service.
  Or, just "quit" to quit.
> 

As you can see, the last few lines of the help text list the available services

The available services are
'picasa', 'blogger', 'youtube', 'docs', 'contacts', 'calendar'