Simple HTML DOM baked CakePHP component

Baked and ready to eat!

Simple HTML DOM Parser ported to CakePHP

While we loved using the Simple HTML DOM PHP class provided here, we wanted to use it on our CakePHP sites as well. Instead of feeling sorry for ourselves, we decided to port the Simple HTML DOM parser (which uses a delightful jQuery-like DOM element filter) into CakePHP!

On top of that, we added the option to fetch pages with the PHP cURL library instead of the default file_get_contents() calls. Take a look at how easy our Simple HTML DOM baked CakePHP component is to use.
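For readers curious what swapping file_get_contents() for cURL involves, here is a minimal sketch. It is not the component's actual code: the function names build_curl_options() and fetch_page_with_curl() and the timeout defaults are our own assumptions for illustration. Only the curl_* functions and CURLOPT_* constants come from PHP itself.

```php
<?php
// Hypothetical helper (not part of the component): merge caller-supplied
// cURL options over a set of assumed defaults.
function build_curl_options(array $overrides = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // assumed default, in seconds
        CURLOPT_TIMEOUT        => 30,   // assumed default, in seconds
    );
    // Array union keeps the left-hand value when a key exists on both sides,
    // so anything the caller passes wins over the defaults.
    return $overrides + $defaults;
}

// Hypothetical helper (not part of the component): fetch a URL and return
// the response body as a string, or throw if the transfer fails.
function fetch_page_with_curl($url, array $overrides = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, build_curl_options($overrides));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage (needs network access, so it is left commented out):
// $markup = fetch_page_with_curl('http://www.lolcats.com', array(CURLOPT_REFERER => ''));
```

The options are merged with the + operator rather than array_merge() because the CURLOPT_* constants are integers, and array_merge() would renumber integer keys.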

Update – 7/10/11

7/10/2011: Ralf (down in the comments below) made us aware of some changes with the usage of components in CakePHP, so please take a look at the new implementation of our component below (it’s a minor change).

Load the component into your controller like so:

<?php
class SampleController extends AppController {
    var $helpers = array('Html', 'Form');
    var $name = 'Sample';
    // make the component available as $this->SimpleHtmlDomBaked
    var $components = array('SimpleHtmlDomBaked');
}
?>

Which will then allow you to access the component’s functions throughout the controller using:

$this->SimpleHtmlDomBaked;

So altogether you can use it just like this:

$url = "http://www.lolcats.com";
// you can also just use the class ref directly
// with $this->SimpleHtmlDomBaked if you'd like!
$html = $this->SimpleHtmlDomBaked;
// curl it		
$html->curl_and_load($url, true);
// get page title
$title = $html->find('title', 0)->innertext;
// get first picture src
$firstImage = $html->find('img', 0)->src;

Using the CakePHP HTML parser to do almost anything!

And any normal call that you would make using the Simple HTML DOM parser still works too!

// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');
// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
echo $html->getElementById("div1")->childNodes(1)->getAttribute('id'); 

You can find more calls in their manual.

Update – 1/13/11

1/13/2011: Thanks to a comment from Eeks, we realized that the Simple HTML DOM function set_callback() was broken. We have updated the component with the fix!

Here is the proper usage for the set_callback() function within CakePHP:

$url = "http://www.lolcats.com";
$html = $this->SimpleHtmlDomBaked;
// curl it		
$html->curl_and_load($url, true);
// set callback 
// @params
// the first is the object reference or class name of the controller
// the second is the name of the function within that controller
$html->set_callback($this,'my_callback');
// dump var
echo $html;

When the echo statement runs, the following function fires from within the same CakePHP controller as the code above:

function my_callback($element){
       // Hide all <b> tags
       if ($element->tag=='b')
            $element->outertext = '';
}
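If you are wondering how passing $this plus a method name can end up calling your controller method, here is a minimal sketch of the general PHP mechanism. The classes SketchElement, SketchDom and SketchController are hypothetical stand-ins we made up for illustration. They are not the component's or Simple HTML DOM's real classes, and the real internals may differ.

```php
<?php
// Hypothetical stand-in for a parsed element.
class SketchElement {
    public $tag;
    public $outertext;

    public function __construct($tag, $outertext) {
        $this->tag = $tag;
        $this->outertext = $outertext;
    }
}

// Hypothetical stand-in for the parser object.
class SketchDom {
    private $elements = array();
    private $callback = null;

    public function add(SketchElement $element) {
        $this->elements[] = $element;
    }

    // Store the object/method pair as a standard PHP callable.
    public function set_callback($object, $method) {
        $this->callback = array($object, $method);
    }

    // Run the callback on every element while building the output string.
    public function __toString() {
        $out = '';
        foreach ($this->elements as $element) {
            if ($this->callback !== null) {
                call_user_func($this->callback, $element);
            }
            $out .= $element->outertext;
        }
        return $out;
    }
}

// Plays the role of the controller that owns the callback method.
class SketchController {
    public function my_callback($element) {
        // Hide all <b> tags
        if ($element->tag == 'b') {
            $element->outertext = '';
        }
    }
}

$dom = new SketchDom();
$dom->add(new SketchElement('p', '<p>kept</p>'));
$dom->add(new SketchElement('b', '<b>hidden</b>'));
$dom->set_callback(new SketchController(), 'my_callback');

$rendered = (string) $dom;
echo $rendered; // prints <p>kept</p>
```

Because PHP passes objects by handle, the change the callback makes to $element->outertext is visible to the loop that builds the output.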

And that should help you get started on the next great search engine or site parser!!

Download Simple HTML DOM baked

Support and questions

We were using CakePHP version 1.3.6 when we created this component. We expect it to work with most versions of CakePHP from 1.2 upward.

You can find the CakePHP framework download archive here:
Download CakePHP

Please direct any inquiries about this CakePHP component to the comments below.

  • bubuzzz

    I used the package and it worked great. However, sometimes my script stops at $html->curl_and_load (…) without any reason (no error, no warning). Any ideas about it?

  • http://electrokami.com Metawriter

    Can you post a sample block of the code – I would love to help you out!

  • bubuzzz

    Hi. Thanks for your reply. Sure, this is my code:
    $url = 'www.phonearena.com/news/'.$url;
    $html = new SimpleHtmlDomBakedComponent;
    debug ($url);
    $html->curl_and_load ($url, true);
    debug ($html);
    $this->set ('news', $new);

    i can see the value of $url = http://www.phonearena.com/news/Googles-eBookstore-open-for-business-in-the-U.S._id15120 but after that, the page will stop

  • bubuzzz

    I have 2 functions in the NewsController class. The first one, all(), creates an $html object and extracts data from the main page of phonearena. The second one, view($url), takes the URL of each news item's detail page and parses it into my view. The all() function runs very well without any problem. Only the second one won't run.

  • bubuzzz

    [Update] Sorry, it stopped at the line debug ($html). The library works great now :D

  • http://electrokami.com Metawriter

    So it works then? Great!

  • Eeks

    Hi, do you have any samples on using the set_callback function in cakephp?

    • http://electrokami.com Metawriter

      We have updated the component to better support the set_callback() function and allow you to use it within CakePHP!

  • http://electrokami.com Metawriter

    We will provide an update on this regarding the set_callback function.

  • Misha

    Advice please! How to install and use it on CakePHP 1.3?

    • http://electrokami.com Metawriter

      Hi Misha, you can find the official install guide from CakePHP’s creators in their manual here:
      http://book.cakephp.org/

      Install CakePHP from here:
      https://github.com/cakephp/cakephp/tree/1.3

      For simple testing usage with a local server like Zend Server or XAMPP, we recommend using the “Development” install method.

    • http://electrokami.com Metawriter

      Then once you have CakePHP installed, place this component into the /app/controllers/components folder to have access to it within your page controllers.

  • http://activezero.co.uk T

    Great component, it works well for me. I was making some modifications to the curl_and_load() function to automatically follow http redirects and noticed the referrer is hardcoded to appear to come from google with the search term “speedo usa womens”. What’s with that???

    I’m working on passing an array through to curl_and_load() which will be fed to curl_setopt() to give a bit more flexibility.

    Nice one for all the work on this. If you want the mods I have made just give me a shout.

    • http://electrokami.com Metawriter

      I would love to see what you come up with – we might as well go ahead and make this a branch of Simple HTML DOM as it is, since they don’t use cURL.

      The great thing is, this class file works within CakePHP but still functions by itself inside of any other PHP script.

      About the Speedo thing – that will be removed! I must have been using it as part of testing and left it in by mistake!

      Good catch!

      • http://www.activezero.co.uk T

        Haha, I thought it must be something like that!

        I’ve copied the whole function to http://pastebin.com/nAxqdzgW

        Usage is pretty obvious really – I’m using it like this at the moment:

        curl_and_load($url, true, array(CURLOPT_FOLLOWLOCATION => true, CURLOPT_REFERER => ""));

        The new last parameter is optional for backwards-compatibility.

        It’s all working great for me – saved me loooooads of time!

        Thanks,

        Tom.

  • http://www.24100.net Ralf

    I’ve placed the php file into /app/controllers/components, however, when I do

    $html = new SimpleHtmlDomBakedComponent;

    in any of my controllers, it does not create the object. It stops execution entirely. I’m fairly new to cakephp. Is there anything else I need to configure to load this component?

  • http://www.24100.net Ralf

    Oh, I get a
    Fatal error: Class ‘SimpleHtmlDomBakedComponent’ not found in /…
    error. So I guess the component is not loading. Help is greatly appreciated.

    • http://electrokami.com/author/camsjams/ Cameron

      Is it saying not found in “/…” or does it say what directory it’s looking in? (But don’t paste your entire site directory in here, for security purposes.)

      • http://electrokami.com/author/camsjams/ Cameron

        Oh I found the problem – it appears CakePHP had recently updated how components are implemented – please look at the new section up at the top of the article to see how to use the component now!

        Thanks for catching this!

  • http://www.watertech.gr ksotiris

    Using simple_html_dom_baked to post data
    You can use the nice code of this component to post and get data using cURL. Just add the following to the simple_html_dom_baked.php file, just above the curl_and_load function:

    function curl_and_post($url, $fields, $lowercase=true){
        // url-ify the data for the POST, encoding each key and value
        $fields_string = '';
        foreach($fields as $key=>$value) {
            $fields_string .= urlencode($key).'='.urlencode($value).'&';
        }
        // rtrim() returns the trimmed string, so assign the result
        $fields_string = rtrim($fields_string, '&');

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $useragent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)";
        curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

        // CURLOPT_POST is a boolean switch
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);

        $curl_scraped_page = curl_exec($ch);
        curl_close($ch);
        $this->load($curl_scraped_page, $lowercase);
    }

    • http://meiotech.blogspot.com mord4z

      @ksotiris, I’m trying to post some data with your code, but without success, can you give me a clue?

  • phar

    A few changes: put the simplehtmldom file into /Vendor ({plugin}/Vendor), rename it to camelCase, strip “Component” from the class name in the file, and include it with App::import(). Now you can call it via “$var = new simpleHtmlDom('example.com')”.