
Serious problem with collecting results in a recursive function


selma


Hi there, I hope one of you can help me with a problem I'm having.

 

OK, let's start with an explanation of what I want to do:

I'd like to collect the headlines from an HTML page and store the results in an array whose structure represents the DOM levels of the page.

 

Small Example:

<h1>test1.1</h1>

<h2>test2.1</h2>

<h2>test2.2</h2>

<h3>test3</h3>

<h1>test1.2</h1>

Should end in an array structure like:

 

array('level1' => array (
            'sibble1' => array (
                'headline' => 'test1.1',
                'level2' => array(
                    'sibble1' => array (
                        'headline' => 'test2.1',
                        'level3' => array(), // empty, but must still be processed to find gaps: an h4 headline might exist below
                    ),
                    'sibble2' => array (
                        'headline' => 'test2.2',
                        'level3' => array (
                            'sibble1' => array (
                                'headline' => 'test3',
                                'level4' => array(), // empty, but must still be processed to find gaps: a deeper headline might exist below
                            ),
                        ),
                    ),
                ),
            ),
            'sibble2' => array (
                'headline' => 'test1.2',
                'level2' => array(),
            ),
        ),
);

 

I hope this example shows what I want to do. Each "level" key represents a headline level (1-9), and each "sibble" key names one of the sibling headlines on that level.

 

To extract the needed data from the HTML, I built a class with a recursive function that uses a regex to slice the HTML from one headline to the next. It first iterates over all children; when there are no more children, it moves on to the next sibling element.

 

Here is a running code example:

 

<?php

class Application_Model_DomParser {

    const PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH = 4;

    private $slicedFields = array();

    public function __construct() {
    }

    public function sliceFirstSectionToDataFields($level, $haystack) {

        preg_match_all("#(.*?)<h$level>(.*?)</h$level>(.*?)(<h$level>|$)#s", $haystack, $data);

        // prepare the chunkData
        $pageText = '';
        if (isset($data[1][0])) {
            $pageText = $data[1][0];
        }
        $headline = '';
        if (isset($data[2][0])) {
            $headline = $data[2][0];
        }
        $dataToProcessNextLevel ='';
        if (isset($data[3][0])) {
            $dataToProcessNextLevel = $data[3][0];
        }
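        // When nothing matched, $data[0][0] is not set; treat the match length as 0
        // instead of suppressing the warning with @ (the resulting offset is the same)
        $matchLength = isset($data[0][0]) ? strlen($data[0][0]) : 0;
        $posOfNextChild = $matchLength - self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH;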


// from here goes the debug...
echo $headline ."::" .strlen($dataToProcessNextLevel). "<br>";


        if (strlen($dataToProcessNextLevel) <= self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH) {
            if ($level < 9) {
                $dataToProcessNextLevel = $haystack;
            }
            else {
                return;
            }
        }
        //recursive check for next level in $dataToProcessNextLevel
        $nextLevel = $level + 1;
        $this->sliceFirstSectionToDataFields($nextLevel, $dataToProcessNextLevel);


        $haystack = substr($haystack, $posOfNextChild);
        // sliced to the end
        if (self::PREG_MATCH_ONE_ELEMENT_TO_MUCH_LENGTH == strlen($haystack)) {
//            die("slized To The End");
            return;
        }

        // recursive check for other children on the current level...
        $this->sliceFirstSectionToDataFields($level, $haystack);
    }
}


$htmlString = 'test<h1>headkline1.1</h1>
<p>test test</p>
<h2>headline 2.1</h2>test test
<h2>headline 2.2</h2>
<p>tes test test</p>
<h3>headline 3.1</h3>test
<h3>headline 3.2</h3>test
<h2>headline 2.3</h2>
<p> </p>
<p>test</p>
<h1>headline 1.2</h1>
<h2>headline 2.4</h2>
<p>11111111111112222222222222222222222</p>
<p> ewfwrefg upowmdg w3q09umq09wrt n3q089ty 3q0898943ty -98 41</p>
<h3>headline 3.3</h3>
<p>test</p>
<p>test</p>
<p>test</p>
<h1>1.3 testtest</h1>
<p>test</p>
<h3>head 3.4</h3>
<p>sadfsadfsadf asdfsda f sdaas saf saddas</p>
<h3>head 3.4</h3>
<h3>test 1.3</h3>
<p>test;</p>
';

$model = new Application_Model_DomParser();
$result = $model->sliceFirstSectionToDataFields(1, $htmlString);

 

 

If you run this piece of code, you can see from the debug echo that every headline is found, and in the right order of occurrence.

 

My problem is that I don't get how to return the extracted values and collect them into a result like the structure shown above.
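To make the goal concrete, here is a minimal sketch of one way the values could be returned and collected. It is not the class above: it assumes PHP 7 or later, it first flattens the headlines into a list with a single regex, and the function names extractHeadlines and buildLevel are made up for illustration. It also assumes that a skipped level (for example an h3 directly under an h1) should show up as a node with an empty headline, which is only one reading of "finding gaps".

```php
<?php

/**
 * Collects <h1>..<h9> headlines in document order as [level, text] pairs.
 * Assumption: headline tags carry no attributes, as in the example HTML.
 */
function extractHeadlines(string $html): array
{
    preg_match_all('#<h([1-9])>(.*?)</h\1>#s', $html, $matches, PREG_SET_ORDER);
    $headlines = array();
    foreach ($matches as $match) {
        $headlines[] = array((int) $match[1], trim($match[2]));
    }
    return $headlines;
}

/**
 * Builds all siblings of one level and RETURNS them, so that each caller
 * can store the result under its own 'levelN' key.
 * $index is passed by reference: it is the read position shared by all calls.
 */
function buildLevel(array $headlines, int &$index, int $level): array
{
    $siblings = array();
    // Stay on this level while the next headline belongs here or deeper.
    while ($index < count($headlines) && $headlines[$index][0] >= $level) {
        $node = array('headline' => '');
        if ($headlines[$index][0] === $level) {
            // Headline of exactly this level: consume it.
            $node['headline'] = $headlines[$index][1];
            $index++;
        }
        // Otherwise a level was skipped: keep the empty placeholder headline
        // and let the recursion below pick up the deeper headline.
        $node['level' . ($level + 1)] = $level < 9
            ? buildLevel($headlines, $index, $level + 1)
            : array();
        $siblings['sibble' . (count($siblings) + 1)] = $node;
    }
    return $siblings;
}

$html = '<h1>test1.1</h1><h2>test2.1</h2><h2>test2.2</h2><h3>test3</h3><h1>test1.2</h1>';
$index = 0;
$result = array('level1' => buildLevel(extractHeadlines($html), $index, 1));
print_r($result);
```

The key point is that every recursive call returns its own sub-array and the caller assigns it to a key, instead of the function returning nothing and only echoing.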

 

I have spent hours trying to solve this problem but didn't get to a result.

 

I know it's a hard problem, and that helping me takes time because it's a complex situation.

 

Still, I hope someone knows an answer. I need this problem solved to put my mind at rest!

 

Thanks, and greetings

Selma
