Parsing DOMDocument and only keeping one link

twittoris · September 4, 2010

I am trying to take a specific link from my site and place it into my database. I only want links that start with CORPSEARCH.ENTITY_INFORMATION?p_nameid=

Can someone point me in the right direction here?

Code for this is below:

// make the cURL request to $target_url
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    $sql = "INSERT INTO links(cid, nlink)VALUES('$i','$url')";
    $result = mysql_query($sql);

    echo $result;
    echo $url;

twittoris · September 4, 2010

What if I add preg_match somewhere in the code? Will it pull only the URLs containing that string?
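
A minimal sketch of how preg_match could be used for this, assuming the goal is to keep only hrefs that begin with the CORPSEARCH.ENTITY_INFORMATION?p_nameid= prefix. preg_match() returns 1 when the pattern matches, so its return value has to be checked; calling it without using the result does not filter anything. The function name filterEntityLinks and the sample URLs are hypothetical.

```php
<?php
// Hypothetical helper: keep only URLs that start with the entity-information prefix.
function filterEntityLinks(array $urls): array
{
    // preg_quote() escapes '.' and '?' so they match literally; '^' anchors at the start.
    $pattern = '/^' . preg_quote('CORPSEARCH.ENTITY_INFORMATION?p_nameid=', '/') . '/';

    $kept = [];
    foreach ($urls as $url) {
        // preg_match() returns 1 on a match and 0 when there is no match.
        if (preg_match($pattern, $url) === 1) {
            $kept[] = $url;
        }
    }
    return $kept;
}

// Made-up sample data for illustration.
$sample = [
    'CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476',
    'index.html',
    'http://example.com/about.html',
];

print_r(filterEntityLinks($sample));
```

Only the first sample URL passes the check. For a plain prefix test like this one, strpos($url, $prefix) === 0 would likely work as well, without a regular expression.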

twittoris · September 4, 2010

Here I have edited it a little and put the script online, but it is still spitting out every link on the page.

http://empirebuildingsestate.com/table.php

I only want to grab links that follow this layout:

CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476&p_entity_name=%41%72%77%65%6E%20%45%71%75%69%74%69%65%73&p_name_type=%41&p_search_type=%42%45%47%49%4E%53&p_srch_results_page=0

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    preg_match_all(nameid,$url);

    $sql = "INSERT INTO links(cid, nlink)VALUES('$i','$url')";
    $result = mysql_query($sql);

    echo $result;
    echo $url;

    // if the insert succeeded, display the message "Successful"
    if ($result) {
        echo "Successful";
        echo "<BR>";
    } else {
        echo "ERROR";
    }

    echo "<br />Link stored: $url";
}
?>

wildteen88 · September 4, 2010

Use the built-in XPath function starts-with() to select only the links that begin with 'CORPSEARCH.ENTITY_INFORMATION'.

So change this

$hrefs = $xpath->evaluate("/html/body//a");

To

$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, 'CORPSEARCH.ENTITY_INFORMATION')]");

Or it can be just this

$hrefs = $xpath->evaluate("//a[starts-with(@href, 'CORPSEARCH.ENTITY_INFORMATION')]");

Now your loop will be

for ($i = 0; $i < $hrefs->length; $i++) {
   $href = $hrefs->item($i);
   $url = $href->getAttribute('href');
   
   echo '<p>Found:<br />' . $url. '<br />Adding it to the database... ';
   
   $sql="INSERT INTO links(cid, nlink)VALUES('$i','$url')";
   $result = mysql_query($sql);
   
   echo (($result) ? 'Success!' : 'FAIL') . '</p>';
}
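
To see the starts-with() filter in isolation, here is a small self-contained sketch that runs the same XPath query against an inline HTML string instead of a cURL response. The sample markup and the function name extractEntityLinks are made up for illustration, and libxml_use_internal_errors() is used in place of the @ operator to silence parser warnings.

```php
<?php
// Hypothetical helper: return the hrefs that begin with the given prefix.
function extractEntityLinks(string $html, string $prefix = 'CORPSEARCH.ENTITY_INFORMATION'): array
{
    // Collect libxml warnings about sloppy HTML internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The prefix is assumed to contain no single quote, since it is embedded in the query.
    $nodes = $xpath->query("//a[starts-with(@href, '" . $prefix . "')]");

    $urls = [];
    foreach ($nodes as $node) {
        $urls[] = $node->getAttribute('href');
    }
    return $urls;
}

// Made-up sample page.
$html = <<<HTML
<html><body>
<a href="CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&amp;p_corpid=3227476">Entity</a>
<a href="index.html">Home</a>
<a href="http://example.com/">External</a>
</body></html>
HTML;

foreach (extractEntityLinks($html) as $url) {
    echo $url, PHP_EOL;
}
```

Note that starts-with() compares the href exactly as it is written in the page, so an absolute URL that merely contains the prefix further along would not be selected.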

twittoris · September 4, 2010

Awesome! That was it. Thanks so much for your help.
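
The mysql_query() calls in this thread put $url straight into the SQL string, and the mysql_* extension was removed in PHP 7. A hedged sketch of the same insert using a PDO prepared statement follows. It assumes the links(cid, nlink) table from the posts above, and it uses an in-memory SQLite database only so the example runs on its own; the DSN and the storeLinks name are illustrative, not part of the original code.

```php
<?php
// Hypothetical helper: insert each URL with bound parameters instead of string interpolation.
function storeLinks(PDO $pdo, array $urls): int
{
    $stmt = $pdo->prepare('INSERT INTO links (cid, nlink) VALUES (:cid, :nlink)');

    $stored = 0;
    foreach ($urls as $i => $url) {
        // Values are sent separately from the SQL text, so quotes in $url cannot break the query.
        $stmt->execute([':cid' => $i, ':nlink' => $url]);
        $stored++;
    }
    return $stored;
}

// In-memory SQLite stands in for the real MySQL database (an assumption for this demo).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (cid INTEGER, nlink TEXT)');

$count = storeLinks($pdo, ['CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476']);
echo "Stored $count link(s)", PHP_EOL;
```

Against MySQL, only the DSN and credentials passed to the PDO constructor should need to change; the prepared statement itself stays the same.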
