
cURL with link fixer - Input needed


The Letter E


I'm working on a fix for pages fetched with cURL that replaces all relative URLs with absolute ones.

 

Here's the code:

<?php
//Get web address 
//FORMAT: url=site.com, url=site.net, url=site.org
$page = $_GET['url'];

//Format web address: strip any leading scheme and "www.", then rebuild
//(preg_replace returns the new string, so the result has to be assigned)
$page = preg_replace('/^https?:\/\//i', '', $page);
$page = preg_replace('/^www\./i', '', $page);
$page = rtrim($page, '/');
$page = 'http://www.'.$page.'/';

//cURL
        // create curl resource
        $ch = curl_init();
        // set url
        curl_setopt($ch, CURLOPT_URL, $page);
        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // $output contains the output string
        $output = curl_exec($ch);
        // close curl resource to free up system resources
        curl_close($ch);

//Convert relative URL to absolute
$output = preg_replace('/src="/', 'src="'.$page, $output);
$output = preg_replace('/href="/', 'href="'.$page, $output);
$output = preg_replace('/action="/', 'action="'.$page, $output);

echo $output;
?>

 

As you can see it's pretty basic. In many cases it fixes broken styles, links, images and form actions. I am looking for any ideas as to how I can add some more intelligence to this script.
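
One weak spot is easy to illustrate: the three preg_replace calls prefix every src, href and action value, including ones that are already absolute. The sketch below is only an illustration of a stricter version, not a finished fix. The function name `absolutize_attributes` is made up for this example, and it assumes double-quoted attributes and paths resolved against the site root.

```php
<?php
// Hypothetical helper (not part of the original script): prefix $base only
// onto src/href/action values that are not already absolute.
function absolutize_attributes(string $html, string $base): string
{
    $base = rtrim($base, '/') . '/';
    // The lookahead skips values that start with a scheme (http:, mailto:,
    // data:, ...), a protocol-relative "//", or a fragment "#".
    $pattern = '~\b(src|href|action)="(?![a-z][a-z0-9+.-]*:|//|#)/?([^"]*)"~i';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($base): string {
            // $m[1] is the attribute name, $m[2] the path without a leading slash
            return $m[1] . '="' . $base . $m[2] . '"';
        },
        $html
    );
}

echo absolutize_attributes('<img src="img/a.png">', 'http://www.site.com/');
// prints: <img src="http://www.site.com/img/a.png">
```

It still does not handle single-quoted attributes, `../` segments, or pages that live in a subdirectory, so treat it as a starting point only.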

 

1. What else should it do?

2. Where is it not doing its job?

3. Can it do what it's already doing better?

 

Any input offered is much appreciated. I'm not looking for someone to write code, but if you are intrigued and want to add a snippet to it, that's cool! Feel free to keep a copy of your own if you like the idea to build off of.

 

Thanks Peeps,

 

E


You can check whether the URL starts with one of http://, ftp://, https:// or feed://, and add http:// to the front if it doesn't. Whether or not it has the www. won't matter much, as cURL will resolve those to where they need to be. Some sites only work with one form or the other, but there's not much you can do about that.
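
As a rough illustration of that check (the function name `ensure_scheme` is made up for this sketch, and the scheme list is just the four mentioned above):

```php
<?php
// Hypothetical helper: add "http://" only when the input does not already
// start with one of the known schemes.
function ensure_scheme(string $url): string
{
    $url = trim($url);
    // Case-insensitive, anchored at the start of the string
    if (preg_match('~^(https?|ftp|feed)://~i', $url)) {
        return $url;
    }
    return 'http://' . $url;
}

echo ensure_scheme('site.com'); // prints: http://site.com
```

Matching case-insensitively means an input like `HTTPS://Site.com` is left alone here and can be lowercased in a later step.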

After cURL resolves it, use PHP's parse_url() function and lowercase just the domain name part.

I also lowercase the protocol part; who knows why people like to capitalize this stuff.
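
A minimal sketch of the lowercasing step, assuming the URL already carries a scheme. The function name `lowercase_scheme_and_host` is hypothetical, and unlike the code further down this version keeps the port and credentials:

```php
<?php
// Hypothetical helper: lowercase only the scheme and host, leaving path,
// query and fragment untouched (those can be case-sensitive).
function lowercase_scheme_and_host(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // nothing we can safely rebuild, return the input as-is
    }
    $auth = '';
    if (isset($parts['user'])) {
        $auth = $parts['user'] . (isset($parts['pass']) ? ':' . $parts['pass'] : '') . '@';
    }
    $result = strtolower($parts['scheme']) . '://' . $auth . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo lowercase_scheme_and_host('HTTP://WWW.Example.COM/Some/Path?Q=1#Frag');
// prints: http://www.example.com/Some/Path?Q=1#Frag
```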

Some sites require a trailing slash while others can't have one. I found it's best not to add the trailing slash; for sites that usually have one, the links tend to resolve with or without it.

 

 

So after all my babble, here's the code I use to fix links for inclusion on my site.

 

I make everything http instead of https or ftp, because I chop off the protocols when I post them anyway, and when someone clicks the URL in a browser they end up where they need to be.

 

So maybe you can use some of my code. You can add the port number back into the complete URL as well; I eliminate any ports, as you can see in my code. I just didn't want those.

 

$trimurl = trim($_GET['domainname']);
            $trimurl = substr($trimurl, 0,300);
            
            if ($_GET['domainname'] == "" OR $_GET['domainname'] == "http://"){

            echo "<h1>Please Insert a Url</h1><br />";
           
            } else {

            /*mysql_real_escape_string() was removed in PHP 7; escape at query time (e.g. prepared statements) instead*/
            $input_parse_url = $trimurl;
            /*check for valid urls*/
            if ((substr($input_parse_url, 0, 8) == "https://") OR (substr($input_parse_url, 0, 12) == "https://www.") OR (substr($input_parse_url, 0, 7) == "http://") OR (substr($input_parse_url, 0, 11) == "http://www.") OR (substr($input_parse_url, 0, 4) == "www.") OR (substr($input_parse_url, 0, 6) == "ftp://")  OR (substr($input_parse_url, 0, 11) == "feed://www.") OR (substr($input_parse_url, 0, 7) == "feed://")) {
                $new_parse_url = $input_parse_url;

            } else {
                /*replace uppercase or unsupported to normal*/
                $url_input = str_replace(array('feed://www.','feed://','HTTP://','HTTP://www.','HTTP://WWW.','http://WWW.','HTTPS://','HTTPS://www.','HTTPS://WWW.','https://WWW.'), '', $input_parse_url);
                $new_parse_url = "http://$url_input";
            }
            echo "$input_parse_url<br />";
            
            /*parse the complete url to lowercase just the site domain area*/
            function getparsedHost($new_parse_url) {
                $parsedUrl = parse_url(trim($new_parse_url));
                return trim(!empty($parsedUrl['host']) ? $parsedUrl['host'] : explode('/', $parsedUrl['path'] ?? '', 2)[0]);
            }
            $get_parse_url = parse_url($new_parse_url, PHP_URL_HOST);
            $host_parse_url = str_replace(array('Www.','WWW.'), '', $get_parse_url);
            $host_parse_url = strtolower($host_parse_url);
            $port_parse_url = parse_url($new_parse_url, PHP_URL_PORT);
            $user_parse_url = parse_url($new_parse_url, PHP_URL_USER);
            $pass_parse_url = parse_url($new_parse_url, PHP_URL_PASS);
            $get_path_parse_url = parse_url($new_parse_url, PHP_URL_PATH);
            $path_parse_url = str_replace(array('Www.','WWW.'), '', $get_path_parse_url);
            $query_add_parse_url = parse_url($new_parse_url, PHP_URL_QUERY);
            $query_add_parse_url = "?$query_add_parse_url";
            $query_add_parse_url = rtrim($query_add_parse_url, '#');
            $fragment_parse_url = parse_url($new_parse_url, PHP_URL_FRAGMENT);
            $fragment_parse_url = "#$fragment_parse_url";
            $hostpath_url = "$host_parse_url$path_parse_url";
            $hostpath_url = rtrim($hostpath_url, '?');
            $query_add_parse_url = rtrim($query_add_parse_url, '?');

            $hostpathquery_url = "$host_parse_url$path_parse_url$query_add_parse_url";

            $complete_url = "$host_parse_url$user_parse_url$pass_parse_url$path_parse_url$query_add_parse_url$fragment_parse_url";
            $complete_url = rtrim($complete_url, '#');
            } /*end of the else branch opened after the empty-input check*/
