
Valid URLs


Roosterr

Recommended Posts

Hello!

First of all, I'm sorry about my English skills. It seems I can't get a correct or good answer in my own language :)

So: I have a site where visitors can post their favorite links to other sites and share them with others.

 

My problem is multiple URLs pointing to the same location in my sites table.

example:

http://google.com

http://google.com/

http://www.google.com

http://www.google.com/

http://www.google.com/index.php (?)

They all lead to the same place...

 

Is there any lightweight and efficient way to check these? I'm trying a query with LIKE, but it only works if the URL is exactly the same. I use fopen to check that the URL really exists.

 

table:

Sites (basic information for link)

-id

-url

-title

+some meta tags if found

-timestamp

 

I'm not asking anyone to write this code for me.

But could someone tell me which functions and techniques I should be looking at, and what I need for this?

 

My goal is to keep the sites table as clean as possible.

If a visitor adds a URL that is already in the sites table, the only thing that should happen is that its timestamp is updated.
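That update-or-insert behaviour is easiest to get right if the database itself enforces uniqueness on the (normalized) URL. A minimal sketch, using an in-memory SQLite database via PDO so it runs standalone; the table layout mirrors the one described above, and with MySQL the equivalent statement would be `INSERT ... ON DUPLICATE KEY UPDATE`:

```php
<?php
// Sketch: upsert into a sites table keyed on a normalized URL.
// In-memory SQLite keeps the example self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE sites (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,
    title     TEXT,
    timestamp INTEGER
)');

// UNIQUE(url) makes the database reject duplicates for us: inserting
// the same URL a second time only refreshes its timestamp.
$stmt = $db->prepare('INSERT INTO sites (url, title, timestamp)
                      VALUES (:url, :title, :ts)
                      ON CONFLICT(url) DO UPDATE SET timestamp = excluded.timestamp');

$stmt->execute([':url' => 'http://google.com', ':title' => 'Google', ':ts' => time()]);
$stmt->execute([':url' => 'http://google.com', ':title' => 'Google', ':ts' => time()]);

echo $db->query('SELECT COUNT(*) FROM sites')->fetchColumn(); // prints 1
```

The key point is that deduplication then only depends on how well you normalize the URL before inserting it.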

 

Thank you all for helping me!

 

-Roosterr

Link to comment
Share on other sites

Hey Rooster, it's a good question and one that can for the most part be done with regular expressions. If you run the code below you will see what it prints out:

as you can see, it's on its way to finding the similarities; then you would use PHP's array_unique() (or array_diff()) function to wipe out the repeats.

 


$string = 'http://google.com http://google.com/ http://www.google.com http://www.google.com/ http://www.google.com/index.php';
$values = explode(' ', $string);

foreach ($values as $value) {
    echo '<br>';
    // Strip a trailing slash after "com"
    $newvalue = preg_replace('/com\/?$/i', 'com', $value);
    echo $newvalue;
}

http://google.com

http://google.com

http://www.google.com

http://www.google.com

http://www.google.com/index.php

 

 

Link to comment
Share on other sites

Hey there, I had a bit of time on my hands. Anyway, this one gets it right down to one URL, so you can see how each preg_replace run narrows it down a bit more. There could be other duplicates that slip through the net, so it may need more testing, and yes, I am aware there are more elegant ways of doing this in one swoop, say with one long regex. Anyone?

 

$string = 'http://google.com http://google.com/ http://www.google.com http://www.google.com/ http://www.google.com/index.php';
$array1 = explode(' ', $string);

// Pass 1: strip a trailing slash after "com"
foreach ($array1 as $value) {
    $array2[] = preg_replace('/com\/?$/i', 'com', $value);
}

// Pass 2: strip a leading "www."
foreach ($array2 as $value) {
    $array3[] = preg_replace('~^http://www\.~i', 'http://', $value);
}

// Pass 3: strip a trailing /index.php, /index.html or /index.htm
// (anchored to the end of the string; the original pattern's
// unparenthesized alternation also matched index files mid-URL)
foreach ($array3 as $value) {
    $array4[] = preg_replace('~/index\.(php|html?)$~i', '', $value);
}

print_r(array_unique($array4));
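On the "one swoop" question: preg_replace() accepts arrays of patterns and of subjects, so the three passes above can be collapsed into a pair of patterns plus array_unique(). A sketch (the patterns encode the same assumptions as before, namely that www. and the bare domain are the same site and that /index.php etc. are the default page):

```php
<?php
$urls = [
    'http://google.com',
    'http://google.com/',
    'http://www.google.com',
    'http://www.google.com/',
    'http://www.google.com/index.php',
];

// Strip a leading "www.", then a trailing "/", "/index.php",
// "/index.html" or "/index.htm". Patterns are applied in order
// to every element of the subject array.
$normalized = preg_replace(
    ['~^http://www\.~i', '~/(index\.(php|html?))?$~i'],
    ['http://',          ''],
    $urls
);

print_r(array_unique($normalized)); // one entry: http://google.com
```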

 

 

 


It sounds like you want to determine whether the URL entered by a user is already in the database, so you don't enter a duplicate. Since domain.com and www.domain.com could, in fact, be different sites, just normalizing the URL may not be the correct solution.

 

Go to the address bar of your browser and type google.com (no http, no www, just google.com). When the page finishes loading, look at the address bar: it shows http://www.google.com/ (this is the effective URL).

 

If I were going to attack this, I would start by using curl to issue a HEAD request (curl can do this via CURLOPT_NOBODY; fall back to GET if the server rejects HEAD) for the address the user entered. If the request fails, the user's entry is invalid. If it succeeds, you should be able to retrieve the effective URL from the curl information. This is the URL I would store in the database. I would try HEAD first because it uses less bandwidth on your server and the destination server; this is just common courtesy, and there is less returned data to deal with.
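A rough sketch of that approach with PHP's curl extension; the function name head_effective_url is mine, HEAD is requested via CURLOPT_NOBODY, and the post-redirect address comes back from CURLINFO_EFFECTIVE_URL:

```php
<?php
// Sketch: issue a HEAD request and return the effective (post-redirect)
// URL, or false if the address can't be reached at all.
function head_effective_url(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD: headers only, no body
        CURLOPT_FOLLOWLOCATION => true,  // chase redirects to the final address
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $ok = curl_exec($ch);
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $ok === false ? false : $effective;
}

// e.g. head_effective_url('http://google.com') follows the redirects
// and returns the address the browser would end up showing.
```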

 

If I could not get this to work with curl (and some help from phpFreaks), then I would take a look at parse_url() to break down the user's entry and normalize the url before entering it in the database.
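For the normalization route, a small sketch built on parse_url(). The rules applied here (lowercasing, dropping "www.", trailing slashes, and directory indexes) are assumptions about which variants really are the same page, which is exactly the caveat raised above:

```php
<?php
// Sketch: collapse common URL variants to one canonical form.
// Which variants count as "the same" is an assumption; adjust to taste.
function normalize_url(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || empty($parts['host'])) {
        return $url; // no host found (e.g. "google.com" with no scheme): leave as entered
    }

    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower(preg_replace('/^www\./i', '', $parts['host']));
    $path   = $parts['path'] ?? '';

    // Drop a trailing directory index, then any trailing slash.
    $path = preg_replace('~/index\.(php|html?)$~i', '/', $path);
    $path = rtrim($path, '/');

    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $scheme . '://' . $host . $path . $query;
}

// All five of the poster's examples collapse to http://google.com, e.g.:
// normalize_url('http://www.google.com/index.php') → 'http://google.com'
```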

 

