
Valid URLs


Roosterr

Recommended Posts

Hello!

First of all, I'm sorry about my English skills. It seems I can't get a correct or good answer in my own language :)

So: I have a site where visitors can post their favorite links to other sites and share them with others.

 

My problem is multiple URLs pointing to the same location in my sites table.

example:

http://google.com

http://google.com/

http://www.google.com

http://www.google.com/

http://www.google.com/index.php (?)

They all lead to the same place...

 

Is there any lightweight and efficient way to check these? I'm trying a query with LIKE, but it only works if the URL is exactly the same. I use fopen to check that the URL really exists.

 

table:

Sites (basic information for link)

-id

-url

-title

+some meta tags if found

-timestamp

 

I'm not asking anyone to write this code for me.

But could someone tell me which functions and techniques I should be looking at, and what I need for this?

 

My goal is to keep the sites table as clean as possible.

If a visitor adds a URL that is already in the sites table, the only thing that should happen is that its timestamp is updated.
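That update-or-insert behaviour is easiest to get right if the database itself enforces uniqueness on the (normalized) URL. A minimal sketch, using an in-memory SQLite database via PDO so it runs standalone; the table layout mirrors the one described above, and with MySQL the equivalent statement would be `INSERT ... ON DUPLICATE KEY UPDATE`:

```php
<?php
// Sketch: upsert into a sites table keyed on a normalized URL.
// In-memory SQLite keeps the example self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE sites (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,
    title     TEXT,
    timestamp INTEGER
)');

// UNIQUE(url) makes the database reject duplicates for us: inserting
// the same URL a second time only refreshes its timestamp.
$stmt = $db->prepare('INSERT INTO sites (url, title, timestamp)
                      VALUES (:url, :title, :ts)
                      ON CONFLICT(url) DO UPDATE SET timestamp = excluded.timestamp');

$stmt->execute([':url' => 'http://google.com', ':title' => 'Google', ':ts' => time()]);
$stmt->execute([':url' => 'http://google.com', ':title' => 'Google', ':ts' => time()]);

echo $db->query('SELECT COUNT(*) FROM sites')->fetchColumn(); // prints 1
```

The key point is that deduplication then only depends on how well you normalize the URL before inserting it.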

 

Thank you all for helping me!

 

-Roosterr

Link to comment
Share on other sites

Hey Rooster, it's a good question and one that can for the most part be done with regular expressions. If you run the code below you will see what it prints out:

as you can see, it's on its way to finding the similarities; then you would use PHP's array_unique() (or array_diff()) function to wipe out the repeats.

 


$string = 'http://google.com http://google.com/ http://www.google.com http://www.google.com/ http://www.google.com/index.php';
$values = explode(' ', $string);

foreach ($values as $value) {
    echo '<br>';
    // Strip a trailing slash after "com"
    $newvalue = preg_replace('/com\/?$/i', 'com', $value);
    echo $newvalue;
}

http://google.com

http://google.com

http://www.google.com

http://www.google.com

http://www.google.com/index.php

 

 

Link to comment
Share on other sites

Hey there, I had a bit of time on my hands. Anyway, this one gets it right down to one URL, so you can see how each preg_replace run narrows it down a bit more. There could be other duplicates that slip through the net, so it may need more testing, and yes, I am aware there are more elegant ways of doing this in one swoop, say with one long regex. Anyone?

 

$string = 'http://google.com http://google.com/ http://www.google.com http://www.google.com/ http://www.google.com/index.php';
$array1 = explode(' ', $string);

// Pass 1: strip a trailing slash after "com"
foreach ($array1 as $value) {
    $array2[] = preg_replace('/com\/?$/i', 'com', $value);
}

// Pass 2: strip a leading "www."
foreach ($array2 as $value) {
    $array3[] = preg_replace('~^http://www\.~i', 'http://', $value);
}

// Pass 3: strip a trailing /index.php, /index.html or /index.htm
// (anchored to the end of the string; the original pattern's
// unparenthesized alternation also matched index files mid-URL)
foreach ($array3 as $value) {
    $array4[] = preg_replace('~/index\.(php|html?)$~i', '', $value);
}

print_r(array_unique($array4));
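On the "one swoop" question: preg_replace() accepts arrays of patterns and of subjects, so the three passes above can be collapsed into a pair of patterns plus array_unique(). A sketch (the patterns encode the same assumptions as before, namely that www. and the bare domain are the same site and that /index.php etc. are the default page):

```php
<?php
$urls = [
    'http://google.com',
    'http://google.com/',
    'http://www.google.com',
    'http://www.google.com/',
    'http://www.google.com/index.php',
];

// Strip a leading "www.", then a trailing "/", "/index.php",
// "/index.html" or "/index.htm". Patterns are applied in order
// to every element of the subject array.
$normalized = preg_replace(
    ['~^http://www\.~i', '~/(index\.(php|html?))?$~i'],
    ['http://',          ''],
    $urls
);

print_r(array_unique($normalized)); // one entry: http://google.com
```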

 

 

 


It sounds like you want to determine whether the URL entered by a user is already in the database, so you don't enter a duplicate. Since domain.com and www.domain.com could, in fact, be different sites, just normalizing the URL may not be the correct solution.

 

Go to the address bar of your browser and type google.com (no http, no www, just google.com). When the page finishes loading, look at the address bar: it shows http://www.google.com/ (this is the effective URL).

 

If I were going to attack this, I would start by using curl to issue a HEAD request (curl can do this via CURLOPT_NOBODY; fall back to GET if the server rejects HEAD) for the address the user entered. If the request fails, the user's entry is invalid. If it succeeds, you should be able to retrieve the effective URL from the curl information. This is the URL I would store in the database. I would try HEAD first because it uses less bandwidth on your server and the destination server; this is just common courtesy, and there is less returned data to deal with.
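A rough sketch of that approach with PHP's curl extension; the function name head_effective_url is mine, HEAD is requested via CURLOPT_NOBODY, and the post-redirect address comes back from CURLINFO_EFFECTIVE_URL:

```php
<?php
// Sketch: issue a HEAD request and return the effective (post-redirect)
// URL, or false if the address can't be reached at all.
function head_effective_url(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD: headers only, no body
        CURLOPT_FOLLOWLOCATION => true,  // chase redirects to the final address
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $ok = curl_exec($ch);
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $ok === false ? false : $effective;
}

// e.g. head_effective_url('http://google.com') follows the redirects
// and returns the address the browser would end up showing.
```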

 

If I could not get this to work with curl (and some help from phpFreaks), then I would take a look at parse_url() to break down the user's entry and normalize the url before entering it in the database.
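For the normalization route, a small sketch built on parse_url(). The rules applied here (lowercasing, dropping "www.", trailing slashes, and directory indexes) are assumptions about which variants really are the same page, which is exactly the caveat raised above:

```php
<?php
// Sketch: collapse common URL variants to one canonical form.
// Which variants count as "the same" is an assumption; adjust to taste.
function normalize_url(string $url): string
{
    $parts = parse_url(trim($url));
    if ($parts === false || empty($parts['host'])) {
        return $url; // no host found (e.g. "google.com" with no scheme): leave as entered
    }

    $scheme = strtolower($parts['scheme'] ?? 'http');
    $host   = strtolower(preg_replace('/^www\./i', '', $parts['host']));
    $path   = $parts['path'] ?? '';

    // Drop a trailing directory index, then any trailing slash.
    $path = preg_replace('~/index\.(php|html?)$~i', '/', $path);
    $path = rtrim($path, '/');

    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $scheme . '://' . $host . $path . $query;
}

// All five of the poster's examples collapse to http://google.com, e.g.:
// normalize_url('http://www.google.com/index.php') → 'http://google.com'
```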

 

