Best way to cross-match large datasets


plautzer

Recommended Posts

Hi,

 

I'm running a script where I'm cross-matching about 200,000 data sets with each other. Each data set consists of 8 parameters, and for each data set I want to count all other data sets that have similar or identical parameters.

 

Right now I am doing the matching via a MySQL query which I'm calling about 200,000 times. The problem is that using a query is extremely expensive… it takes up to 2 hours until the script is done. So I am wondering if there is a better method to cross-match data sets and if someone could help me find a better solution.

 

While researching I found out that arrays may be a faster alternative to queries. So far I have identified 3 possible ways of cross-matching:

 

1. Nested foreach() loops:

foreach ($array as $ar1) {
    foreach ($array as $ar2) {
        if ($ar1[0] == $ar2[0]) { /* ... */ }
    }
}

 

2. Using array_map() with a callback function, so that I would have only one "hand-coded" loop:

foreach ($array as $arr) {
    if ($arr[0] == $parameter) { /* ... */ }
}

 

3. array_walk(), where I could save one "hand-coded" loop as well (a short sketch of options 2 and 3 follows below).
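
For illustration, here is a minimal sketch of what options 2 and 3 could look like; the example data and variable names are made up and only stand in for the real data sets:

<?php
// Made-up sample data: each row is a numerically indexed set of parameters.
$rows = array(
    array(1, 7, 3, 850, 320),
    array(1, 9, 4, 900, 410),
    array(2, 7, 3, 950, 350),
);
$parameter = 1;

// Option 2: array_map() with a callback. The visible loop disappears,
// but array_map() still visits every element internally.
$flags = array_map(
    function ($row) use ($parameter) {
        return ($row[0] == $parameter) ? 1 : 0;   // 1 = match, 0 = no match
    },
    $rows
);
$count = array_sum($flags);

// Option 3: array_walk(). The callback runs once per element and
// collects the result through a reference.
$count2 = 0;
array_walk($rows, function ($row, $key) use ($parameter, &$count2) {
    if ($row[0] == $parameter) {
        $count2++;
    }
});

echo "array_map: $count, array_walk: $count2\n";   // both print 2

All three options still visit every element, so array_map() and array_walk() mainly move the loop into a callback rather than reduce the amount of work.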

 

What would theoretically be the best/fastest way to go about it? Can anyone tell me what the technical difference between those 3 ways is, which one is the better approach, or whether there are other alternatives to them?

 

I am thankful for any advice that helps me reduce execution time!

 

 

Greetz

plautzer


Here is an example of a query with 6 parameters.

 

In this case par1, par2 and par3 should be unequal to the parameters of the current data set.

par4 and par5 are supposed to be in a range which varies per data set.

 

In the end I do a simple statistical count.

 

$sql =" SELECT         
    Count(id) as sum,
    Sum(If(type = 1 ,1,0)) as type1,
    Sum(If(type = 2 ,1,0)) as type2,
    Sum(If(type = 3 ,1,0)) as type3,
    Sum(If(type = 4 ,1,0)) as type4,
    Sum(If(type = 5 ,1,0)) as type5
  From table1
  WHERE
    (par1 != $id1 and par2 != $id2) and
    (par1  != $id2 and par3  != $id1) and
    par2 = $par3 and
    par4 between 800 and 1000 and
    par5 between 400 and 300 and
    date < '$date' and
    date > DATE_SUB('$date', INTERVAL 5 YEAR)";   

 

The query takes up to 0.1 seconds.
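
For illustration, here is a rough sketch of the array-based alternative described above: fetch the rows once, group them in a PHP array by the column that is compared with '=' (par2), and count matches per data set in memory. Only the table and column names come from the query; the PDO connection and everything else is made up, and the != conditions and the 5-year lower bound are left out for brevity:

<?php
// Rough sketch only. Assumes a PDO connection; DSN and credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

// 1) Fetch all rows once instead of issuing ~200,000 separate queries.
$stmt = $pdo->query("SELECT id, type, par1, par2, par3, par4, par5, date FROM table1");
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// 2) Group rows by par2, since the query matches on par2 = $par3;
//    each data set then only scans rows that can possibly match.
$byPar2 = array();
foreach ($rows as $row) {
    $byPar2[$row['par2']][] = $row;
}

// 3) Count matches for each data set entirely in memory.
foreach ($rows as $current) {
    $counts = array('sum' => 0, 'type1' => 0, 'type2' => 0,
                    'type3' => 0, 'type4' => 0, 'type5' => 0);
    $candidates = isset($byPar2[$current['par3']]) ? $byPar2[$current['par3']] : array();
    foreach ($candidates as $row) {
        if ($row['par4'] >= 800 && $row['par4'] <= 1000 &&
            $row['par5'] >= 300 && $row['par5'] <= 400 &&
            $row['date'] < $current['date']) {   // 'YYYY-MM-DD' strings compare correctly
            $counts['sum']++;
            $counts['type' . $row['type']]++;
        }
    }
    // ... use $counts as the statistics for the current data set ...
}

Whether this actually beats 200,000 individual queries depends on how well all rows fit into PHP memory; the grouping step is what avoids a full 200,000 x 200,000 comparison.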

 


Depends how complex the query is. I just ran a quick test with "SELECT * FROM table" run in a loop 50,000 times, which took 0.364953 secs to complete. That works out to roughly 0.00000729906 secs per query (assuming there's no caching going on).
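
For anyone who wants to reproduce that kind of measurement, a timing loop along these lines should do (this is only a sketch; the connection settings and the query are placeholders, not the ones from the test above):

<?php
// Illustrative timing loop. The DSN, credentials and query are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

$start = microtime(true);
for ($i = 0; $i < 50000; $i++) {
    $pdo->query("SELECT * FROM table1")->fetchAll();
}
$elapsed = microtime(true) - $start;

printf("%d queries in %.6f secs (%.10f secs per query)\n",
       50000, $elapsed, $elapsed / 50000);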

 

Edit: Perhaps start a sub-question in the MySQL board asking how to optimise that query.
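
For example, a composite index covering the '=' column and the date range in the WHERE clause is a common starting point (illustrative only; the index name is made up and the best column choice depends on the actual data):

$sql = "ALTER TABLE table1 ADD INDEX idx_par2_date (par2, date)";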

