Large CSV Import


mort


Hi all

 

So... I am creating an import script for putting contacts into a database. The script we had worked OK for 500kb / 20k row CSV files, but anything much bigger than that started to run into the max execution limit. Rather than just raise that limit, I want to create something that will run in the background and work as efficiently as possible.

 

So basically the CSV file is uploaded, then you choose whether duplicates should be ignored or overwritten, and you match up the fields in the CSV (the first line is a field-title row) to the fields in the database.
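To make the field-matching step concrete, here is a rough sketch of reading the title row and building the mapping form. The file path and the database field names are assumptions, not the actual ones from my script:

```php
<?php
// Sketch of the field-matching step. The file path and $dbFields list
// are hypothetical names used only for illustration.

$handle = fopen('uploads/contacts.csv', 'r');
if ($handle === false) {
    die('Could not open the uploaded file');
}

// The first line is the field-title row
$csvHeaders = fgetcsv($handle);
fclose($handle);

// Database fields the user can map each CSV column to (assumed names)
$dbFields = array('first_name', 'last_name', 'email', 'company', 'phone');

// One <select> per CSV column so the user can match it to a database field
foreach ($csvHeaders as $index => $title) {
    echo '<label>' . htmlspecialchars($title) . '</label> ';
    echo '<select name="map[' . $index . ']">';
    echo '<option value="">-- ignore this column --</option>';
    foreach ($dbFields as $field) {
        echo '<option value="' . $field . '">' . $field . '</option>';
    }
    echo '</select><br />';
}
```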

 

The email address field is singled out, as it is the one checked against duplicates that already exist in the system.

 

It then saves these values, along with the filename, into an import queue table, which is processed by a CRON job. Each batch of the CRON job looks in the queue, finds the first incomplete import, and starts work on that file from where it left off. When the batch is complete it updates the row with a pointer into the file for the next batch, and records how many contacts were imported and how many duplicates there were.
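For reference, a minimal sketch of what one CRON batch does, assuming an import_queue table with id, filename, file_pointer, imported, duplicates and complete columns (those names are assumptions), and an already-open mysql_connect() connection:

```php
<?php
// One CRON batch: resume the first incomplete import, process a chunk of
// rows, then record progress so the next batch can carry on.
$batchSize = 1000;

$result = mysql_query("SELECT id, filename, file_pointer FROM import_queue
                       WHERE complete = 0 ORDER BY id LIMIT 1");
if (!$result || mysql_num_rows($result) == 0) {
    exit; // nothing queued
}
$job = mysql_fetch_assoc($result);

$handle = fopen($job['filename'], 'r');
fseek($handle, $job['file_pointer']); // pick up where the last batch stopped

$imported   = 0;
$duplicates = 0;

for ($i = 0; $i < $batchSize && ($row = fgetcsv($handle)) !== false; $i++) {
    // ... insert $row / check for a duplicate email here ...
    $imported++;
}

$pointer  = ftell($handle);
$complete = feof($handle) ? 1 : 0;
fclose($handle);

// Store the new file pointer and running totals for the next batch
mysql_query("UPDATE import_queue SET
                file_pointer = $pointer,
                imported     = imported + $imported,
                duplicates   = duplicates + $duplicates,
                complete     = $complete
             WHERE id = " . (int) $job['id']);
```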

 

So far so good, but the duplicate checking is massively slowing down the script. I can run 1000 lines of the file in 0.04 seconds without checking, but with checking that rises to 14-15 seconds, and it gets longer the more contacts are in the database.

 

For every line it tries to import, it does a SELECT query on the contacts table, and although I am not doing SELECT *, it still adds up to a lot of DB activity. One thought was to load every email address in the contacts table into an array beforehand, but that table could be massive, so that's likely to be just as inefficient.
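The check currently looks roughly like this, sitting inside the fgetcsv() loop from the sketch above ($emailColumn is the mapped CSV column index; table and column names are assumptions):

```php
<?php
// Per-line duplicate check: one round trip to the database per CSV row.
$email = mysql_real_escape_string($row[$emailColumn]);

$check = mysql_query("SELECT id FROM contacts WHERE email = '$email' LIMIT 1");

if (mysql_num_rows($check) > 0) {
    $duplicates++;   // address already exists: ignore or overwrite
} else {
    // ... INSERT the new contact ...
    $imported++;
}
```

Even with an index on the email column, that is one query and one round trip per row, which is where the 14-15 seconds per 1000 lines goes.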

 

Any ideas on optimising this process?


In your script's error-checking logic, when the mysql_query() call fails and returns false, indicating an error occurred, mysql_errno() will return a specific value when there was a duplicate key. I think it is 1062, but you should test to confirm what value you get by echoing mysql_errno() in a test script.
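Something along these lines would let you test it, assuming a UNIQUE index on the email column (the table and column names here are only placeholders):

```php
<?php
// Plain INSERT against a table with a UNIQUE index on email, then inspect
// mysql_errno() when it fails. 1062 is MySQL's ER_DUP_ENTRY code, but echo
// the value yourself to confirm what you actually get.
$email = mysql_real_escape_string($row[$emailColumn]);

$ok = mysql_query("INSERT INTO contacts (email) VALUES ('$email')");

if ($ok) {
    $imported++;
} elseif (mysql_errno() == 1062) {
    $duplicates++;                              // duplicate key
} else {
    echo mysql_errno() . ': ' . mysql_error();  // some other failure
}
```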

 

It's also possible that if you are using the IGNORE keyword in your query, mysql_affected_rows() will correctly indicate whether the query actually inserted a row, but I don't know this for a fact and you would need to experiment to see if you get the correct indication.

 

 


In another recent thread I used mysql_affected_rows() after a multi-value INSERT query with the IGNORE keyword, and it does correctly indicate how many new rows were inserted.

 

The number of duplicates would be the total count of data values being put into the query, less the number of affected rows.
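A sketch of what that looks like for one batch, assuming a UNIQUE index on contacts.email; $batch, $emailColumn and $nameColumn carry over from the earlier sketches and are assumptions:

```php
<?php
// Multi-value INSERT IGNORE: duplicates are silently skipped, and
// mysql_affected_rows() reports how many rows were actually added.
$values = array();
foreach ($batch as $row) {
    $email = mysql_real_escape_string($row[$emailColumn]);
    $name  = mysql_real_escape_string($row[$nameColumn]);
    $values[] = "('$email', '$name')";
}

mysql_query("INSERT IGNORE INTO contacts (email, first_name)
             VALUES " . implode(', ', $values));

$inserted    = mysql_affected_rows();        // rows actually inserted
$duplicates += count($values) - $inserted;   // duplicates = total - inserted
```

One insert per batch instead of one SELECT per line is also far less database activity overall.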

