Large file to CSV


cyberbob1uk

Hi guys,

 

I am currently receiving a large text file (> 500 MB) once per week, which I have been manually splitting and then processing to obtain the required CSV files. However, this is taking in the region of 2 to 3 hours. Very soon these files will be sent daily, and I really don't have the time to split and process this every day.

 

I have been playing for a while to try and parse everything properly/automatically with fopen, feof and fgets (and the other 'f' functions), but the script never seems to read the file all the way to the end - I assume this is due to memory usage.

 

The data received in the file follows a strict pattern throughout, which is:

BSNY990141112271112270100000 POO2C35    122354000 DMUS   075                   O
BX         NTY                                                                  
LOLANCSTR 1132 11322         TB                                                 
LIMORCMSJ           1135 00000000                                               
LICRNFNJN           1140 00000000                       H                       
LICRNF    1141H1142H     11421142         T                                     
LISDAL    1147H1148H     11481148         T                                     
LIARNSIDE 1152H1153      11531153         T                                     
LIGOVS    1158 1159      11581159         T                                     
LIKTBK    1202 1202H     12021202         T                                     
LICARK    1206 1207      12061207         T                                     
LIULVRSTN 1214H1215H     12151215         T                                     
LIDALTON  1223 1223H     12231223         T                                     
LIDALTONJ           1225 00000000                                               
LIROOSE   1229 1229H     12291229         T             2                       
LTBAROW   1237 12391     TF 

 

That is just one record of information (1 of around 140,000 records). Each record has no fixed number of lines, but each line in each record is fixed at 80 characters, and all lines in a record need to share the same unique 'id' - at present I'm using an md5 hash of microtime.

 

The first line of every record starts with 'BS' and the last line of each record starts with 'LT', terminating with 'TF'. All the other stuff in between also follows a certain pattern, which I can break down effectively.

 

The record above shows one train service schedule, which is why each line in each record needs the same unique id.

 

Anyone got any ideas on how I could process such a file effectively??

 

Many thanks

 

Dave


Define: manually splitting then processing to obtain the required CSV files

 

The weekly files I have been processing on my local machine overnight at work.

 

The original file was split into manageable chunks (usually 10 - 12 separate files).

 

Then each record in each smaller file was made into an array using an md5 hash of microtime to create a unique array key.

 

In the next script I manually changed the name of the smaller file (now a PHP file containing arrays) and included it in the script which put the info into CSV files.

 

Why and what are you splitting the data for? What processing are you doing? Why and what are the 'required' CSV files?

 

Each record in the file contains the stopping pattern for one train service operating between date_a and date_b on day(s)_x  (this info is held within the first line of each record) - one train can have more than 1 pattern depending on the date and day.

 

Because of the number of services and service variations, each record needs a unique id so that different schedules for the same train on different days don't get mixed up.

 

From this, the main CSV files - service details and stopping patterns - are created. Service details contains the necessary parts for that service; stopping patterns contains the timing points for each record in order of passing (these are the lines starting 'LO', 'LI' and 'LT'). Each timing point then has its own CSV containing every service that passes/stops at it.

 

CSVs are then transferred to a database using LOAD DATA INFILE.
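For reference, a minimal sketch of that last step from PHP might look like the following. The connection details, table and column names here are made up for illustration, and LOCAL INFILE has to be enabled on both the client and the server.

<?php
// Minimal sketch only - database, table and column names are hypothetical.
$db = mysqli_init();
// LOCAL INFILE must be allowed on the client side as well as the server.
mysqli_options($db, MYSQLI_OPT_LOCAL_INFILE, true);
mysqli_real_connect($db, 'localhost', 'user', 'pass', 'trains');

$sql = "LOAD DATA LOCAL INFILE 'stopping_patterns.csv'
        INTO TABLE stopping_patterns
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        (service_id, station_code, arrival, departure)";

if (!mysqli_query($db, $sql)) {
    echo 'Load failed: ' . mysqli_error($db);
}
mysqli_close($db);
?>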


I think you could try loading it directly into the database. Every DB has a command that allows this. Your file is well structured and it's ready to be loaded.

You can do the following while loading:

1. Assign automatic, auto-incremented ids.

2. Analyze the information and mark some records as the start/stop of a block - you can do this inside a trigger. You can skip recording certain lines. You can save additional info into other tables, based on information from certain lines... In other words, you can do whatever you wish.

 

500 MB / 80 chars ≈ 6.25 million lines. That's not too much for a good server and a good SQL server :)

 

At the moment you need a lot of additional time to process it via CSV.


OK. Could you show the code that is used for file processing? Maybe something is wrong.

 

You see... I also load a lot of information every day - it's statistics information. One part is loaded automatically at night; I don't know its volume. Another part is loaded under my control in the morning. The second part has a volume of approx. 400-500 MB, and it takes approx. 15-20 minutes to prepare it (a C++ program does the work) and then approx. 10-15 minutes to load it into the DB, into several tables. The DB is PostgreSQL. You see - the time is not so big. I'm telling you this in order to show that I know something about it :).


First I check whether the new file is available; if it is, then:

 

// $handle is the already-opened data file
while(!feof($handle)) {
  $line = stream_get_line($handle, 81);   // read the next line (lines are fixed at 80 characters)
  $lineID = substr($line, 0, 2);          // record-type code, e.g. BS, LO, LI, LT

  if($lineID == 'TI'){
    ti($line);
  }
  if($lineID == 'BS'){
    // start of a new record: generate the unique id shared by all of its lines
    $time = md5(microtime());
    $time = "a$time";

    $details = bs($line,$time);
  }
  if($lineID == 'BX'){
    bx($line,$details);
  }
  if($lineID == 'LO'){
    lo($line,$details);
  }
  if($lineID == 'LI'){
    li($line,$details);
  }
  if($lineID == 'LT'){
    lt($line,$details);
  }
}

 

Each custom function opens the relevant CSV file, appends the needed data, and closes the CSV file.

 

$details returns the md5 hash of the microtime from when 'BS' was detected at the start of the line, therefore giving each line the same id until the next BS line-start is encountered.

 

 


Each custom function opens the relevant CSV file, appends the needed data, and closes the CSV file.

I'm quite sure that it's your problem!!!

You have some millions of lines and you open and close files millions of times... That's very, very bad, because opening and closing files takes time. Many millions of times...

 

You have to open the files at the beginning, before your loop starts, and close them when the loop is finished. Just try it - the preparation time might drop from hours to minutes, or tens of minutes.
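A minimal sketch of that change is shown below. The getHandle() helper and the csvPath()/$row names are made up here: the idea is simply to open each CSV once, keep the handle in an array keyed by file name, and close everything in one go after the loop.

<?php
// Sketch: cache of open CSV handles so each file is opened exactly once.
$handles = array();

function getHandle($file)
{
    global $handles;
    if (!isset($handles[$file])) {
        $handles[$file] = fopen($file, 'a');   // open once, in append mode
    }
    return $handles[$file];
}

// Inside the main read loop, instead of fopen/fwrite/fclose on every line:
//   fputcsv(getHandle(csvPath($stationCode)), $row);   // csvPath() is a hypothetical helper

// After the loop has finished, close everything once:
foreach ($handles as $h) {
    fclose($h);
}
?>

One caveat: with thousands of different CSVs this approach can run into the operating system's open-file limit, in which case the cache would need to close its least recently used handles.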


There are some things in the posted code and in the implied code that are not efficient for processing a large amount of data -

 

1) Reading the file one line at a time (you should read a large chunk of the file at one time, such as 100k - 200k characters),

 

2) As SergeiSS pointed out - opening and closing files repeatedly,

 

3) Once you do have the data preprocessed, you should either get the database engine to import the data (see the LOAD DATA LOCAL INFILE query for MySQL) or, if you must use PHP code to perform the insert/replace queries, use the multi-value form of the INSERT/REPLACE statement with a few thousand records in each query (a rough sketch follows this list).
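A rough sketch of that multi-value approach, where the mysqli connection, table and column names are all hypothetical:

<?php
// Sketch: accumulate value tuples and flush a few thousand rows per INSERT.
$batchSize = 2000;
$batch = array();

function flushBatch($db, array &$batch)
{
    if (empty($batch)) {
        return;
    }
    $sql = "INSERT INTO stopping_patterns (service_id, station_code, arrival, departure) VALUES "
         . implode(',', $batch);
    mysqli_query($db, $sql);
    $batch = array();
}

// Inside the processing loop, for each parsed line (values escaped first,
// e.g. with mysqli_real_escape_string):
//   $batch[] = "('$serviceId','$station','$arrive','$depart')";
//   if (count($batch) >= $batchSize) { flushBatch($db, $batch); }

// After the loop, flush whatever is left:
//   flushBatch($db, $batch);
?>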

 

We can only directly help if we know what portion of that data you are actually keeping and what processing you are doing (i.e. we'd need your existing code to reproduce the problem). Also, if this data is coming from a standard/published source and you mention the source, someone might know of an existing script that deals with it.

 

Just an FYI, I created a test data file with 140k copies of the sample data you posted and was able to read through the ~176 MB / ~2M line file in blocks of 200k characters at a time, breaking the data into lines in an array ready to be processed, in 4 seconds for the whole file when using an explode statement to break up the lines (14 seconds when using a preg_split statement).
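For what it's worth, a sketch of that block-reading approach is below (the file name, block size and the processLine() stub are placeholders). The one subtlety is carrying the partial last line of each block over into the next read.

<?php
// Sketch: read the file in large blocks, split into lines with explode(),
// and carry any incomplete trailing line over to the next block.
function processLine($line)
{
    // placeholder for the BS/BX/LO/LI/LT handling
}

$fh    = fopen('schedule.txt', 'rt');
$carry = '';

while (!feof($fh)) {
    $block = $carry . fread($fh, 200000);   // ~200k characters per read
    $lines = explode("\n", $block);
    $carry = array_pop($lines);             // last element may be a partial line
    foreach ($lines as $line) {
        processLine(rtrim($line, "\r"));
    }
}
if ($carry !== '') {
    processLine($carry);                    // whatever remained after the final read
}
fclose($fh);
?>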


PFMa, you say it's better to read in 200k chars at a time, but how do you deal with the situation where you don't get the full last row? Do you somehow glue it onto the next chunk you read, or?

The lines are max 80 chars, but could they be less, unless padding is used to reach 80 chars per line?

 

What does 2MB refer to?


But you don't really need it.

 

I think the manual splitting of the file is unnecessary. The processor is a conglomeration of ifs, which should be optimized - using a switch/case, or just using continue to avoid any further if comparisons.

 

You should be looking at your manual work and looking to automate that.

<?php
  $fh=fopen('sample.txt','rt');
  $details=array();
  $insection=FALSE;
  $ctr=0;
  while(!feof($fh))
  {
    $line=fgets($fh);
    $id=substr($line,0,2);
    if($id!=='BS' && !$insection)
      continue;
    switch($id)
    {
      case 'BS':
        // Initial processing: generate the unique id shared by this record's lines
        $id=md5(microtime());
        $insection=TRUE;
        break;
      case 'TI':
      case 'BX':
      case 'LO':
      case 'LI':
        // Each section should have its own processor
        break;
      case  'LT':
        // Do Final Processing Here
        $ctr++;
        $insection=FALSE;
    }
  }
  fclose($fh);
  echo "Processed $ctr sections";
?>

This should handle each section of the BIG FILE - just add your processing code.

 

 


After typing and testing for a number of hours, the problem lies with the number of files that need creating.

 

Each line in the data that starts with LO, LI or LT is a station or timing point; removing the LO, LI or LT gives the unique reference to that station.

 

LOLANCSTR becomes LANCSTR                                            
LIMORCMSJ becomes MORCMSJ
LTBAROW becomes BAROW

 

Each of these station/timing points needs its own table (containing the rest of the data on the same row), as users on the site will search by station/timing point initially, and then extra info is collected from the other tables using the unique id.

 

I ran the file creation script so that all LO, LI and LT lines were written to one file, and this resulted in 2.2 million lines of data (in a matter of 5 seconds).

 

After eventually getting that file into a database (LOAD DATA LOCAL INFILE failed and the database table I was trying to insert into crashed, with no lines inserted), I performed a couple of test searches, but it was taking 10 seconds to return just the bare basic info - and that was before I got the additional info from the other tables and echoed the results to the browser.

 

There are approximately 5,500 different timing points for which tables are needed - there has to be a better way for me to create these?


We cannot help specifically unless you show us exactly what you are doing that you need help with.

 

However, it sounds like you are creating multiple tables to hold data with the same meaning. That is a bad database design and won't ever produce results efficiently.
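To make that concrete, the usual alternative is one table with the timing-point code as an indexed column rather than ~5,500 separate tables. A sketch follows, with an entirely hypothetical schema (the column names and types are not the poster's actual layout):

<?php
// Sketch only - schema is hypothetical: one row per stop, indexed by the
// station/timing-point code, instead of one table per timing point.
$db = new mysqli('localhost', 'user', 'pass', 'trains');

$db->query("
    CREATE TABLE IF NOT EXISTS stopping_patterns (
        service_id   CHAR(33)   NOT NULL,   -- the 'a' + md5 record id used earlier
        station_code VARCHAR(8) NOT NULL,   -- e.g. LANCSTR, MORCMSJ, BAROW
        arrival      CHAR(5),
        departure    CHAR(5),
        INDEX idx_station (station_code),
        INDEX idx_service (service_id)
    )
");

// A search by station then becomes a single indexed lookup:
$station = 'LANCSTR';
$stmt = $db->prepare('SELECT * FROM stopping_patterns WHERE station_code = ?');
$stmt->bind_param('s', $station);
$stmt->execute();
?>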

