
Hey all. I am racking my brain trying to figure out the best way to handle a certain process my company needs me to do. First I will give a general description of the issue, then I will get into specifics.

 

Basically I have a file that contains a bunch of data. The data stored in the file is consistent: each entry has X elements with a total size of Y characters, and element 1 in each entry always takes up the same number of characters.

 

So if I have 2 elements in each data entry, each 5 characters long, then each data entry is 10 characters. Which means that to access the 2nd element of the 3rd entry I would grab characters 26-30 (offset 25, length 5).
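For example (a rough sketch; the field widths and file name here are made up for illustration), pulling a single element straight out of the file in PHP could look like:

<?php
// Hypothetical layout: 2 elements of 5 characters each, so 10 characters per entry.
$fieldWidths = array(5, 5);
$entrySize   = array_sum($fieldWidths);   // 10

// 2nd element of the 3rd entry: entry offset 2 * 10, plus 5 to skip element 1 = 25.
$entryIndex   = 2;   // 0-based
$elementIndex = 1;   // 0-based
$offset = $entryIndex * $entrySize
        + array_sum(array_slice($fieldWidths, 0, $elementIndex));

$fp = fopen('data.dat', 'rb');            // hypothetical file name
fseek($fp, $offset);
$value = rtrim(fread($fp, $fieldWidths[$elementIndex]));
fclose($fp);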

 

I would want to be able to search this data for a number of things. Basically I'd want to be able to say, "give me all results where element 1 = 'pancakes'", or where element 2 > 8, or more substantial searches such as element 1 > 10 AND element 2 < 5.

 

Is there a good way to do this?

 

My first line of thought would be to run through the file, grabbing every element that can be searched on. So let's say elements 2 and 4 can be searched on, and I have 10 entries; I would run through and build an array holding the values of elements 2 and 4 for each entry.

 

$index = array(
    array('index' => 0, 'element2' => 10, 'element4' => 12),
    array('index' => 1, 'element2' => 11, 'element4' => -2),
    array('index' => 2, 'element2' => 0,  'element4' => 7)
);
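Building that array would just be one sequential pass over the file; something like this sketch (again, the widths, the searchable positions and the file name are placeholders):

<?php
$fieldWidths = array(8, 4, 12, 4);        // hypothetical widths of elements 1-4
$entrySize   = array_sum($fieldWidths);
$searchable  = array(1, 3);               // 0-based positions of element2 and element4

$index = array();
$fp = fopen('data.dat', 'rb');            // hypothetical file name
$i = 0;
while (($entry = fread($fp, $entrySize)) !== false && strlen($entry) === $entrySize) {
    $row = array('index' => $i++);
    foreach ($searchable as $pos) {
        $offset = array_sum(array_slice($fieldWidths, 0, $pos));
        $row['element' . ($pos + 1)] = (int) trim(substr($entry, $offset, $fieldWidths[$pos]));
    }
    $index[] = $row;
}
fclose($fp);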

 

Then, depending on the search, I would reorder the array and search through it. So if I wanted to see if element2 was greater than 8, I'd reorder by element2 and then search. As if this wasn't difficult enough, the issue would be on complex searches such as 'element1>10 OR (element2=6 AND element3=0)'.
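In PHP that part is pretty mechanical; a sketch of the 'element2 > 8' case, reusing the small $index array above:

<?php
$index = array(
    array('index' => 0, 'element2' => 10, 'element4' => 12),
    array('index' => 1, 'element2' => 11, 'element4' => -2),
    array('index' => 2, 'element2' => 0,  'element4' => 7)
);

// Reorder by element2...
usort($index, function ($a, $b) { return $a['element2'] - $b['element2']; });

// ...then collect the record numbers where element2 > 8.
$matches = array();
foreach ($index as $row) {
    if ($row['element2'] > 8) {
        $matches[] = $row['index'];
    }
}
// $matches is now array(0, 1)

For a single comparison the sort isn't strictly needed (a one-pass array_filter() gives the same result); it only starts paying off if the sorted copy gets binary-searched for many queries on the same element.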

 

I suppose I could store the result array of each individual search and then merge/union/intersect them based on the AND/OR.
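In PHP terms those would just be set operations on the matched record numbers; a sketch, assuming each sub-search returns an array of record numbers like $matches above:

<?php
// Hypothetical per-condition results (arrays of record numbers).
$element1Matches = array(0, 2, 5, 9);     // element1 > 10
$element2Matches = array(2, 3, 5);        // element2 = 6
$element3Matches = array(1, 2, 5, 7);     // element3 = 0

// element2 = 6 AND element3 = 0  ->  intersection
$andResult = array_intersect($element2Matches, $element3Matches);    // 2, 5

// element1 > 10 OR ( ... )       ->  union
$orResult = array_unique(array_merge($element1Matches, $andResult)); // 0, 2, 5, 9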

 

To make things even CRAZIER, the flat files can have other flat files as children. So if I have element1 in file 1 and element1 in file 2, all of file 2's data would be part of the corresponding data in file 1.

 

So, I'd basically need to be able to "join" files. I have thought of a number of solutions, but I'm not sure how they will pan out.
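One of those solutions, sketched out in PHP (purely illustrative; the keys and fields here are made up, and in reality each row would be unpacked from its flat file as above), would be to index the child file by the shared element and hang its rows off the parent rows:

<?php
$file1Rows = array(
    array('key' => 'A100', 'desc' => 'widget'),
    array('key' => 'A200', 'desc' => 'gadget')
);
$file2Rows = array(
    array('key' => 'A100', 'qty' => 3),
    array('key' => 'A100', 'qty' => 7),
    array('key' => 'A200', 'qty' => 1)
);

// Index the child file by the shared element...
$children = array();
foreach ($file2Rows as $row) {
    $children[$row['key']][] = $row;
}

// ...then attach the matching child rows to each parent row.
foreach ($file1Rows as &$parent) {
    $parent['children'] = isset($children[$parent['key']]) ? $children[$parent['key']] : array();
}
unset($parent);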

 

Ugh.

 

In case you're wondering WHY I need to do this: my company runs a lot of data off an old database called filepro, which uses flat files with index and map files. Right now we have a few systems which we've "synched" with mysql via batch updates that run between the two databases to keep the data consistent.

 

We won't be coming "off" of filepro for a long, long time. But a lot of my php/mysql apps need access to the data in filepro. So I'm trying to find a way to quickly search and pull data from filepro into php. Basically, I want to run search queries on the flat files.

 

Basically I'd like your input on a number of things:

 

Is this even a reasonable attempt?

 

Is the idea of pulling the searchable elements into php to actually run the search, then collecting the matching rows and reading their data directly from the file, a smart move?

 

Once I have the searchable elements and index, what algorithms are good for actually running the search?


Well, the problem is the data in the documents is constantly changing. Right now we run a cron which pulls the data into mysql every night for the bigger tables. I suppose I could just do that for all of them. It runs the risk of the data not being 100% up to date in all regards, though. I guess we could just increase the cron frequency.


The text files are created by a database system called filepro. Our entire company pretty much runs on filepro right now but we are migrating parts to php/mysql.

 

To be honest, I am not 100% sure of the depth of use, but I do know it's pretty massive. Pretty much anything that has to do with our inventory goes through this database.

 

New products and items are being added every day. Inventory amounts are changing constantly. Customers are being added, updated. Leads, opportunities, everything is being stored and tracked.

 

I have written an update that pulls our product list out of this system into mysql, and then we dump it back to filepro every day so other systems can have the updated inventory list. So we've migrated that to mysql, but it's an ugly batch solution. If someone adds a product at noon, and then someone wants to sell that product at 2, they can't without requesting that we synch the databases.

 

To be honest, on the project I am working on right now, I can simply pull in the data; the tables I'm working with won't be updated often at all. I am more looking for a long-term solution to this issue. Moving everything off filepro would be optimal, but we have 100s of processes which right now utilize the filepro database, so we cannot simply move off of it. We have to move one system at a time and find ways to keep synching the data for future use.

 

Or, if I could find a way to search this data from PHP, we would not need to migrate from the current database. And even then, migrating would be easier because I could develop my applications to easily start pulling from mysql when we are ready to switch. Which is what I am attempting to do now.


From meeting with my boss today, I am pretty sure our version of filepro does not have ODBC support. Plus, I believe the ODBC support filepro does have is simply for reading ODBC-supported DBs INTO filepro, and it does not have a way to communicate back out.

 

Though I will look into that a bit further. I believe we had this discussion internally a few months back and it was determined that the ODBC support was only one way. Thanks for looking into that though; most people don't even know what filepro is :)


So, due to some other projects probably requiring this in the future, I've decided that I need to look into filepro's index files to have any chance at real-time access to the data.

 

I am having some issues interpreting the files to determine how to store/traverse the index tree.

 

Documentation on it can be found here:

https://www.fptech.com/Products/Docs/FPFormat/autoix45.shtml#sort

 

Basically it seems to be a tree that consists of 3 elements:

 

non-leaf nodes

leaf nodes

a free chain.

 

The free chain is a linked list of nodes. I don't get its purpose or how to use it to properly search for data.

 

Non-leaf nodes contain a left node and a series of other nodes. I am GUESSING the left node goes back to the "parent" node and that the nodes to the right of it are the children. I would also guess that a non-leaf node would not contain more than ONE non-leaf node as a child, and that the rest would be leaf nodes.

 

Leaf nodes contain the key data, a backwards pointer (to the parent) and a forward pointer (to the next node, which would exist if multiple records contain the same index).
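To make my reading of the docs concrete, I imagine a key lookup walking the structure roughly like this (the node layout and field names here are my guesses, not filepro's actual on-disk format):

<?php
// Guessed shapes:
//   non-leaf node: array('keys' => array(...), 'children' => array(...))
//                  where children[i] covers keys below keys[i], and the last child covers the rest
//   leaf node:     array('key' => ..., 'record' => ..., 'next' => leaf-or-null)
//                  with 'next' chaining leaves that hold duplicates of the same key
function lookupKey(array $node, $wanted) {
    // Descend through the non-leaf nodes until we reach a leaf.
    while (isset($node['children'])) {
        $i = 0;
        while ($i < count($node['keys']) && $wanted >= $node['keys'][$i]) {
            $i++;
        }
        $node = $node['children'][$i];
    }
    // If the leaf's key matches, follow the forward pointers to collect duplicates.
    $records = array();
    for ($leaf = $node; $leaf !== null && $leaf['key'] === $wanted; $leaf = $leaf['next']) {
        $records[] = $leaf['record'];
    }
    return $records;
}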

 

Is my interpretation of the documentation correct? Or am I way off? Also, what the heck is the point/use of the free-chain list?

 
