
Hey all. I am racking my brain trying to figure out the best way to handle a certain process my company needs me to do. First I will give a general description of the issue, then I will get into specifics.

 

Basically I have a file that contains a bunch of data. The data stored in the file is consistent: each entry has X elements with a total size of Y characters, and element 1 in each entry always takes up the same number of characters.

 

So if I have 2 elements in each data entry, each 5 characters long, then each data entry is 10 characters. Which means that to access the 2nd element of the 3rd entry I would grab characters 26-30 (offset 25, length 5).
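For example (a rough sketch; the field widths and file name here are made up for illustration), pulling a single element straight out of the file in PHP could look like:

<?php
// Hypothetical layout: 2 elements of 5 characters each, so 10 characters per entry.
$fieldWidths = array(5, 5);
$entrySize   = array_sum($fieldWidths);   // 10

// 2nd element of the 3rd entry: entry offset 2 * 10, plus 5 to skip element 1 = 25.
$entryIndex   = 2;   // 0-based
$elementIndex = 1;   // 0-based
$offset = $entryIndex * $entrySize
        + array_sum(array_slice($fieldWidths, 0, $elementIndex));

$fp = fopen('data.dat', 'rb');            // hypothetical file name
fseek($fp, $offset);
$value = rtrim(fread($fp, $fieldWidths[$elementIndex]));
fclose($fp);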

 

I would want to be able to search this data for a number of things. Basically I'd want to be able to say, "give me all results where element 1 = 'pancakes'", or where element 2 > 8, or more substantial searches such as element 1 > 10 AND element 2 < 5.

 

Is there a good way to do this?

 

My first line of thought would be to run through the file, grabbing every element that can be searched on. So let's say elements 2 and 4 can be searched on, and I have 10 entries; I would run through and build an array holding the values of elements 2 and 4 for each entry.

 

$index = array(
    array('index' => 0, 'element2' => 10, 'element4' => 12),
    array('index' => 1, 'element2' => 11, 'element4' => -2),
    array('index' => 2, 'element2' => 0,  'element4' => 7)
);
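Building that array would just be one sequential pass over the file; something like this sketch (again, the widths, the searchable positions and the file name are placeholders):

<?php
$fieldWidths = array(8, 4, 12, 4);        // hypothetical widths of elements 1-4
$entrySize   = array_sum($fieldWidths);
$searchable  = array(1, 3);               // 0-based positions of element2 and element4

$index = array();
$fp = fopen('data.dat', 'rb');            // hypothetical file name
$i = 0;
while (($entry = fread($fp, $entrySize)) !== false && strlen($entry) === $entrySize) {
    $row = array('index' => $i++);
    foreach ($searchable as $pos) {
        $offset = array_sum(array_slice($fieldWidths, 0, $pos));
        $row['element' . ($pos + 1)] = (int) trim(substr($entry, $offset, $fieldWidths[$pos]));
    }
    $index[] = $row;
}
fclose($fp);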

 

Then, depending on the search, I would reorder the array and search through it. So if I wanted to see if element2 was greater than 8, I'd reorder by element2 and then search. As if this wasn't difficult enough, the issue would be on complex searches such as 'element1>10 OR (element2=6 AND element3=0)'.
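In PHP that part is pretty mechanical; a sketch of the 'element2 > 8' case, reusing the small $index array above:

<?php
$index = array(
    array('index' => 0, 'element2' => 10, 'element4' => 12),
    array('index' => 1, 'element2' => 11, 'element4' => -2),
    array('index' => 2, 'element2' => 0,  'element4' => 7)
);

// Reorder by element2...
usort($index, function ($a, $b) { return $a['element2'] - $b['element2']; });

// ...then collect the record numbers where element2 > 8.
$matches = array();
foreach ($index as $row) {
    if ($row['element2'] > 8) {
        $matches[] = $row['index'];
    }
}
// $matches is now array(0, 1)

For a single comparison the sort isn't strictly needed (a one-pass array_filter() gives the same result); it only starts paying off if the sorted copy gets binary-searched for many queries on the same element.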

 

I suppose I could store the result array of each individual search and then merge/union/intersect them based on the AND/OR.
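In PHP terms those would just be set operations on the matched record numbers; a sketch, assuming each sub-search returns an array of record numbers like $matches above:

<?php
// Hypothetical per-condition results (arrays of record numbers).
$element1Matches = array(0, 2, 5, 9);     // element1 > 10
$element2Matches = array(2, 3, 5);        // element2 = 6
$element3Matches = array(1, 2, 5, 7);     // element3 = 0

// element2 = 6 AND element3 = 0  ->  intersection
$andResult = array_intersect($element2Matches, $element3Matches);    // 2, 5

// element1 > 10 OR ( ... )       ->  union
$orResult = array_unique(array_merge($element1Matches, $andResult)); // 0, 2, 5, 9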

 

To make things even CRAZIER, the flat files can have other flat files as children. So if I have element1 in file 1 and element1 in file 2, all of file 2's data would be part of the corresponding data in file 1.

 

So, I'd basically need to be able to "join" files. I have thought of a number of solutions, but I'm not sure how they will pan out.
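One of those solutions, sketched out in PHP (purely illustrative; the keys and fields here are made up, and in reality each row would be unpacked from its flat file as above), would be to index the child file by the shared element and hang its rows off the parent rows:

<?php
$file1Rows = array(
    array('key' => 'A100', 'desc' => 'widget'),
    array('key' => 'A200', 'desc' => 'gadget')
);
$file2Rows = array(
    array('key' => 'A100', 'qty' => 3),
    array('key' => 'A100', 'qty' => 7),
    array('key' => 'A200', 'qty' => 1)
);

// Index the child file by the shared element...
$children = array();
foreach ($file2Rows as $row) {
    $children[$row['key']][] = $row;
}

// ...then attach the matching child rows to each parent row.
foreach ($file1Rows as &$parent) {
    $parent['children'] = isset($children[$parent['key']]) ? $children[$parent['key']] : array();
}
unset($parent);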

 

Ugh.

 

In case you're wondering WHY I need to do this: my company runs a lot of data off an old database called filepro, which uses flat files with index and map files. Right now we have a few systems which we've "synched" with mysql via batch updates that run between the two databases to keep the data consistent.

 

We won't be coming "off" of filepro for a long, long time. But a lot of my php/mysql apps need access to the data in filepro. So I'm trying to find a way to quickly search and pull data from filepro into php. Basically, I want to run search queries on the flat files.

 

Basically I'd like your input on a number of things:

 

Is this even a reasonable attempt?

 

Is the idea of pulling the searchable elements into php to actually run the search, then collecting the matching rows and reading their data directly from the file, a smart move?

 

Once I have the searchable elements and index, what algorithms are good for actually running the search?


Well, the problem is the data in the documents is constantly changing. Right now we run a cron which pulls the data into mysql every night for the bigger tables. I suppose I could just do that for all of them. It runs the risk of the data not being 100% up to date in all regards, though. I guess we could just increase the cron frequency.


The text files are created by a database system called filepro. Our entire company pretty much runs on filepro right now but we are migrating parts to php/mysql.

 

To be honest, I am not 100% sure of the depth of use, but I do know it's pretty massive. Pretty much anything that has to do with our inventory goes through this database.

 

New products and items are being added every day. Inventory amounts are changing constantly. Customers are being added, updated. Leads, opportunities, everything is being stored and tracked.

 

I have written an update that pulls our product list out of this system into mysql, and then we dump it back to filepro every day so other systems can have the updated inventory list. So we've migrated that to mysql, but it's an ugly batch solution. If someone adds a product at noon, and then someone wants to sell that product at 2, they can't without requesting that we synch the databases.

 

To be honest, on the project I am working on right now, I can simply pull in the data; the tables I'm working with won't be updated often at all. I am more looking for a long-term solution to this issue. Moving everything off filepro would be optimal, but we have 100s of processes which right now utilize the filepro database, so we cannot simply move off of it. We have to move one system at a time and find ways to keep synching the data for future use.

 

Or, if I could find a way to search this data from PHP, we would not need to migrate from the current database. And even then, migrating would be easier because I could develop my applications to easily start pulling from mysql when we are ready to switch. Which is what I am attempting to do now.


From meeting with my boss today, I am pretty sure our version of filepro does not have ODBC support. Plus, I believe the ODBC support filepro does have is simply for reading ODBC-supported DBs INTO filepro, and it does not have a way to communicate back out.

 

Though I will look into that a bit further. I believe we had this discussion internally a few months back and it was determined that the ODBC support was only one way. Thanks for looking into that though; most people don't even know what filepro is :)


So, due to some other projects probably requiring this in the future, I've decided that I need to look into filepro's index files to have any chance at real-time access to the data.

 

I am having some issues interpreting the files to determine how to store/traverse the index tree.

 

Documentation on it can be found here:

https://www.fptech.com/Products/Docs/FPFormat/autoix45.shtml#sort

 

Basically it seems to be a tree that consists of 3 elements:

 

non-leaf nodes

leaf nodes

a free chain.

 

The free chain is a linked list of nodes. I don't get its purpose or how to use it to properly search for data.

 

Non-leaf nodes contain a left node and a series of other nodes. I am GUESSING the left node goes back to the "parent" node and that the nodes to the right of it are the children. I would also guess that a non-leaf node would not contain more than ONE non-leaf node as a child, and that the rest would be leaf nodes.

 

Leaf nodes contain the key data, a backwards pointer (to the parent) and a forward pointer (to the next node, which would exist if multiple records contain the same index).
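To make my reading of the docs concrete, I imagine a key lookup walking the structure roughly like this (the node layout and field names here are my guesses, not filepro's actual on-disk format):

<?php
// Guessed shapes:
//   non-leaf node: array('keys' => array(...), 'children' => array(...))
//                  where children[i] covers keys below keys[i], and the last child covers the rest
//   leaf node:     array('key' => ..., 'record' => ..., 'next' => leaf-or-null)
//                  with 'next' chaining leaves that hold duplicates of the same key
function lookupKey(array $node, $wanted) {
    // Descend through the non-leaf nodes until we reach a leaf.
    while (isset($node['children'])) {
        $i = 0;
        while ($i < count($node['keys']) && $wanted >= $node['keys'][$i]) {
            $i++;
        }
        $node = $node['children'][$i];
    }
    // If the leaf's key matches, follow the forward pointers to collect duplicates.
    $records = array();
    for ($leaf = $node; $leaf !== null && $leaf['key'] === $wanted; $leaf = $leaf['next']) {
        $records[] = $leaf['record'];
    }
    return $records;
}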

 

Is my interpretation of the documentation correct? Or am I way off? Also, what the heck is the point/use of the free-chain list?

 
