PHP searching through HTML


kigogu

Hi, I was wondering what would be a good way to search through HTML for specific tags using PHP so that I can insert some other HTML into them? I have done some parsing of HTML that was converted into a string, but the way I did it might not be fully efficient lol. So I was wondering how everyone else would go about this? I don't need a complete in-depth description, but at least a pointer to what I could do, so that I can search for it on my own.

Link to comment
Share on other sites

It wasn't sarcasm, I always use regex to parse HTML. The DOM is undocumented (last I checked) and clunky (last I used it), and craps out on malformed HTML which renders perfectly in the browser.

 

In fact, I worked for 3 years writing specialized web spiders in PHP without using the DOM once.

 

 

Link to comment
Share on other sites

lol fair enough, to be honest it's just some crap that I read and assumed to be true,  but we all know what they say about that word.

 

 

Yeah, last time I used the DOM I had to use the error suppression operator on the loadHTML method, for the first time in about 4 years of coding.
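
For what it's worth, one way to avoid the @ operator is to have libxml collect its parse warnings internally. A minimal sketch, assuming the DOM extension is available; the helper name load_html_quietly is made up for illustration:

```php
<?php
// Hypothetical helper: load possibly malformed HTML without the @ operator.
function load_html_quietly(string $html): DOMDocument
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the earlier setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Unclosed <p> and <b>: sloppy markup that a browser would still render.
$doc = load_html_quietly('<div id="left_column"><p>Hello <b>world</div>');
echo $doc->getElementsByTagName('p')->length, "\n";
```

If the warnings matter, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.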

Link to comment
Share on other sites

Someone who's really good with regex can work magic on all sorts of problems.  Parsing HTML is nothing.

 

The only thing PHP's regex cannot handle is counting matched tags, or nested tags.  If you need that, then the DOM is better, assuming the HTML is well-formed.

Link to comment
Share on other sites

If you have PHP compiled --with-tidy, you can use tidy to clean the HTML before using DOMDocument, or use tidy itself.

 

DOMDocument is far more powerful than RegEx when dealing with HTML. That isn't to say you can't solve your problem with RegEx.
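
A rough sketch of the tidy-then-DOMDocument idea, in case it helps. It assumes the tidy options shown suit the input, and it falls back to the raw markup when the tidy extension is not loaded; the helper name load_tidied_html is made up for illustration:

```php
<?php
// Hypothetical helper: repair markup with tidy when available, then parse it.
function load_tidied_html(string $html): DOMDocument
{
    if (function_exists('tidy_repair_string')) {
        // Assumption: these tidy options suit the input; adjust as needed.
        $repaired = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
        if (is_string($repaired)) {
            $html = $repaired;
        }
    }

    // Keep any remaining parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_tidied_html('<div id="left_column"><p>unclosed paragraph</div>');
echo $doc->getElementsByTagName('p')->length, "\n";
```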

Link to comment
Share on other sites

If OP really is just finding a single tag and inserting something after it, regex is better (and faster) than DOM.  Though it's going to become a religious argument if we keep talking about it.
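
For the single-tag case, a regex version could look roughly like this. It is only a sketch: it assumes the id sits on a div and is quoted, and the helper name insert_after_opening_div is made up for illustration:

```php
<?php
// Hypothetical helper: insert $insert right after the opening <div> with the given id.
function insert_after_opening_div(string $html, string $id, string $insert): string
{
    // Match an opening div whose id attribute equals $id, in single or double quotes.
    $pattern = '/<div\b[^>]*\sid\s*=\s*(["\'])' . preg_quote($id, '/') . '\1[^>]*>/i';

    // A callback keeps "$1" or "\1" inside $insert from being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        static fn(array $m): string => $m[0] . "\n" . $insert,
        $html,
        1 // stop after the first match
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo insert_after_opening_div(
    '<div id="left_column"></div>',
    'left_column',
    '<a href="http://www.google.com/">Google</a>'
), "\n";
```

Since it only looks at the opening tag, it does not need to count nested or closing tags.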
Link to comment
Share on other sites

Sorry for the super late reply to all of this, been pretty busy lately, but what I wanted to do is have something like:

<html>
<body>
<div id="container">
<div id="left_column">
</div>
<div id="right_column">
</div>
</div>
</body>
</html> 

 

as my HTML code, and then be able to search through everything to find, say, a div whose id is "left_column". Once I have that, I can insert some HTML tags like:

 

<a href="http://www.google.com/">Google</a> 

 

that would make my total html document be:

 

<html>
<body>
<div id="container">
<div id="left_column">
<a href="http://www.google.com/">Google</a> 
</div>
<div id="right_column">
</div>
</div>
</body>
</html> 

 

Hopefully that makes it a little bit more clear what I was asking?
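
With DOMDocument, the before/after shown above might be sketched like this. It assumes the DOM extension is available and uses DOMXPath for the id lookup; the variable names are only illustrative:

```php
<?php
// The layout from the post above.
$html = <<<'HTML'
<html>
<body>
<div id="container">
<div id="left_column">
</div>
<div id="right_column">
</div>
</div>
</body>
</html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// Find the div by its id attribute.
$xpath = new DOMXPath($doc);
$target = $xpath->query('//div[@id="left_column"]')->item(0);

if ($target !== null) {
    // Build <a href="http://www.google.com/">Google</a> and append it inside the div.
    $link = $doc->createElement('a', 'Google');
    $link->setAttribute('href', 'http://www.google.com/');
    $target->appendChild($link);
}

$result = $doc->saveHTML();
echo $result;
```

Note that saveHTML() may add a doctype and adjust whitespace, so the output will not be byte-for-byte the same as the input plus the link.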

Link to comment
Share on other sites

What are you ultimately trying to accomplish?  I can't see why you'd want to build a page this way.

 

I think I understand what you're saying... you want to be able to take some bare HTML code and alter it by finding certain <div> tags (or whatever) and injecting more HTML within those tags to create a page?  I'm assuming this page will be dynamic and change often based on *something*, whether it be a specific user action, or something along those lines?

 

It also depends on how many items you want to change in any single page load.  Do I dare ask if this is to be a static page?  Or are you scraping an external webpage layout to use on your own site?

Link to comment
Share on other sites

Yeah, that is pretty much what I am trying to do lol

 

What I would like to accomplish is to have an automated way to create a webpage, so that I wouldn't need to go in there myself to change things lol. Mainly I'm just trying to eliminate the need for me to do anything. For example, I want the same layout that I have, but I want to move the current fields around or add in anything new. Instead of manually going into my HTML and changing what I need to, I would like to run some kind of script that I pass the fields I want to see, and it would change the page for me. Also, yes, I would like these pages to be dynamic, which this would also help with: one page shows this, but once someone does some kind of action it changes what they see (I know there are other ways to go about this part than what I am asking for).

 

I'm not trying to scrape some layout to use for myself. I would definitely have my own designs and layout, just the content would need to be changed.

Link to comment
Share on other sites

Well, there are much simpler ways of creating dynamic layouts than regex.  This is actually the first time I've ever heard of somebody attempting this with your reasoning.

 

In the time it would take you to plug the updated variables into your "system", you could probably have just made the changes to your HTML page anyway.  Maybe +/- 30 seconds.  Thing is, at some point you will have to make the changes manually, unless you have some sort of AI built in.

 

For a simple dynamic website, I would simply recommend a database-driven setup.  However, how large do you expect this site to be, and how many pages?  Anything under 5-10 pages, and it's hardly worth creating anything dynamically driven.  Static pages with simple included header/footer files would suffice.  And unless your content is changing regularly, I'm sure some updates could be easily managed.
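
As a rough sketch of the header/footer idea: each page includes the shared files around its own content. The helper name render_with_partials and the file names in the usage comment are assumptions made up for illustration:

```php
<?php
// Hypothetical helper: wrap page-specific content in shared header/footer files.
function render_with_partials(string $headerFile, string $footerFile, string $content): string
{
    // Buffer the output so the finished page comes back as one string.
    ob_start();
    include $headerFile;
    echo $content;
    include $footerFile;

    return ob_get_clean();
}

// Usage, assuming header.php and footer.php exist next to this script:
// echo render_with_partials(__DIR__ . '/header.php', __DIR__ . '/footer.php', '<p>Page body</p>');
```

Changing header.php or footer.php then changes every page that includes them.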

Link to comment
Share on other sites

Well, the reason I would like to stay away from inputting everything manually is the load that I am planning on getting. For example, I would need to make 30 pages, or like 10 websites, all with the same design. If the content changes a lot, or the pages are similar but one or two fields are in different places, then an automated way would work really well instead of having to manually create each page. If I just know where things need to go, then I can input what I need and where it goes, and my script would place it there without much effort from me. This is so I can use the limited resources I have in other places where they are needed more.

 

Searching using DOM or regex for what I want (I think just looking up an id value would suffice) seems pretty straightforward, and I don't think I would need that much help with it. The only problem I run into is inserting whatever I want into the location that I find.
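
On the inserting part, one option is to parse the markup string into a document fragment and append it to the node that was found. This is only a sketch: the helper name insert_into_id is made up, it assumes ids contain no double quotes, and appendXML() only accepts well-formed XML:

```php
<?php
// Hypothetical helper: append a markup string inside the element with the given id.
function insert_into_id(string $html, string $id, string $markup): string
{
    // Assumption: ids never contain a double quote; one would break the XPath below.
    if (strpos($id, '"') !== false) {
        return $html;
    }

    $previous = libxml_use_internal_errors(true);
    try {
        $doc = new DOMDocument();
        $doc->loadHTML($html);

        $xpath = new DOMXPath($doc);
        $target = $xpath->query('//*[@id="' . $id . '"]')->item(0);

        $fragment = $doc->createDocumentFragment();
        // appendXML() needs well-formed XML, so write <br/> rather than <br>.
        if ($target === null || !$fragment->appendXML($markup)) {
            return $html; // id not found, or the markup did not parse
        }

        $target->appendChild($fragment);

        return $doc->saveHTML() ?: $html;
    } finally {
        // Always discard collected warnings and restore the earlier setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

$page = '<html><body><div id="left_column"></div><div id="right_column"></div></body></html>';
echo insert_into_id($page, 'left_column', '<a href="http://www.google.com/">Google</a>');
```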

Link to comment
Share on other sites

Multiple sites with the same layout would be best achieved with a database-driven site.  Super simple setup, and you just plop in the content throughout the page wherever it needs to go.
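
To make the "plop in the content" idea concrete, here is a minimal sketch of a shared template with named slots. The {{name}} placeholder syntax and the helper name render_page are assumptions, and the array stands in for rows that would come from a database:

```php
<?php
// Hypothetical helper: fill {{name}} placeholders in a shared template.
function render_page(string $template, array $blocks): string
{
    $replacements = [];
    foreach ($blocks as $name => $content) {
        $replacements['{{' . $name . '}}'] = $content;
    }

    // strtr replaces every placeholder in a single pass.
    return strtr($template, $replacements);
}

$template = <<<'HTML'
<div id="container">
<div id="left_column">{{left_column}}</div>
<div id="right_column">{{right_column}}</div>
</div>
HTML;

// Stand-in for rows fetched from a database (assumed columns: name, content).
$blocks = [
    'left_column'  => '<a href="http://www.google.com/">Google</a>',
    'right_column' => '',
];

echo render_page($template, $blocks), "\n";
```

Each site would keep the same template and only swap the rows that feed it.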

 

You can even have your template hosted on one server and use include() to grab the template files for each site.  Updating those template files would be reflected across all sites.  You would need to ensure that "URL fopen wrappers" are enabled on the server(s) so you could do that, of course (allow_url_fopen, plus allow_url_include for include() to accept a URL).

 

I just think you're over-complicating things your way.

 

That's all I got though.  I've never done (or even thought of doing) what you're trying to do, so I cannot help you any further regarding that.

Link to comment
Share on other sites

lol alright, I will definitely look into using a database-driven site. Hopefully that is something that I can at least use for the time being, if not as my system outright. I haven't really dived into database-driven sites since I am pretty new to creating websites/PHP, but it does sound pretty interesting :P Thanks for the advice!

 

Even though I said that I would try a database-driven site, if there is someone who does know how to accomplish my original task, then I fully welcome that solution ;D

Link to comment
Share on other sites

Okay, so I think I am getting closer to my end goal! lol So far I have used preg_split to split the HTML into an array of tags, but I was wondering if there is a way to use regex to check an array for a substring and return the index where it was found?

 

I think once I have this, I would be able to accomplish what I wanted to do. Which is basically: split the HTML into an array, find the index of the div with the right id, go to the next index in the array and place all of the fields into it (it will be empty before the placement), then combine the array into a string or just echo out all of the parts. How does that sound?
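
The plan above could be sketched roughly as follows. preg_grep keeps the original array keys, so the key of the first match is the index being asked about. The helper name insert_via_split is made up, and the sketch assumes the id sits on a div with a quoted value:

```php
<?php
// Hypothetical helper following the plan above: split, find, insert, rejoin.
function insert_via_split(string $html, string $id, string $insert): string
{
    // Split on tags but keep them (DELIM_CAPTURE) and drop empty pieces (NO_EMPTY).
    $parts = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        return $html;
    }

    // Find the opening div with the wanted id; keys are preserved by preg_grep.
    $pattern = '/^<div\b[^>]*\sid\s*=\s*(["\'])' . preg_quote($id, '/') . '\1/i';
    $matches = preg_grep($pattern, $parts);
    if (!$matches) {
        return $html; // id not found: leave the markup unchanged
    }
    $index = array_key_first($matches);

    // Insert the new markup as its own element right after the opening tag.
    array_splice($parts, $index + 1, 0, [$insert]);

    return implode('', $parts);
}

echo insert_via_split(
    "<div id=\"left_column\">\n</div>",
    'left_column',
    '<a href="http://www.google.com/">Google</a>'
), "\n";
```

This inserts a new array element instead of overwriting the next one, so any text already inside the div is kept.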

Link to comment
Share on other sites

I was wondering if there was a way to use regex to check an array for a substring and return the index that it found?

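strpos does this for non-variable substrings, one string at a time.  For an array you would loop over the elements, or use preg_grep, which keeps the keys of the matching elements.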

 

I think once I have this, I would be able to accomplish what I wanted to do. Which is basically split the html into an array, find the index of the id tag of a div, go to the next index in the array and place all of the fields in to that, which will be empty before the placement, then combine the array into a string or just echo out all of the parts. How does that sound?

This really is the wrong way of going about things.  So I guess it sounds...kind of functional but scary and slow.  If it works, it works, we're not going to force you to rewrite it to be correct.

Link to comment
Share on other sites

haha, *sigh* I kind of figured that OTL. Well, each page isn't going to run this when someone goes to it; it would run just to save the personalized file, I guess. I definitely do not mind rewriting anything that I've been doing as long as it's more efficient or just an easier way to go about doing things, but I know what I'm asking for is definitely not an efficient way anyway, so it's more of an "I'll take what I can get" xD, even if it's a conceptual idea of how one might go about doing this.

Link to comment
Share on other sites
