Split up large html file based on html tags?


hannylicious

Hey guys,

 

I'm a total newbie here, and just about as new to PHP.

 

My issue:

I have a very large .html file that contains multiple articles (I actually have a few of these, but we'll start with one for practicality). The article titles are all wrapped in <h2> tags, and there are 10 articles in one file. The articles are very simple: just a title wrapped in <h2> and then a few paragraphs wrapped in <p> tags.

 

What I want to know how to do:

I want to know if there's a way to open that file and have each article saved as its own .html or .txt document (the title & following paragraphs of each article).

Ultimately taking my 1 large file, and creating the subsequent 10 smaller files from the articles inside of it.

 

I am having trouble explaining this in text so I'll try to illustrate:

I have "Articles.html" - which contains (article1,article2,article3.. ..article10)

I want to split "Articles.html" and create "Article1.html", "Article2.html", "Article3.html", etc.

 

Is that possible?  Or am I looking at something far more complex than I can imagine at this point - perhaps something I'd be better off doing by hand?

 

 

 

 

Ultimately I intend to stick all these articles into a database, but that's the 2nd part of what I want to do (and I think will be the easier of the tasks).

Let me know if you need any additional information in the event my description above is unclear... I simply am having issues figuring out how to separate out the text into individual articles.


Yes, you can use preg_match_all() to pull out the articles between the <h2> tags. Then you can use file_put_contents(), or fopen(), fwrite(), and fclose(), to create the files and write the articles into them.
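For anyone following along, here is a minimal sketch of that suggestion. The function names extract_titles() and save_numbered(), the "Article" file prefix, and the output directory are made up for illustration and are not from the thread.

```php
<?php
// Hypothetical helper: returns the text inside every <h2>...</h2> pair.
function extract_titles(string $html): array
{
    // 's' lets '.' match newlines; '(.+?)' is non-greedy, so each match
    // stops at the first closing </h2>.
    preg_match_all('#<h2>(.+?)</h2>#s', $html, $matches);
    return array_map('trim', $matches[1]);
}

// Hypothetical helper: writes each string to Article1.html, Article2.html, ...
// inside $dir and returns the paths that were written.
function save_numbered(array $items, string $dir, string $prefix = 'Article'): array
{
    $paths = [];
    foreach (array_values($items) as $i => $item) {
        $path = rtrim($dir, '/\\') . '/' . $prefix . ($i + 1) . '.html';
        // file_put_contents() creates the file, or overwrites an existing one.
        if (file_put_contents($path, $item) !== false) {
            $paths[] = $path;
        }
    }
    return $paths;
}
```

Something like save_numbered(extract_titles($html), $outputDir) would then write one file per title. Capturing the paragraphs that follow each title needs a slightly larger pattern.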

 

Alright, excellent.  I'll get to reading and see what I can hack up.

Thanks for the quick reply and good information!  I really appreciate it!


Well, as expected, I've run into some issues :)

 

I'm trying to take it one step at a time.  To begin I'd just like to have it display the titles of the articles on a page (so I can learn this stuff one step at a time).

 

My code looks like this:

 

<?php

echo "<b>Article 1:</b>";

$html = file_get_contents("Articles-lot-5-6.html");
$regexp = '#\<h2\>(.+?)\<\/h2\>#s';

if (preg_match_all($regexp, $html, $matches)) {
echo $matches[0];
}
else {
    "There were no article titles found";
}

?>

 

It gives me this output:

 

Article 1:Array

 

I'm sure I'm on the right track to this - but again, I'm really new to all this.

Any idea on how to move forward from here?

 

I've tried playing with the $regexp and changing it... I've had to do a lot of reading on regex as I don't know it very well at all. 

If I do:

$regexp = '/<h2>(.*?)<\/h2>/';

nothing displays except "Article 1:"

 

I apologize if my code is very simple and my errors very 'common sense' to some of you, I'm just trying to learn this stuff and get a handle on it one bit at a time.
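A likely explanation for the second pattern showing nothing, assuming the <h2> contents in the file span more than one line: without the s modifier, "." does not match a newline, so the pattern finds no matches. The else branch in the code above also has no echo, so the fallback message is never printed. The sample string below is made up for illustration.

```php
<?php
// Made-up sample: the heading text sits on its own line.
$html = "<h2>\nA title on its own line\n</h2>";

// Without 's', '.' stops at newlines, so this finds nothing.
$without = preg_match_all('/<h2>(.*?)<\/h2>/', $html, $a);

// With 's', '.' also matches newlines, so the heading is found.
$with = preg_match_all('/<h2>(.*?)<\/h2>/s', $html, $b);

echo "without s: $without match(es)\n"; // without s: 0 match(es)
echo "with s: $with match(es)\n";       // with s: 1 match(es)
```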


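echo can't print an array directly (it just outputs "Array"), so you can dump the whole array with print_r()

if (preg_match_all($regexp, $html, $matches)) {
    print_r($matches);
}

or you can pull the entries out manually by index

if (preg_match_all($regexp, $html, $matches)) {
    echo $matches[0][0]; // first full match, <h2> tags included
    echo $matches[0][1]; // second full match
    echo $matches[1][0]; // first captured title, tags excluded
}

etc...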
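To make the layout of $matches concrete, here is a small sketch with a made-up input string. With the default flags, $matches[0] holds the full matches and $matches[1] holds whatever the first parenthesised group captured.

```php
<?php
// Made-up sample input with two headings.
$html = "<h2>First title</h2><p>One.</p><h2>Second title</h2><p>Two.</p>";
$count = preg_match_all('#<h2>(.+?)</h2>#s', $html, $matches);

// $matches[0] holds the full matches, tags included.
echo $matches[0][0], "\n"; // <h2>First title</h2>

// $matches[1] holds what the (.+?) group captured, tags excluded.
echo $matches[1][0], "\n"; // First title
echo $matches[1][1], "\n"; // Second title

// echo on an array only prints "Array"; print_r() shows the contents.
print_r($matches[1]);
```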


Your suggestions once again are spot on!

 

Would it be possible to create a count based on the total number of matches and then have it loop through until the count = 0?

So it would be something like: count matches = 20, then every loop just have it 'count=count - 1' - then just shove that variable into the $matches[0][$count]?  Would that even work?

 

 

Or would it be better to just manually input the $matches[0][0],$matches[0][1],$matches[0][2],$matches[0][3], etc..?


After some toying around on my own, I now see that I can count the matches:

echo count($matches[0])." matches found";

 

I'll keep playing around and see if I can do as I described earlier. I wouldn't be anywhere near this solution if not for you, Fugix! Thanks again!
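The loop described earlier would work. A sketch of two common ways to write it follows, using a made-up $titles array in place of the real matches. Counting down also works, as long as the index starts at count - 1 rather than count.

```php
<?php
// Made-up sample standing in for $matches[1] from preg_match_all().
$titles = ['First title', 'Second title', 'Third title'];

// Counting up with for: valid indexes run from 0 to count - 1.
$total = count($titles);
for ($i = 0; $i < $total; $i++) {
    echo ($i + 1) . ' of ' . $total . ': ' . $titles[$i] . "\n";
}

// foreach visits every entry without managing a counter by hand.
$numbered = [];
foreach ($titles as $i => $title) {
    $numbered[] = ($i + 1) . '. ' . $title;
}
echo implode("\n", $numbered), "\n";
```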

 


Okay!

I figured it out a bit - the following code will display each article title followed by the article itself, in order, as it works its way through the file (except that the last article shows only its title, not its text).

 

Next up I'll start tinkering with writing all of these things to their own files and saving them as the article title!

 

Thanks again for all of your help Fugix!

I'm going to keep this thread open just a bit more because I'm sure I may have questions on writing the new files.

 

 

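<?php

echo "<b>Article List:</b>";

$html = file_get_contents("Articles-lot-5-6.html");
$regexptitle = '#\<h2\>(.+?)\<\/h2\>#s';
$regexpdata = '#\<\/h2\>(.+?)\<h2\>#s';

$count = 0;

if (preg_match_all($regexptitle, $html, $matches)) {

    // No & here: preg_match_all() already takes $matches2 by reference
    if (preg_match_all($regexpdata, $html, $matches2)) {
        while ($count < count($matches[0])) {
            echo $matches[0][$count];
            // The last article has no following <h2>, so it has no entry here
            if (isset($matches2[1][$count])) {
                echo $matches2[1][$count];
            }
            $count++; // increment after use so index 0 is not skipped
        }
    }
}
else {
    echo "There were no article titles found";
}

?>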
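On saving each file under its article title: titles can contain characters that are not allowed in file names on some systems (a "?" on Windows, for example), so some clean-up is usually needed first. A minimal sketch follows; the helper name title_to_filename() and the lowercase-with-hyphens scheme are assumptions for illustration only.

```php
<?php
// Hypothetical helper: turns an article title into a safe file name.
function title_to_filename(string $title, string $extension = 'html'): string
{
    // Drop any tags, surrounding whitespace, and uppercase letters.
    $name = strtolower(trim(strip_tags($title)));
    // Replace every run of characters other than a-z and 0-9 with one hyphen.
    $name = preg_replace('/[^a-z0-9]+/', '-', $name);
    $name = trim($name, '-');
    // Fall back to a placeholder if nothing usable is left.
    if ($name === '') {
        $name = 'untitled';
    }
    return $name . '.' . $extension;
}

echo title_to_filename('<h2>Tips For Buying Best Kitchen Cabinets</h2>'), "\n";
// tips-for-buying-best-kitchen-cabinets.html
```

Two different titles can collapse to the same name under this scheme, so a real script may want to append a number when a file already exists.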


I'm glad you figured it out!! I'll be here if you need anything else!


Well,

I'm back already (have I even left yet? ha!)

 

Anyhow, I have a mix of articles in this directory - some of the files contain 1 article (i.e. 1 match of <h2></h2> tags), and some of the files contain many matches. The code I had above should work for the single-article pages as well (I think).

 

For some reason, I can get it to list those files in the directory - but it will not open them and match the Titles/Article-data as it does with the files that contain multiple articles.

 

The code is as follows:

<?php
// Set the directory name
$dirname = "C:/xampp/htdocs/phpscripts/articles/";
// Open the articles directory (same path as $dirname, so the two cannot drift apart)
$dir = dir($dirname);

// List files
while (($file = $dir->read()) !== false) {

    if ($file != "." && $file != "..") {
        echo "filename: " . $file . "<br />";

        $html = file_get_contents($dirname . $file);

        $regexptitle = '#\<h2\>(.+?)\<\/h2\>#s';
        $regexpdata = '#\<\/h2\>(.+?)\<h2\>#s';

        $count = 0;

        if (preg_match_all($regexptitle, $html, $matches)) {

            // No & here: preg_match_all() already takes $matches2 by reference
            if (preg_match_all($regexpdata, $html, $matches2)) {

                while ($count < count($matches[0])) {
                    echo $matches[0][$count];
                    /*echo $matches2[0][$count];*/
                    $count++;
                }

            }

        }
        else {
            echo "There were no article titles found";
        }

    }
}

$dir->close();
?>

 

 

 

Ultimately I am trying to get it to list the name of the file, then display the contents of it from the results of the preg_match_all() - the only ones this does not work for are the files that only contain 1 article inside them.

 

Any idea as to why not??  I have a feeling it's probably something simple and syntax related, but I can't see it...

 

The output as is looks something like this:

 

filename: Article5.html
filename: Article6.html
filename: Article7.html
filename: Article8.html
filename: Article9.html
filename: Articles-lot-5-6.html
/*Below are the titles of the 'Articles-lot-5-6.html' file*/
Remodel your kitchen with brand new cabinets
Shopping for kitchen cabinets is a serious task indeed
Tips For Buying Best Kitchen Cabinets
Tips to organize your kitchen cabinet in the best possible manner
Renovate Your Kitchen with Kitchen Cabinets
Which type wood to choose for your kitchen cabinets? 
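As an aside on the directory loop above, glob() is a common alternative to dir()/read() when only certain files are wanted: it skips "." and ".." and filters by extension in one call. A minimal sketch follows; the helper name list_html_files() is made up for illustration.

```php
<?php
// Hypothetical helper: returns the paths of the .html files in $dir.
function list_html_files(string $dir): array
{
    // glob() returns matching paths (sorted by default), or false on error.
    $files = glob(rtrim($dir, '/\\') . '/*.html');
    if ($files === false) {
        return [];
    }
    // Keep regular files only, in case a directory is named like "x.html".
    return array_values(array_filter($files, 'is_file'));
}
```

Each returned path can be passed straight to file_get_contents(), so there is no need to join the directory name and file name by hand.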


Figured it out - it was a simple logic error...

The 2nd regexp looks for data between the </h2> and the next <h2>, but if the file only has 1 set of <h2></h2> tags there won't be another <h2> tag to capture data between!

 

Back to the drawing board :)
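One way around this, offered as a sketch and not as the fix the poster ended up with (the thread never shows the final script): capture the title and body in a single pattern, and let the body run until either the next <h2> or the end of the input. The function name extract_articles() is made up for illustration.

```php
<?php
// Hypothetical helper: returns one ['title' => ..., 'body' => ...] per article.
function extract_articles(string $html): array
{
    // (.*?) grabs the body lazily; the lookahead (?=<h2>|\z) stops it at the
    // next <h2> or at the very end of the string, whichever comes first.
    $pattern = '#<h2>(.+?)</h2>(.*?)(?=<h2>|\z)#s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $articles = [];
    foreach ($matches as $m) {
        // With PREG_SET_ORDER each $m is [full match, title, body].
        $articles[] = ['title' => trim($m[1]), 'body' => trim($m[2])];
    }
    return $articles;
}
```

Because the lookahead does not consume the next <h2>, a file with a single article yields one entry, and the last article in a multi-article file keeps its body.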

glad you figured that one out!!  ;D


Long story short - I finished up the script and it works perfectly!

Wouldn't have got this done without your help - thanks again Fugix!

Glad I could help you out. If you ever have any more issues, we will be here to help!
