Need help parsing / extracting links from log files

Mcod · February 11, 2012

I am looking for some help with extracting links from log files, as doing this manually (which is what I do right now) is a pain. I have some log files that I need to check for ERROR messages, copying the URLs I find into another text file.

My log file format looks like this:

INFO  <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
        at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO  <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>

Now what I am after is a piece of code that saves the http://domain.com/ part to a text file IF the line starts with ERROR. There are many different error reasons, so the strings differ at the start and at the end. Maybe you know a way to open a log file, look for the word ERROR at the beginning of a line and, if it is there, either save the whole line to another text file or, if possible, just the domain part (which would be even better).

If possible, please post a fully functional code block, as I am extremely bad with anything that has to do with regex, opening and closing files etc.

Your help would be greatly appreciated

I attached a sample log file to this post in case it helps (same as the lines above)

17560_.txt

salathe · February 11, 2012

I'm confused, are you looking for help or someone to do the work for you? (It's fine either way, but dictates the responses you'll get here.)

Mcod · February 11, 2012

Hi salathe,

Given the fact that I really have no clue about how to even get started with this, I would love to see a complete solution, maybe with some comments on "why this will work best", so I can learn from it when I need to complete similar tasks. So yes, I am looking for a "complete solution" rather than pointers, as I am currently doing this all by hand (about 100000 lines per day), so it would save me a lot of time.

litebearer · February 11, 2012

A very rough hack, tested using the data you supplied. I am SURE there is a more efficient/elegant way; however, this worked...

<?PHP
/* create a test log */
$myfile = "mytest.log";
$contents = "INFO  <11 Feb 2012 00:00:23,822> <index> <D2> <Processing URL : http://www.domain1.com/>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Indexed: http://www.domain2.com/> <Time:146 msecs>
INFO  <11 Feb 2012 00:00:23,842> <index> <D4> <Processing URL : http://www.domain3.com/>
ERROR <11 Feb 2012 00:00:23,924> <index> <D1> <http://www.domain4.org/operas/2003-2004/mourning/composer.aspx: >
org.apache.commons.httpclient.HttpRecoverableException: org.apache.commons.httpclient.HttpRecoverableException: Error in parsing
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1965)
        at org.apache.commons.httpclient.HttpMethodBase.processRequest(HttpMethodBase.java:2659)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1093)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:674)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain6.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:23,968> <index> <D5> <Indexed: http://domain7.com/~cdobie/kearnsindex.htm>
INFO  <11 Feb 2012 00:00:32,988> <index> <D1> <Processing URL : http://www.domain8.com/>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Indexed: http://www.domain9.com/> <Time:128 msecs>
INFO  <11 Feb 2012 00:00:33,072> <index> <D5> <Processing URL : http://www.domain10.com/>
ERROR <11 Feb 2012 00:00:33,116> <index> <D2> <http://www.domain11.com/: Connection timeout>
org.apache.commons.httpclient.HttpConnection$ConnectionTimeoutException
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:736)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:661)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:529)
        at com.searchblox.scanner.http.HTTPScanner.b(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.scan(Unknown Source)
        at com.searchblox.scanner.http.HTTPScanner.work(Unknown Source)
        at com.searchblox.scanner.Scanner.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
INFO  <11 Feb 2012 00:00:33,154> <index> <D1> <Indexing http://www.domain12.com/ ...>
INFO  <11 Feb 2012 00:00:33,159> <index> <D1> <http://www.domain13.com/ - Last-Modified date: Sat Feb 11 00:00:33 CET 2012>
ERROR <11 Feb 2012 00:00:33,207> <index> <D6> <http://www.domain14.com/: Connection timeout>
";


file_put_contents($myfile, $contents);

/* FROM HERE FORWARD IS WHERE YOU WILL USE YOUR REAL DATA */
/* read the log file into an array, one line per element (the log itself is left untouched) */
$lines = file($myfile, FILE_IGNORE_NEW_LINES);

/* the collected URLs go here - must exist before we append to it */
$new_content = "";

/* loop thru the lines - grabbing the URL from only those lines that START with ERROR */
foreach ($lines as $line) {
	/* strpos() returns 0 when ERROR sits at the very start of the line */
	if (strpos($line, "ERROR") !== 0) {
		continue;
	}
	/* the URL begins right after the "<" in "<http" ... */
	$start = strpos($line, "<http");
	if ($start === false) {
		continue;
	}
	$rest = substr($line, $start + 1);
	/* ... and ends at the ": " that introduces the error reason */
	$end = strpos($rest, ": ");
	if ($end !== false) {
		$new_content .= substr($rest, 0, $end) . "\n";
	}
}
echo nl2br($new_content);
/* save the data to a new file */
$new_file = "test_log_" . time() . ".txt";
file_put_contents($new_file, $new_content);
?>

end output here http://www.nstoia.com/logtest.php
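
For a log of around 100000 lines a day, a streaming variant might be worth a look. This is only a sketch, assuming every ERROR line keeps the "<URL: reason>" shape from the sample log; extract_error_urls() is a made-up helper name, not something from the code above, and the regex is one possible pattern rather than the only way to match these lines.

```php
<?php
// Sketch: read a log line by line and collect the URL from every line
// that starts with ERROR. The function name is hypothetical.
function extract_error_urls(string $logPath): array
{
    $urls = [];
    if (!is_readable($logPath)) {
        return $urls; // nothing to do if the file is missing or unreadable
    }
    $handle = fopen($logPath, 'r');
    if ($handle === false) {
        return $urls;
    }
    // fgets() hands back one line at a time, so the whole log never
    // has to sit in memory at once.
    while (($line = fgets($handle)) !== false) {
        // ^ERROR pins the match to the start of the line. The capture group
        // takes the URL after "<" and stops before the ": " that introduces
        // the error reason (or before ">" if there is no reason).
        if (preg_match('~^ERROR\b.*<(https?://[^\s>]+?)(?::\s|>)~', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    fclose($handle);
    return $urls;
}
```

To save the result, something like file_put_contents("error_urls.txt", implode("\n", extract_error_urls("mytest.log")) . "\n"); should do, where error_urls.txt is just a placeholder name. If only the host is wanted, parse_url($url, PHP_URL_HOST) can be applied to each entry.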
