Important alert: (current site time 7/15/2013 9:24:43 PM EDT)
 

VB icon

HTML to Excel conversion

Email
Submitted on: 1/7/2003 5:54:11 AM
By: BigCalm  
Level: Intermediate
User Rating: By 1 Users
Compatibility: 5.0 (all versions)
Views: 62850
author picture
(About the author)
 
     Script to convert HTML to Excel spreadsheet, without ever touching Excel. Requires modules HTML::Parser and Spreadsheet::WriteExcel (*updated*)
 
code:
Can't Copy and Paste this?
Click here for a copy-and-paste friendly version of this code!
 
Terms of Agreement:   
By using this code, you agree to the following terms...   
  1. You may use this code in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.
  2. You MAY NOT redistribute this code (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.   
  3. You may link to this code from another website, but ONLY if it is not wrapped in a frame. 
  4. You will abide by any additional copyright restrictions which the author may have placed in the code or code's description.
				
=**************************************
= Name: HTML to Excel conversion
= Description:Script to convert HTML to Excel spreadsheet, without ever touching Excel. Requires modules HTML::Parser and Spreadsheet::WriteExcel (*updated*)
= By: BigCalm
=
=This code is copyrighted and has= limited warranties.Please see http://www.Planet-Source-Code.com/vb/scripts/ShowCode.asp?txtCodeId=398&lngWId=6=for details.=**************************************

#!/usr/bin/perl -w
# Converts a HTML file into Excel format. (badly)
# Author: Jonathan Daniel
# Version: 1.0
# Date: 30/08/02
# Description:
#Takes a html file, and converts it into an excel file.
#	Designed for where I have good control over the html file being produced,
#	and if your data is not organised into tables (or the tables are nested) 
#	then it'll produce rather unsatisfactory results.
#
# Version 1.1, 07/05/03
# "use strict".
# Occasionally HTML::Parser would decide it was only going to spit out half 
# of the detail between <td> tags, so I've had to buffer to prevent data loss.
# Now recognises the th tag, and handles hr better.
# Space stripping on number columns improved.
# 
# Version 1.2, 
# Auto-fit of columns attempted with reasonable success.
#
# Version 1.3, 20/11/03
# Improvement of buffering options when not in table. Dates now properly 
# converted, assuming non-US date format. To change to US-dates changes this
# line: xl_parse_date_init("TZ=GMT","DateFormat=non-US");
use strict;
use HTML::Parser 3.26 ();
use Spreadsheet::WriteExcel;
use Spreadsheet::WriteExcel::Utility;
# define variables:
my %inside;
my @colarray;
@colarray = (0) x 30;
my $inputfile = "";
my $outputfile = "";
my $maxcol = 0;
# Parse arguments
if(@ARGV < 4)
{
	print "Usage: perl -w html2xls.pl -o <HTML File> -o <Excel File>\n";
	exit(1);
}
foreach $a (@ARGV)
{
	if ($a =~ /^-o/)
	{
		shift @ARGV;
		$inputfile = $ARGV[0];
		shift @ARGV;
		shift @ARGV;
		$outputfile = $ARGV[0];
	} 
	else 
	{
		last;
	}
}
# Define variables
# Setup spreadsheet
my $workbook = Spreadsheet::WriteExcel->new($outputfile);
my $prevtext = "";
my $buffertext = "";
my $worksheet = $workbook->addworksheet();
$worksheet->hide_gridlines(2);
# Format1 = Bold text
my $format1 = $workbook->addformat();
$format1->set_bold(1);
$format1->set_color('black');
# Format2 = Title Text
my $format2 = $workbook->addformat();
$format2->set_bold(1);
$format2->set_underline;
$format2->set_color('blue');
$format2->set_size(16);
# Format3 = Title Text
my $format3 = $workbook->addformat();
$format3->set_bold(1);
$format3->set_color('blue');
$format3->set_size(14);
# Format4 = Background colour of green
#$format4 = $workbook->addformat();
#$format4->set_color('black');
#$format4->set_fg_color(42);
#$format4->set_pattern(1);
# Format 5 = Integer format
my $format5 = $workbook->addformat();
$format5->set_color('black');
$format5->set_num_format('0');
# Format 6 = Float format
my $format6 = $workbook->addformat();
$format6->set_color('black');
$format6->set_num_format('0.00');
# Format 7 = Date Format
my $format7 = $workbook->addformat();
$format7->set_color('black');
$format7->set_num_format('dd/mm/yy');
# Format 8 = Bold Integers
my $format8 = $workbook->addformat();
$format8->set_color('black');
$format8->set_bold(1);
$format8->set_num_format('0');
# Format 9 = Bold Float format
my $format9 = $workbook->addformat();
$format9->set_color('black');
$format9->set_bold(1);
$format9->set_num_format('0.00');
# Format 10 = Bold Date Format;
my $format10 = $workbook->addformat();
$format10->set_color('black');
$format10->set_bold(1);
$format10->set_num_format('dd/mm/yy');
# Format 11 - for horizontal lines
my $format11 = $workbook->addformat();
$format11->set_color('black');
$format11->set_bottom();
$format11->set_bottom_color('black');
# Format 12 = Date Format
my $format12 = $workbook->addformat();
$format12->set_color('black');
$format12->set_num_format('dd/mm/yyyy');
xl_parse_date_init("TZ=GMT","DateFormat=non-US");
my $intable = 0;
my $row = 0;
my $col = 0;
my $tdcheck = 1;
my $fieldtype = 0;
HTML::Parser->new(api_version => 3,
		 handlers=> [start => [\&tag, "tagname,'+1',attr"],
				 end=> [\&tag, "tagname, '-1'"],
				 text => [\&text, "dtext"],
				 ],
		 marked_sections => 1,
	)->parse_file($inputfile) || die "Can't open file: $!\n";;
# Attempt to auto-fit
my $i = 0;
my $j = 0;
for ($i = 0;$i<$maxcol;$i++)
{
	$j += $colarray[$i];
	$worksheet->set_column($i,$i,$colarray[$i]);
}
if($j > 100)
{
	$worksheet->set_landscape();
}
else
{
	$worksheet->set_portrait();
}
$worksheet->set_paper(9);
$worksheet->fit_to_pages(1, 0);
sub tag
{
my($tag, $num, $attr) = @_;
$inside{$tag} += $num;
if(@_ eq "table")
{
$intable++;
}
if(@_ eq "/table")
{
$intable--;
}
if($intable == 0)
{
 $buffertext="";
}
if($tag eq "td")
{
	if($tdcheck == 0)
	{
		$col++;
		if( $col > $maxcol)
		{
			$maxcol = $col;
		}
		$tdcheck++;
	}
	else
	{
		$tdcheck--;
	}
}
if($tag eq "br")
{
	$row++;
$col=0;
}
if($tag eq "h1" && $num == -1 )
{
	$row+=2;
$col=0;
}
if($tag eq "h2" && $num == -1 )
{
	$row+=2;
$col=0;
}
if($tag eq "h3" && $num == -1 )
{
	$row+=2;
$col=0;
}
if($tag eq "h4" && $num == -1 )
{
	$row+=2;
$col=0;
}
if($tag eq "tr")
{
	if($col >= 1)
	{
		$row++;
		$col = 0;
	}
}
if($tag eq "th")
{
	if($col >= 1)
	{
		$row++;
		$col = 0;
	}
}
if($tag eq "hr")
{
	$row++;
$col = 0;
# apply changes to column using format11
	$worksheet->set_row($row, undef, $format11);
$row++;
}
if($tag eq "b")
{
	return;
}
if($tag eq "td")
{
	$prevtext = "";
	if($attr){
		 $fieldtype = 0;
		 if(exists($attr->{style}))
		 {
			 if($attr->{style} eq "vnd.ms-excel.numberformat:@")
			 {
				$fieldtype = 1;
			 }
			 if($attr->{style} eq "vnd.ms-excel.numberformat:0")
			 {
				$fieldtype = 2;
			 }
			 if($attr->{style} eq "vnd.ms-excel.numberformat:0.00")
			 {
				$fieldtype = 3;
			 }
			 if($attr->{style} eq "vnd.ms-excel.numberformat:dd/mm/yy")
			 {
				$fieldtype = 4;
			 }
			 if($attr->{style} eq "vnd.ms-excel.numberformat:dd/mm/yyyy")
			 {
				$fieldtype = 5;
			 }
		 }
	}
	}
}
sub text
{
return if(substr($_[0],0,9) eq "<!DOCTYPE");
return if $inside{script} || $inside{style} || $inside{title};
	return if ($_[0] !~ /\S/);
	my $subs = "";
	my $test = 0;
	my $test2 = 0;
my $date1 = 0;
if ( $inside{table} )
{
	$prevtext = $prevtext . $_[0];
		if( $inside{b} )
		{
			if($fieldtype == 0)
			{
				$worksheet->write($row,$col,$prevtext, $format1);
			}
			if($fieldtype == 1)
			{
				$worksheet->write($row,$col,$prevtext, $format1);
			}
			if($fieldtype == 2)
			{
				$prevtext =~ s/ //g;
				$worksheet->write($row,$col,$prevtext, $format8);
			}
			if($fieldtype == 3)
			{
				$prevtext =~ s/ //g;
				$worksheet->write($row,$col,$prevtext,$format9);
			}
			if($fieldtype == 4)
			{
				$date1 = xl_parse_date($prevtext);
				$worksheet->write($row,$col,$date1,$format10);
			}
			if($fieldtype == 5)
			{
				$date1 = xl_parse_date($prevtext);
				$worksheet->write($row,$col,$date1,$format12);
			}
			if ($colarray[$col] < length($prevtext))
			{
				$colarray[$col] = length($prevtext);
			}
		}
		else
		{
		if($fieldtype == 1)
		{
				$test = $worksheet->write_string($row,$col,$prevtext);
				if($test == -3)
				{
					print "string too long - ";
					$test = $worksheet->write($row,$col,long_string($prevtext));
					print $test . " " . long_string($prevtext);
				}
		}
			if($fieldtype == 0)
		{
				$worksheet->write($row,$col,$prevtext);
			}
		if($fieldtype == 2)
			{
				$prevtext =~ s/ //g;
				$worksheet->write($row,$col,$prevtext,$format5);
			}
		if($fieldtype == 3)
			{
				$prevtext =~ s/ //g;
				$worksheet->write($row,$col,$prevtext,$format6);
			}
		if($fieldtype == 4)
			{
				$prevtext =~ s/ //g;
				$date1 = xl_parse_date($prevtext);
				$worksheet->write($row,$col,$date1,$format7);
			}
		if($fieldtype == 5)
			{
				$prevtext =~ s/ //g;
				$date1 = xl_parse_date($prevtext);
				$worksheet->write($row,$col,$date1,$format12);
			}
			if ($colarray[$col] < length($prevtext))
			{
				$colarray[$col] = length($prevtext);
			}
		}
}
else
{
		$buffertext = $buffertext . $_[0];
		$buffertext =~ s/\n//g;
		$subs = $buffertext;
		if( $inside{h1} || $inside{h2} || $inside{h3} || $inside{h4} )
		{
			if( $inside{h1} )
			{
				$col = 0;
				$worksheet->write($row,$col,$subs, $format2);
				#$row+=2;
			}
			if( $inside{h2} )
			{
				$col=0;
				$worksheet->write($row,$col,$subs, $format3);
				#$row+=2;
			}
			if( $inside{h3} || $inside{h4} )
			{
				$col=0;
				$worksheet->write($row,$col,$subs);
				#$row+=2;
			}
		}
		else
		{
			$col = 0;
			$worksheet->write($row,$col,$subs);
			#$row++;
		}
}
}
######################################################################
#
# long_string($str)
#
# Converts long strings into an Excel string concatenation formula.
# The concatenation is inserted between words to improve legibility.
#
# returns: An Excel formula if string is longer than 255 chars.
# The unmodified string otherwise.
#
sub long_string {
my $str= shift;
my $limit = 255;
# Return short strings
return $str if length $str <= $limit;
# Split the line at word boundaries where possible
my @segments = $str =~ m[.{1,$limit}$|.{1,$limit}\b|.{1,$limit}]sog;
# Join the string back together with quotes and Excel concatenation
$str = join '"&"', @segments;
# Add formatting to convert the string to a formula string
return $str = qq(="$str");
}


Report Bad Submission
Use this form to tell us if this entry should be deleted (i.e contains no code, is a virus, etc.).
This submission should be removed because:

Your Vote

What do you think of this code (in the Intermediate category)?
(The code with your highest vote will win this month's coding contest!)
Excellent  Good  Average  Below Average  Poor (See voting log ...)
 

Other User Comments

1/21/2003 1:56:11 PM

can you guide me to write a program for text to excel conversion... email address is hmehta@usc.edu
(If this comment was disrespectful, please report it.)

 
5/7/2003 5:55:31 AMBigCalm

updated.
(If this comment was disrespectful, please report it.)

 
6/12/2003 11:49:23 AMBigCalm

Now attempts auto-fit of columns
(If this comment was disrespectful, please report it.)

 
11/20/2003 6:30:59 AMBigCalm

Improvement of date-conversions (non-US format in the source, but easily changed), and improved buffering of output from the html::parser module. Now requires a couple of additional modules Date::Calc and Date::Manip but these should be fairly standard anyway.
(If this comment was disrespectful, please report it.)

 
3/6/2008 5:14:48 AMProblem with code..Help

Hi BigCalm, I have copy pasted this code in AcvitestatePerl and executed but i dont know where it ask for html input and where it creates ouput excel file..it only shows print statement "Usage: perl -w html2xls.pl -o -o \n" on command prompt..no error nothing... I have given html path to my $inputfile = "c:\abc.htm"; and output excel my $outputfile = "c:\abc.xls"; that is file to be created. But it does not generate excel file..i have search it but no output ever..Please help me..its urgent.. Thanks in advance..eagerly waiting for ur reply...
(If this comment was disrespectful, please report it.)

 
3/11/2008 12:40:58 AMProblem with code..Help

Hi Bigcalm..

I have tried the code...It worked fine with few changes ...but can you help me in one thing...this code only takes one html file and converts it into xls file..I have many such html file in one folder..May you plz guide me how to process all the html files in that folder to one xls file....

Please help

Thanks in Advance..
(If this comment was disrespectful, please report it.)

 

Add Your Feedback
Your feedback will be posted below and an email sent to the author. Please remember that the author was kind enough to share this with you, so any criticisms must be stated politely, or they will be deleted. (For feedback not related to this particular code, please click here instead.)
 

To post feedback, first please login.