An Accurate Word Count of a Basic Text Document

Submitted on: 1/22/2004 4:42:20 PM
By: Bill Platt  
Level: Beginner
Compatibility: 5.0 (all versions)
     In my case, I needed to count the number of words in every article on my website. However you use it, this code lets you pass a text string to a subroutine that returns the number of words in the string. The text may contain special characters, including line feeds, and the subroutine will still count the words. The word count is placed in the variable $word_count, which you can then use in your own scripts. Although not required, a link to thePhantomWriters.com is requested when this code is used.

 
 
Terms of Agreement:   
By using this article, you agree to the following terms...   
  1. You may use this article in your own programs (and may compile it into a program and distribute it in compiled format for languages that allow it) freely and with no charge.
  2. You MAY NOT redistribute this article (for example to a web site) without written permission from the original author. Failure to do so is a violation of copyright laws.   
  3. You may link to this article from another website, but ONLY if it is not wrapped in a frame. 
  4. You will abide by any additional copyright restrictions which the author may have placed in the article or article's description.
				

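# Call the subroutine Count_Words from your own script:

Count_Words($article_text);

# And here is the actual subroutine:

sub Count_Words
{
my $word_calc = $_[0];
$word_calc =~ s/\n/ /g; # Turn line feeds into spaces
$word_calc =~ s/  / /g;
  # Must contain two spaces between first two slashes
$word_calc =~ s/  / /g; # Must contain two spaces
$word_calc =~ s/ /X/g; # Mark each space with a placeholder X
$word_calc =~ s/[^a-zA-Z0-9_\.]//g; # Strip everything but letters, digits, _ and .
$word_calc =~ s/X/ /g; # Turn every X back into a space (this also hits any capital X in the text)
$word_calc =~ s/  / /g; # Must contain two spaces
$word_calc =~ s/  / /g; # Must contain two spaces
$word_calc =~ s/^ +//; # Drop leading spaces so split returns no empty first word
my @word_cnt = split(/ /, $word_calc);
$word_count = @word_cnt; # Global variable: holds the count, which is also the return value
}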
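
For comparison, here is a minimal sketch of a shorter routine, assuming a word is any run of non-whitespace characters. It is not the routine above: the name count_words_simple is hypothetical, and a stand-alone punctuation mark such as a dash counts as a word under this definition.

```perl
use strict;
use warnings;

# Hypothetical alternative: count whitespace-separated tokens.
sub count_words_simple {
    my ($text) = @_;
    return 0 unless defined $text;
    # split ' ' (a single-space string) splits on any run of whitespace,
    # including line feeds, and skips leading whitespace.
    my @words = split ' ', $text;
    return scalar @words;
}

my $article_text = " The quick brown fox\njumps over the lazy dog.\n";
my $word_count   = count_words_simple($article_text);
print "$word_count\n";   # prints 9
```

Returning the count and storing it in a lexical variable keeps the routine usable under "use strict", at the cost of differing from the global $word_count used in the article.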



Other User Comments
1/25/2004 1:30:37 AM

I see a problem in the code in these lines:
$word_calc =~ s/ /X/g;
$word_calc =~ s/[^a-zA-Z0-9_\.]//g;
$word_calc =~ s/X/ /g;

What if a word has X in it?
Also, what about other punctuation such as , : ; etc.?




 
1/25/2004 3:11:53 AM Bill Platt

I used the "X" because it is rarely used in English words. It is possible that an "X" in the text will cause a word to be split incorrectly into two words, but the chance of that is slim, since we are using the capital "X" rather than the lowercase "x". As I understand it, Perl treats the two as separate characters.

The special characters you speak of can be trimmed since they do not affect the actual word count.

The reason I convert the spaces to capital X's, trim out the special characters, and then switch the capital X's back to spaces is that I had to remove the extra spaces and carriage returns from the variable. Before I added the X conversions, extra spaces at the end of a line (or the lack of them) and the \n characters skewed the count. I had to convert the input to a string with only one space between each word in order to get an accurate count.


 
1/28/2004 11:29:22 AM

Rather than two repetitive lines like this one:
$word_calc =~ s/  / /g;
# Must contain two spaces between first two slashes
you could use:
$word_calc =~ s/ +/ /g;

to replace one or more spaces with a single space.
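
The difference between the two approaches shows up on longer runs of spaces. The following is a small sketch, not part of the original routine; the variable names are made up for the illustration, and the spaces are built with the x operator so that their number is explicit.

```perl
use strict;
use warnings;

# "one" and "two" separated by five spaces.
my $text = "one" . (" " x 5) . "two";

# Two fixed passes, each replacing a pair of spaces with one space.
my $pair = " " x 2;
(my $two_pass = $text) =~ s/$pair/ /g;
$two_pass =~ s/$pair/ /g;

# One pass that replaces any run of spaces with one space.
(my $one_pass = $text) =~ s/ +/ /g;

my @a = split / /, $two_pass;   # an empty field survives between the words
my @b = split / /, $one_pass;

print scalar(@a), "\n";   # prints 3
print scalar(@b), "\n";   # prints 2
```

Five spaces shrink to three after the first fixed pass and to two after the second, so the split still sees an empty field and the count is one too high.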


 
1/31/2004 12:06:18 AM Aaron L. Anderson

I agree with the last poster. There is a bit of redundant code in here; you can easily change a few regexes and improve the script. You shouldn't break on X. What would be the point? It's a normal a-zA-Z character, and you shouldn't filter ANY of these out if you want an accurate count.

 
8/17/2004 3:01:53 PM

Correct me if I am wrong, but couldn't you just "split" it into an array and then get the count of the array, instead of going through all that?

@whatever = split(/[ .,;:\'\"\\\/\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\+\[\]\{\}\?]+/,$working_text);
$word_count = @whatever;

 
8/17/2004 3:05:28 PM

That last post should read:

@whatever = split(/[\s.,;:\'\"\\\/\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\+\[\]\{\}\?]+/,$working_text);
$word_count = @whatever;
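
One thing to watch with a split like this: when the text begins with a separator (an opening quote, say), split returns an empty first field, which adds one to the count. Below is a minimal sketch of one way around that. It swaps the hand-written list for the POSIX class [:punct:], which is my substitution and not the poster's exact set, and it filters out empty fields with grep.

```perl
use strict;
use warnings;

my $working_text = "\"Hello, world!\" she said; then (quietly) left.";

# Split on runs of whitespace and ASCII punctuation, then drop empty fields
# (the leading double quote would otherwise produce an empty first field).
my @whatever   = grep { length } split /[\s[:punct:]]+/, $working_text;
my $word_count = @whatever;

print "$word_count\n";   # prints 7
```

As with the poster's list, an apostrophe is a separator here, so a contraction such as "don't" counts as two words.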

 
8/17/2004 3:08:15 PM Bill Platt

I am sure that all of the suggestions offered so far are correct. I am still learning the Perl language. It worked for me and I was happy. Keep the suggestions coming so that I can learn more about how to do this best.

 
8/17/2004 7:33:19 PM

Paraphrase of original textual comment: The most powerful part of good languages is the fact that you have versatility to optimize your programs for performance. Larger programs go faster because the code is written straight out. Smaller programs try to cram everything into one line causing the processor to work extra hard, because it has to go here and do this and during that go here and do this. Small functional programs tend to do this more. Large scale projects generally go for faster code.

 
9/30/2005 7:11:05 PM Matthew van Eerde

Here's a one-liner:
$word_count = ($working_text =~ s/\b(\w)/$1/);

 
9/30/2005 7:11:39 PM Matthew van Eerde

Er, should be
$word_count = ($working_text =~ s/\b(\w)/$1/g);
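
The substitution counts words because s///g returns the number of replacements it made (an empty string if there were none), with each replacement sitting at the start of a word. It does rewrite $working_text along the way, though with identical text. A sketch of a match-only variant that counts the same word starts without touching the string, offered as an illustration and not as the poster's code:

```perl
use strict;
use warnings;

my $working_text = "Counting words\nwith a match, not a substitution.";

# Assigning the list of matches to an empty list, in scalar context,
# yields the number of matches: one per word start.
my $word_count = () = $working_text =~ /\b\w/g;

print "$word_count\n";   # prints 8
```

Unlike the substitution, this form gives 0 for a string with no words.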

 
11/11/2006 9:03:36 AM angel

Hello, I'm just a Perl beginner and I need help. I don't know what code to write to count the number of punctuation characters used within a paragraph.

I also don't know what code to write to count lines, words and characters.

Can anybody please help me? I would really appreciate it.

 
