How to Extract Text from PDF Documents Based on Columns inside .NET Apps

Submitted on: 1/27/2016 10:00:26 AM
By: Sherazam  
Level: Intermediate
User Rating: Unrated
Compatibility: C#, VB.NET
Views: 2781
 
This technical tip explains how to extract text from multi-column PDF documents inside .NET applications. A PDF file may contain text, images, annotations, attachments, graphs, and other elements, and Aspose.Pdf for .NET can add and manipulate all of them, including adding and extracting text. When a PDF document is laid out in more than one column and the page contents must be extracted with that layout preserved, Aspose.Pdf for .NET offers two approaches.

The first approach is to reduce the font size of the contents inside the PDF document and then perform text extraction. The first pair of code samples below shows the steps.

The second approach uses ScaleFactor. Several improvements have been made to TextAbsorber and to the internal text formatting mechanism, so when extracting text in ‘Pure’ mode you can now specify the ScaleFactor option. The scale factor adjusts the grid that the internal text formatting mechanism uses during text extraction. Specifying a ScaleFactor value between 1 and 0.1 (including 0.1) has the same effect as reducing the font size.
//The following code snippet shows the steps to reduce the text size and then extract text from the PDF document.
//[C# Code Sample]
 
// Requires: using System.IO; using Aspose.Pdf; using Aspose.Pdf.Text;
string path = "D:\\Temp\\";
// InitLicense() is not defined in this article; presumably it is the author's helper that applies the Aspose license
InitLicense();
// open the source document
Document pdfDocument = new Document(path + "net_New-age NED's.pdf");
// collect every text fragment in the document
TextFragmentAbsorber tfa = new TextFragmentAbsorber();
pdfDocument.Pages.Accept(tfa);
TextFragmentCollection tfc = tfa.TextFragments;
foreach (TextFragment tf in tfc)
{
    // reduce the font size to 70% of its original value (or smaller)
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7f;
}
// save the PDF with the reduced font size to a temporary stream and reload it
Stream st = new MemoryStream();
pdfDocument.Save(st);
pdfDocument = new Document(st);
// extract the text from the reloaded document
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;
System.IO.File.WriteAllText(path + "Extracted.txt", extractedText);
// [VB.NET Code Sample]
 
' Requires: Imports System.IO and Imports Aspose.Pdf
' VB.NET string literals do not use backslash escapes, so single backslashes are correct here
Dim path As String = "D:\Temp\"
' instantiate Document object
Dim pdfDocument As Document = New Document(path + "net_New-age NED's.pdf")
' collect every text fragment in the document
Dim tfa As Aspose.Pdf.Text.TextFragmentAbsorber = New Aspose.Pdf.Text.TextFragmentAbsorber()
pdfDocument.Pages.Accept(tfa)
Dim tfc As Aspose.Pdf.Text.TextFragmentCollection = tfa.TextFragments
For Each tf As Aspose.Pdf.Text.TextFragment In tfc
    ' reduce the font size to 70% of its original value (or smaller)
    tf.TextState.FontSize = tf.TextState.FontSize * 0.7F
Next
' create temporary stream object
Dim st As Stream = New MemoryStream()
' save PDF file with reduced font size
pdfDocument.Save(st)
' instantiate Document object with stream instance
pdfDocument = New Document(st)
' extract the text from the reloaded document
Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text
System.IO.File.WriteAllText(path + "Extracted.txt", extractedText)
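
The font-reduction steps above could also be wrapped in a reusable method. The following is a minimal sketch, not code from Aspose: the class name ColumnTextExtractor, the method ExtractWithReducedFont, and the shrinkRatio parameter are hypothetical names introduced here. It assumes the Aspose.Pdf for .NET package is referenced and that Document can be disposed with a using statement; apart from that it only uses the Aspose calls already shown in the samples above.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ColumnTextExtractor
{
    // Hypothetical helper (not part of Aspose.Pdf): shrinks every text fragment
    // by shrinkRatio, reloads the document, and returns the extracted text.
    public static string ExtractWithReducedFont(string pdfPath, float shrinkRatio)
    {
        // Validate before touching the file: the ratio must shrink (or keep) the font size.
        if (shrinkRatio <= 0f || shrinkRatio > 1f)
        {
            throw new ArgumentOutOfRangeException(
                nameof(shrinkRatio), "Expected a value greater than 0 and at most 1.");
        }

        using (var source = new Document(pdfPath))
        using (var buffer = new MemoryStream())
        {
            // Collect every text fragment and scale its font size.
            var fragments = new TextFragmentAbsorber();
            source.Pages.Accept(fragments);
            foreach (TextFragment fragment in fragments.TextFragments)
            {
                fragment.TextState.FontSize = fragment.TextState.FontSize * shrinkRatio;
            }

            // Save the modified document to memory and rewind the stream.
            // Rewinding is a precaution added here; the original sample reloads without it.
            source.Save(buffer);
            buffer.Position = 0;

            // Reload the shrunken document and extract its text.
            using (var shrunk = new Document(buffer))
            {
                var absorber = new TextAbsorber();
                shrunk.Pages.Accept(absorber);
                return absorber.Text;
            }
        }
    }
}
```

A call such as ColumnTextExtractor.ExtractWithReducedFont(path + "net_New-age NED's.pdf", 0.7f) would correspond to the 70% reduction used in the samples above. Taking the ratio as a parameter makes it easier to try smaller values on documents whose columns still run together.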
 
//Second approach - Using ScaleFactor
//[C# Code Sample]
 
// inputFile and outFile are assumed to hold the paths of the source PDF and the output text file
Document pdfDocument = new Document(inputFile);
TextAbsorber textAbsorber = new TextAbsorber();
textAbsorber.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
// A scale factor of 0.5 is enough to split columns in the majority of documents.
// A value of zero lets the algorithm choose the scale factor automatically.
textAbsorber.ExtractionOptions.ScaleFactor = 0.5;
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;
System.IO.File.WriteAllText(outFile, extractedText);
 
 
// [VB.NET Code Sample]
 
' inputFile and outFile are assumed to hold the paths of the source PDF and the output text file
Dim pdfDocument As Document = New Document(inputFile)
Dim textAbsorber As Aspose.Pdf.Text.TextAbsorber = New Aspose.Pdf.Text.TextAbsorber()
textAbsorber.ExtractionOptions = New Aspose.Pdf.Text.TextExtractionOptions(Aspose.Pdf.Text.TextExtractionOptions.TextFormattingMode.Pure)
' A scale factor of 0.5 is enough to split columns in the majority of documents.
' A value of zero lets the algorithm choose the scale factor automatically.
textAbsorber.ExtractionOptions.ScaleFactor = 0.5
pdfDocument.Pages.Accept(textAbsorber)
Dim extractedText As String = textAbsorber.Text
System.IO.File.WriteAllText(outFile, extractedText)
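
Because the best scale factor can differ from one document to another, it may help to extract the same file with several values and compare the results. The sketch below is one possible way to do that, not an Aspose-provided utility: the class ScaleFactorComparison, its methods, the candidate values, and the output file names are all assumptions made for illustration. It assumes the Aspose.Pdf for .NET package is referenced and that Document can be disposed with a using statement, and it uses only the Aspose calls from the samples above.

```csharp
using System;
using System.Globalization;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ScaleFactorComparison
{
    // Hypothetical helper: extracts text in Pure mode with the given scale factor.
    public static string Extract(string pdfPath, double scaleFactor)
    {
        using (var document = new Document(pdfPath))
        {
            var absorber = new TextAbsorber();
            absorber.ExtractionOptions =
                new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
            absorber.ExtractionOptions.ScaleFactor = scaleFactor;
            document.Pages.Accept(absorber);
            return absorber.Text;
        }
    }

    // Hypothetical naming scheme for the output files, e.g. "Extracted_0.5.txt".
    public static string OutputName(double scaleFactor)
    {
        return "Extracted_" + scaleFactor.ToString(CultureInfo.InvariantCulture) + ".txt";
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("Usage: ScaleFactorComparison <input.pdf> <output folder>");
            return;
        }

        string inputFile = args[0];
        string outputFolder = args[1];

        // 0 asks the algorithm to choose automatically; the other values lie in the
        // 0.1 to 1 range that the article describes. The exact list is an arbitrary choice.
        double[] candidates = { 0, 0.5, 0.25, 0.1 };

        foreach (double scaleFactor in candidates)
        {
            string text = Extract(inputFile, scaleFactor);
            File.WriteAllText(Path.Combine(outputFolder, OutputName(scaleFactor)), text);
        }
    }
}
```

Opening the resulting text files side by side shows which value keeps the columns separated for a given document.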
 
