Parser - pdf
Introduction
The PDF Document Parser is an implementation of the Document Parser interface that parses the contents of PDF files into plain text. It follows the Eino: Document Loader guide and is mainly used in the following scenarios:
- When you need to convert PDF documents into a processable plain text format
- When you need to split the contents of a PDF document by page
Features
The PDF parser has the following features:
- Supports basic PDF text extraction
- Optionally splits documents by page
- Automatically handles PDF fonts and encoding
- Supports multi-page PDF documents
Notes:
- Not all PDF formats are fully supported at present
- Formatting such as spaces and line breaks is not retained
- Complex PDF layouts may affect extraction results
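Because not every PDF variant is supported, it can help to reject inputs that are clearly not PDFs before handing them to the parser. The following is a minimal sketch using only the Go standard library, relying on the fact that a PDF file normally begins with the marker %PDF-. The helpers looksLikePDF and checkPDF are hypothetical names introduced for illustration; they are not part of eino-ext.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// looksLikePDF reports whether head begins with the "%PDF-" file marker.
func looksLikePDF(head []byte) bool {
	return bytes.HasPrefix(head, []byte("%PDF-"))
}

// checkPDF peeks at the first bytes of r without consuming them and
// returns a reader that still yields the whole stream.
func checkPDF(r io.Reader) (io.Reader, bool) {
	br := bufio.NewReader(r)
	// A stream shorter than 5 bytes yields fewer bytes and fails the check.
	head, _ := br.Peek(5)
	return br, looksLikePDF(head)
}

func main() {
	r, ok := checkPDF(strings.NewReader("%PDF-1.7\nrest of file"))
	rest, _ := io.ReadAll(r) // the peeked bytes are still part of the stream
	fmt.Println(ok, len(rest))

	_, ok = checkPDF(strings.NewReader("plain text"))
	fmt.Println(ok)
}
```

The reader returned by checkPDF can be passed on to the parser in place of the original one, since peeking does not consume any input. A passing check only means the header is present; it does not guarantee that extraction will succeed.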
Usage
Component Initialization
The PDF parser is initialized using the NewPDFParser function, with the main configuration parameters as follows:
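package main
import (
	"context"
	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
)
func main() {
	ctx := context.Background()
	parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
		ToPages: true, // Whether to split the document by page
	})
	if err != nil {
		panic(err)
	}
	_ = parser // Call parser.Parse as described under Parsing Documents
}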
Configuration parameters description:
- ToPages: whether to split the PDF into multiple documents by page; the default is false
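To make the effect of ToPages concrete, the following self-contained sketch models it with plain strings instead of a real PDF. The Document type, the buildDocs helper, and the newline used to join pages are assumptions made for illustration; they are not the eino-ext implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for a parsed document.
type Document struct {
	Content string
}

// buildDocs models the ToPages switch: one Document per page when
// toPages is true, otherwise a single Document holding all page text.
func buildDocs(pages []string, toPages bool) []Document {
	if !toPages {
		// Assumption: pages are joined with a newline.
		return []Document{{Content: strings.Join(pages, "\n")}}
	}
	docs := make([]Document, 0, len(pages))
	for _, text := range pages {
		docs = append(docs, Document{Content: text})
	}
	return docs
}

func main() {
	pages := []string{"first page", "second page"}
	fmt.Println(len(buildDocs(pages, false))) // prints 1
	fmt.Println(len(buildDocs(pages, true)))  // prints 2
}
```

Splitting by page is useful when later steps need page-level granularity, while a single document is simpler when the whole text is processed at once.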
Parsing Documents
Document parsing is done using the Parse method:
docs, err := parser.Parse(ctx, reader, opts...)
Parsing options:
- Set the document URI using parser.WithURI
- Add extra metadata using parser.WithExtraMeta
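Options such as these are passed as variadic arguments to Parse. The sketch below shows the general functional-options pattern that this call style suggests. The options struct and the withURI, withExtraMeta, and collect names are hypothetical local stand-ins, not the eino parser package itself.

```go
package main

import "fmt"

// options is a hypothetical container for per-call settings.
type options struct {
	uri       string
	extraMeta map[string]any
}

// option mutates options; hypothetical, following the functional-options pattern.
type option func(*options)

// withURI records the document URI.
func withURI(uri string) option {
	return func(o *options) { o.uri = uri }
}

// withExtraMeta records extra metadata to attach to parsed documents.
func withExtraMeta(meta map[string]any) option {
	return func(o *options) { o.extraMeta = meta }
}

// collect applies options in order, so later options override earlier ones.
func collect(opts ...option) options {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := collect(
		withURI("document.pdf"),
		withExtraMeta(map[string]any{"source": "./document.pdf"}),
	)
	fmt.Println(o.uri, o.extraMeta["source"])
}
```

With this pattern a caller supplies only the options it needs, and new options can be added later without changing the Parse signature.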
Complete Usage Example
Basic Usage
package main

import (
	"context"
	"os"

	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
	"github.com/cloudwego/eino/components/document/parser"
)

func main() {
	ctx := context.Background()

	// Initialize the parser
	p, err := pdf.NewPDFParser(ctx, &pdf.Config{
		ToPages: false, // Do not split by page
	})
	if err != nil {
		panic(err)
	}

	// Open the PDF file
	file, err := os.Open("document.pdf")
	if err != nil {
		panic(err)
	}
	defer file.Close()

	// Parse the document
	docs, err := p.Parse(ctx, file,
		parser.WithURI("document.pdf"),
		parser.WithExtraMeta(map[string]any{
			"source": "./document.pdf",
		}),
	)
	if err != nil {
		panic(err)
	}

	// Use the parsed results
	for _, doc := range docs {
		println(doc.Content)
	}
}
Using loader
Refer to the example in the Eino: Document Loader guide
Last modified February 21, 2025: doc: add eino english docs (#1255) (4f6a3bd)