Loader - web url

Basic Introduction

The URL Document Loader is an implementation of the Document Loader interface, used to load document content from web URLs. This component implements the Eino: Document Loader guide.

Feature Introduction

The URL Document Loader has the following features:

  • Default support for HTML web content parsing
  • Customizable HTTP client configurations (e.g., custom proxies, etc.)
  • Supports custom content parsers (e.g., body, or other specific containers)

Usage

Component Initialization

The URL Document Loader is initialized using the NewLoader function with the main configuration parameters as follows:

import (
  "github.com/cloudwego/eino-ext/components/document/loader/url"
)

func main() {
    loader, err := url.NewLoader(ctx, &url.LoaderConfig{
        Parser:         parser,
        Client:         httpClient,
        RequestBuilder: requestBuilder,
    })
}

Explanation of configuration parameters:

  • Parser: Document parser, defaults to the HTML parser, which extracts the main content of the web page
  • Client: HTTP client which can be customized with timeout, proxy, and other configurations
  • RequestBuilder: Request builder used to customize request methods, headers, etc.

Loading Documents

Documents are loaded through the Load method:

docs, err := loader.Load(ctx, document.Source{
    URI: "https://example.com/document",
})

Note:

  • The URI must be a valid HTTP/HTTPS URL
  • The default request method is GET
  • If other HTTP methods or custom headers are needed, configure the RequestBuilder, for example in authentication scenarios

Complete Usage Example

Basic Usage

package main

import (
    "context"
    
    "github.com/cloudwego/eino-ext/components/document/loader/url"
    "github.com/cloudwego/eino/components/document"
)

func main() {
    ctx := context.Background()
    
    // Initialize the loader with default configuration
    loader, err := url.NewLoader(ctx, nil)
    if (err != nil) {
        panic(err)
    }
    
    // Load documents
    docs, err := loader.Load(ctx, document.Source{
        URI: "https://example.com/article",
    })
    if (err != nil) {
        panic(err)
    }
    
    // Use document content
    for _, doc := range docs {
        println(doc.Content)
    }
}

Custom Configuration Example

package main

import (
    "context"
    "net/http"
    "time"
    
    "github.com/cloudwego/eino-ext/components/document/loader/url"
    "github.com/cloudwego/eino/components/document"
)

func main() {
    ctx := context.Background()
    
    // Custom HTTP client
    client := &http.Client{
        Timeout: 10 * time.Second,
    }
    
    // Custom request builder
    requestBuilder := func(ctx context.Context, src document.Source, opts ...document.LoaderOption) (*http.Request, error) {
        req, err := http.NewRequestWithContext(ctx, "GET", src.URI, nil)
        if err != nil {
            return nil, err
        }
        // Add custom headers
        req.Header.Add("User-Agent", "MyBot/1.0")
        return req, nil
    }
    
    // Initialize the loader
    loader, err := url.NewLoader(ctx, &url.LoaderConfig{
        Client:         client,
        RequestBuilder: requestBuilder,
    })
    if (err != nil) {
        panic(err)
    }
    
    // Load documents
    docs, err := loader.Load(ctx, document.Source{
        URI: "https://example.com/article",
    })
    if (err != nil) {
        panic(err)
    }
    
    // Use document content
    for _, doc := range docs {
        println(doc.Content)
    }
}

Last modified February 21, 2025 : doc: add eino english docs (#1255) (4f6a3bd)