Data Extraction

rtrvr.ai excels at extracting structured data from websites into Google Sheets, whether from a single page, a paginated sequence of pages, or multiple open tabs. This page guides you through the data extraction tools.

Core Concepts

Data extraction in rtrvr.ai is handled through a planner that determines the appropriate tool to call based on your needs. You can also directly call these tools yourself using the @ syntax:

  • @extractToSheets() on current tab: Extracts data from the current active tab. The output will have multiple rows per tab, suitable for capturing all relevant information from a single page.

  • @extractToSheets() on multiple tabs: Extracts data from multiple tabs that you select in the 'Action Tabs' modal. The Web Agent will perform actions on each selected tab and extract data, resulting in one row per tab.

  • @crawlWebPages(): Designed for navigating and extracting data from paginated listings (e.g., Amazon product search results). You have two options:

    • Sequential Extraction: Extract data from each page in the sequence, resulting in multiple rows per page.
    • Linked Page Extraction: Open each link from the listing as a new tab and extract data, producing one row per linked page (tab).
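
To make the output shapes concrete, here is a rough sketch of the resulting Google Sheet for two of the modes above. The column names and values are hypothetical, not output from a real run:

```
@extractToSheets() on current tab — multiple rows per tab:
  review_text           | rating | date
  "Great product..."    | 5      | 2024-01-02
  "Broke after a week"  | 2      | 2024-01-05

@extractToSheets() on multiple tabs — one row per tab:
  tab   | title         | url
  Tab 1 | "Acme Widget" | https://example.com/acme
  Tab 2 | "Beta Gadget" | https://example.com/beta
```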

Using @extractToSheets() on Current Tab

The @extractToSheets() tool, when used on the currently active tab, performs actions and extracts data from that single page. It's ideal when you need to capture all relevant information from a single webpage, potentially involving interactions like button clicks or form submissions.

Examples

  • @extractToSheets(prompt='Click the "Add to Cart" button and then extract the product name, price, and quantity'): Performs an action (clicking a button) and then extracts data from the modified page. The output would have multiple rows representing the extracted information.

  • @extractToSheets(prompt='Fill out the contact form with the provided details and extract the confirmation message'): Fills a form and extracts the resulting message. The output would contain rows with the extracted message.

  • @extractToSheets(prompt='Extract all review text, ratings, and dates'): Extracts multiple rows of data from the current page without any interactions.

Using @crawlWebPages()

The @crawlWebPages() tool is your go-to for handling paginated listings, such as search results, product listings, or any website where content is spread across multiple pages.

Crawl Modes

1. Sequential Extraction

In this mode, @crawlWebPages() will navigate through each page of the listing sequentially and extract data. The output will contain multiple rows for each page, capturing all relevant information from each page in the sequence.

Examples

  • @crawlWebPages(prompt='Extract all product names and prices from each page of the search results'): Navigates through each page of the search results and extracts product information. The output will contain multiple rows for each page.

  • @crawlWebPages(prompt='Sequential: Get all article titles and publication dates from the blog archive'): Goes through paginated blog posts and extracts data from each page.

2. Linked Page Extraction

In this mode, @crawlWebPages() will identify links on each page of the listing (e.g., links to individual product pages), open each link as a new tab, and then extract data from that newly opened tab. The result is one row per linked page (tab).

Examples

  • @crawlWebPages(prompt='For each product, extract the name, price, and description'): Crawls through a product listing, opens each product link in a new tab, and extracts information from each product page. The output will have one row per product tab.

  • @crawlWebPages(prompt='For every PDF file linked on this page, extract the paper title and authors'): Opens each linked PDF in a new tab and extracts data. The output contains one row per PDF tab.

Using @extractToSheets() on Multiple Tabs

The @extractToSheets() tool lets you work with multiple tabs simultaneously when you select them in the 'Action Tabs' modal. The Web Agent will perform actions and extract data from each selected tab. The result is one row per tab.

Examples

  • @extractToSheets(prompt='Extract the title and URL'): With multiple tabs selected, extracts data from all selected tabs. The output has one row per tab with the title and URL.

  • @extractToSheets(prompt='For each tab, click the "Download" button and extract the filename'): Performs an action on each selected tab and extracts data. The output would have one row per tab with the extracted filename.

  • @extractToSheets(prompt='Extract company name, employee count, and industry'): When you have multiple company profile tabs open, extracts consistent data from each tab.

Guiding Extraction with Recordings

You can provide recordings to guide the AI Web Agent in performing specific actions before data extraction. This is particularly useful for complex interactions or when the agent needs to follow a specific sequence of steps.

How to Use Recordings

  • Record Your Actions: Use the recording feature to capture the steps you want the agent to perform. This could involve clicking buttons, filling forms, navigating menus, etc.

  • Supply the Recording for Extraction: When using any extraction tool, you can select a recording under Advanced Options along with your extraction instructions. The agent will first execute the actions in your prompt, guided by the recorded steps, and then proceed with the data extraction.

Example: Using a Recording with @crawlWebPages()

  • Scenario: You want to extract product data from an e-commerce site with a uniquely designed pagination system.

  • Recording: Create a recording that demonstrates how to find and click the "Next Page" button on the site.

  • Tool Call: Use @crawlWebPages() with the selected recording. For example: @crawlWebPages(prompt='Sequential: Extract product name and price on main page') along with the recording selection for the "Next Page" action. The agent will use this recording to navigate through the pages and extract the specified data.

Automatic Schema Detection

For all extraction tools, you can use minimal prompts or even empty prompts. rtrvr.ai will automatically determine the most relevant data to extract based on the structure of the web pages. This makes it even easier to quickly gather data without needing to specify precise extraction instructions.

Examples

  • @extractToSheets(): Without any prompt, automatically extracts relevant data from the current page.

  • @crawlWebPages(prompt='extract'): A minimal prompt that lets the AI determine what data is most relevant to extract.

Special Image Handling

When you extract image source URLs and use the column name 'image', rtrvr.ai will automatically wrap the URLs with the '=IMAGE()' function when exporting to Google Sheets. This will make the images render directly within the spreadsheet.
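
For instance, if a column is named 'image' and a row's extracted source URL is https://example.com/product-photo.jpg (a hypothetical URL), the exported cell contains the standard Google Sheets formula:

```
=IMAGE("https://example.com/product-photo.jpg")
```

Google Sheets evaluates =IMAGE() to render the referenced image inside the cell; columns with other names are exported as plain URL text.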

Tips for Effective Data Extraction

  • Let the Planner Help: If you're unsure which tool to use, describe what you want to achieve and let rtrvr.ai's planner determine the appropriate tool.

  • Direct Tool Calling: For more control, call tools directly using the @ syntax: @extractToSheets() or @crawlWebPages().

  • Choose the Right Tool: Use @extractToSheets() on the current tab for single-page extraction, @crawlWebPages() for paginated listings, and @extractToSheets() with multiple selected tabs for batch processing.

  • Be Specific (When Needed): If you need particular data, clearly state what you want to extract. For instance, instead of "Extract the info", say "Extract the product name, price, and description". You can also use minimal or empty prompts for automatic schema detection.

  • Use Clear Labels: When specifying data elements, use labels that are easy for rtrvr.ai to understand (e.g., "product name", "author", "price"). Remember that using 'image' as a column name will enable special image rendering in Google Sheets.

  • Test and Refine: Start with a small test set of data to confirm the extraction is working as you expect, and refine your command if needed.

Next Steps