Build LLM-powered research pipelines and output structured data.
DataSleuth is a modular AI-powered research engine that transforms natural language queries into structured, validated data. It orchestrates information gathering, fact checking, analysis, and synthesis using customizable pipelines and LLM integration to deliver research results in your specified format.
npm install @plust/datasleuth
import { research } from '@plust/datasleuth';
import { z } from 'zod';
import { openai } from '@ai-sdk/openai';
// Define the structure of your research results
const outputSchema = z.object({
summary: z.string(),
keyFindings: z.array(z.string()),
sources: z.array(z.string().url()),
});
// Execute research
const results = await research({
query: 'Latest advancements in quantum computing',
outputSchema,
defaultLLM: openai('gpt-4o'),
});
console.log(results);
This quick-start example relies on the default pipeline, the simplest way to use @plust/datasleuth: when no custom steps are supplied, research() runs its built-in pipeline end to end.
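Because the results are validated against outputSchema before being returned, the shape of the object is guaranteed. As a rough standalone illustration of what that guarantee means (a hand-rolled type guard instead of zod, so the snippet runs without dependencies):

```typescript
// Illustrative only: a hand-rolled guard mirroring the outputSchema above.
// In real use, zod performs this validation inside research().
interface ResearchResult {
  summary: string;
  keyFindings: string[];
  sources: string[];
}

function isResearchResult(value: unknown): value is ResearchResult {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Partial<ResearchResult>;
  return (
    typeof v.summary === 'string' &&
    Array.isArray(v.keyFindings) &&
    v.keyFindings.every((k) => typeof k === 'string') &&
    Array.isArray(v.sources) &&
    v.sources.every((s) => typeof s === 'string')
  );
}
```

A zod schema does this check (plus URL validation) for you; the guard above only shows the structural contract the caller can rely on.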
For more control, configure a custom pipeline with specific steps:
import {
research,
plan,
searchWeb,
extractContent,
evaluate,
repeatUntil,
} from '@plust/datasleuth';
import { z } from 'zod';
import { google } from '@plust/search-sdk';
import { openai } from '@ai-sdk/openai';
// Configure a search provider
const googleSearch = google.configure({
apiKey: process.env.GOOGLE_API_KEY,
cx: process.env.GOOGLE_CX,
});
// Define complex output schema
const outputSchema = z.object({
summary: z.string(),
threats: z.array(z.string()),
opportunities: z.array(z.string()),
timeline: z.array(
z.object({
year: z.number(),
event: z.string(),
})
),
sources: z.array(
z.object({
url: z.string().url(),
reliability: z.number().min(0).max(1),
})
),
});
// Execute research with custom pipeline steps
const results = await research({
query: 'Impact of climate change on agriculture',
outputSchema,
steps: [
plan({ llm: openai('gpt-4o') }),
searchWeb({ provider: googleSearch, maxResults: 10 }),
extractContent({ selectors: 'article, .content, main' }),
repeatUntil(evaluate({ criteriaFn: (data) => data.sources.length > 15 }), [
searchWeb({ provider: googleSearch }),
extractContent(),
]),
],
config: {
errorHandling: 'continue',
timeout: 60000, // 1 minute
},
});
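The repeatUntil step in this pipeline keeps gathering sources until the evaluate condition passes. Its control flow can be sketched in plain TypeScript as follows (the names and state shape are illustrative, not the library's internals):

```typescript
// Illustrative control flow for repeatUntil: run the inner steps, re-check
// the condition, and stop at maxIterations to avoid looping forever.
type Step<S> = (state: S) => Promise<S> | S;

async function repeatUntilMet<S>(
  condition: (state: S) => boolean,
  steps: Step<S>[],
  initial: S,
  maxIterations = 5,
): Promise<S> {
  let state = initial;
  for (let i = 0; i < maxIterations && !condition(state); i++) {
    for (const step of steps) {
      state = await step(state);
    }
  }
  return state;
}

// Example: keep "searching" until more than 15 sources are collected.
const gatherSources: Step<{ sources: string[] }> = (s) => ({
  sources: [
    ...s.sources,
    ...Array.from({ length: 6 }, (_, i) => `https://example.com/${s.sources.length + i}`),
  ],
});
```

Each loop iteration runs all inner steps once before re-checking the condition, mirroring the repeatUntil(conditionStep, stepsToRepeat, options) contract documented in the API reference.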
@plust/datasleuth seamlessly integrates with the Vercel AI SDK, allowing you to use any supported LLM provider:
import {
research,
plan,
analyze,
factCheck,
summarize,
} from '@plust/datasleuth';
import { z } from 'zod';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
// Define your output schema
const outputSchema = z.object({
summary: z.string(),
analysis: z.object({
insights: z.array(z.string()),
}),
factChecks: z.array(
z.object({
statement: z.string(),
isValid: z.boolean(),
})
),
});
// Use different LLM providers for different steps
const results = await research({
query: 'Advancements in gene editing technologies',
outputSchema,
steps: [
// Use OpenAI for research planning
plan({
llm: openai('gpt-4o'),
temperature: 0.4,
}),
// Use Anthropic for specialized analysis
analyze({
llm: anthropic('claude-3-opus-20240229'),
focus: 'ethical-considerations',
depth: 'comprehensive',
}),
// Use OpenAI for fact checking
factCheck({
llm: openai('gpt-4o'),
threshold: 0.8,
includeEvidence: true,
}),
// Use Anthropic for final summarization
summarize({
llm: anthropic('claude-3-sonnet-20240229'),
format: 'structured',
maxLength: 2000,
}),
],
});
Run multiple research tracks concurrently and merge the results:
import {
research,
track,
parallel,
searchWeb,
extractContent,
analyze,
summarize,
ResultMerger,
} from '@plust/datasleuth';
import { z } from 'zod';
import { google, bing } from '@plust/search-sdk';
import { openai } from '@ai-sdk/openai';
// Configure search providers
const googleSearch = google.configure({ apiKey: process.env.GOOGLE_API_KEY });
const bingSearch = bing.configure({ apiKey: process.env.BING_API_KEY });
// Define your output schema
const outputSchema = z.object({
summary: z.string(),
findings: z.array(
z.object({
topic: z.string(),
details: z.string(),
confidence: z.number(),
})
),
sources: z.array(z.string().url()),
});
// Execute parallel research tracks
const results = await research({
query: 'Quantum computing applications in healthcare',
outputSchema,
steps: [
parallel({
tracks: [
track({
name: 'academic',
steps: [
searchWeb({
provider: googleSearch,
query: 'quantum computing healthcare scholarly articles',
}),
extractContent(),
analyze({
llm: openai('gpt-4o'),
focus: 'academic-research',
}),
],
}),
track({
name: 'commercial',
steps: [
searchWeb({
provider: bingSearch,
query: 'quantum computing healthcare startups companies',
}),
extractContent(),
analyze({
llm: openai('gpt-4o'),
focus: 'commercial-applications',
}),
],
}),
],
mergeFunction: ResultMerger.createMergeFunction({
strategy: 'weighted',
weights: { academic: 1.5, commercial: 1.0 },
conflictResolution: 'mostConfident',
}),
}),
summarize({ llm: openai('gpt-4o'), maxLength: 1000 }),
],
});
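The weighted merge above can be pictured as scoring each track's candidate value by confidence times track weight and keeping the winner. A simplified standalone sketch of that idea (not ResultMerger's actual implementation):

```typescript
// Simplified weighted conflict resolution: each track proposes a value with
// a confidence score; the value with the highest confidence * weight wins.
interface TrackResult<T> {
  track: string;
  value: T;
  confidence: number; // 0..1
}

function mergeWeighted<T>(
  candidates: TrackResult<T>[],
  weights: Record<string, number>,
): T {
  let best = candidates[0];
  let bestScore = best.confidence * (weights[best.track] ?? 1);
  for (const c of candidates.slice(1)) {
    const score = c.confidence * (weights[c.track] ?? 1);
    if (score > bestScore) {
      best = c;
      bestScore = score;
    }
  }
  return best.value;
}
```

With weights of 1.5 for academic and 1.0 for commercial, an academic finding at 0.6 confidence (score 0.9) beats a commercial one at 0.8 (score 0.8), which is the point of weighting tracks.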
Use AI agents to dynamically decide which research steps to execute:
import {
research,
orchestrate,
searchWeb,
extractContent,
analyze,
transform,
} from '@plust/datasleuth';
import { z } from 'zod';
import { google, serpapi } from '@plust/search-sdk';
import { openai } from '@ai-sdk/openai';
// Configure search providers
const webSearch = google.configure({ apiKey: process.env.GOOGLE_API_KEY });
const academicSearch = serpapi.configure({
apiKey: process.env.SERPAPI_KEY,
engine: 'google_scholar',
});
// Execute research with orchestration
const results = await research({
query: 'Emerging technologies in renewable energy storage',
outputSchema: z.object({
marketOverview: z.string(),
technologies: z.array(
z.object({
name: z.string(),
maturityLevel: z.enum(['research', 'emerging', 'growth', 'mature']),
costEfficiency: z.number().min(1).max(10),
scalabilityPotential: z.number().min(1).max(10),
keyPlayers: z.array(z.string()),
})
),
forecast: z.object({
shortTerm: z.string(),
mediumTerm: z.string(),
longTerm: z.string(),
}),
sources: z.array(
z.object({
url: z.string().url(),
type: z.enum(['academic', 'news', 'company', 'government']),
relevance: z.number().min(0).max(1),
})
),
}),
steps: [
orchestrate({
llm: openai('gpt-4o'),
tools: {
searchWeb: searchWeb({ provider: webSearch }),
searchAcademic: searchWeb({ provider: academicSearch }),
extractContent: extractContent(),
analyze: analyze(),
// Add your custom tools here
},
customPrompt: `
You are conducting market research on emerging renewable energy storage technologies.
Your goal is to build a comprehensive market overview with technical assessment.
`,
maxIterations: 15,
exitCriteria: (state) =>
state.metadata.confidenceScore > 0.85 &&
state.data.dataPoints?.length > 20,
}),
],
});
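Conceptually, orchestrate runs an agent loop: on each iteration the LLM chooses one registered tool, the tool updates the research state, and the loop stops once exitCriteria passes or maxIterations is reached. A deterministic sketch of that loop, with a plain function standing in for the LLM's tool choice (all names here are illustrative):

```typescript
// Agent-loop sketch for orchestrate(): a chooser (the LLM, stubbed here)
// picks a tool by name each turn until the exit criteria are satisfied.
interface AgentState {
  dataPoints: string[];
  confidenceScore: number;
}
type Tool = (state: AgentState) => AgentState;

function runAgentLoop(
  chooseTool: (state: AgentState) => string,
  tools: Record<string, Tool>,
  exitCriteria: (state: AgentState) => boolean,
  maxIterations = 15,
): AgentState {
  let state: AgentState = { dataPoints: [], confidenceScore: 0 };
  for (let i = 0; i < maxIterations && !exitCriteria(state); i++) {
    state = tools[chooseTool(state)](state);
  }
  return state;
}

// Stub tools: "search" adds a data point, "analyze" raises confidence.
const tools: Record<string, Tool> = {
  search: (s) => ({ ...s, dataPoints: [...s.dataPoints, `point-${s.dataPoints.length}`] }),
  analyze: (s) => ({ ...s, confidenceScore: Math.min(1, s.confidenceScore + 0.2) }),
};
```

The real step differs in two important ways: the chooser is an LLM reasoning over the custom prompt, and the tools are full pipeline steps like searchWeb and extractContent; the termination logic, however, is the same shape as the exitCriteria/maxIterations pair above.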
For complete API documentation, see the API Documentation.
research(options)
The main entry point: executes a research pipeline and returns results matching the output schema.
research({
query: string; // The research query
outputSchema: z.ZodType<any>; // Schema defining the output structure
steps?: ResearchStep[]; // Optional custom pipeline steps
defaultLLM?: LanguageModel; // Default LLM provider for AI-dependent steps
config?: Partial<PipelineConfig>; // Optional configuration
}): Promise<unknown>
plan(options?)
Creates a research plan using LLMs.
plan({
llm?: LanguageModel; // LLM model to use (falls back to defaultLLM)
customPrompt?: string; // Custom system prompt
temperature?: number; // LLM temperature (0.0-1.0)
includeInResults?: boolean; // Whether to include plan in results
}): ResearchStep
searchWeb(options)
Searches the web using configured search providers.
searchWeb({
provider: SearchProvider; // Configured search provider
maxResults?: number; // Maximum results to return
language?: string; // Language code (e.g., 'en')
region?: string; // Region code (e.g., 'US')
safeSearch?: 'off' | 'moderate' | 'strict';
useQueriesFromPlan?: boolean; // Use queries from research plan
}): ResearchStep
extractContent(options?)
Extracts content from web pages.
extractContent({
selectors?: string; // CSS selectors for content
maxUrls?: number; // Maximum URLs to process
maxContentLength?: number; // Maximum content length per URL
includeInResults?: boolean; // Whether to include content in results
}): ResearchStep
factCheck(options?)
Validates information using AI.
factCheck({
llm?: LanguageModel; // LLM model to use
threshold?: number; // Confidence threshold (0.0-1.0)
includeEvidence?: boolean; // Include evidence in results
detailedAnalysis?: boolean; // Perform detailed analysis
}): ResearchStep
analyze(options?)
Performs specialized analysis on collected data.
analyze({
llm?: LanguageModel; // LLM model to use
focus?: string; // Analysis focus ('technical', 'business', etc.)
depth?: 'basic' | 'comprehensive' | 'expert';
includeInResults?: boolean; // Whether to include analysis in results
}): ResearchStep
summarize(options?)
Synthesizes information into concise summaries.
summarize({
llm?: LanguageModel; // LLM model to use
maxLength?: number; // Maximum summary length
format?: 'paragraph' | 'bullet' | 'structured';
includeInResults?: boolean; // Whether to include summary in results
}): ResearchStep
evaluate(options)
Evaluates current state against specified criteria.
evaluate({
criteriaFn: (state) => boolean | Promise<boolean>; // Evaluation function
criteriaName?: string; // Name for this evaluation
confidenceThreshold?: number; // Confidence threshold (0.0-1.0)
}): ResearchStep
repeatUntil(conditionStep, stepsToRepeat, options?)
Repeats steps until a condition is met.
repeatUntil(
conditionStep: ResearchStep, // Step that evaluates condition
stepsToRepeat: ResearchStep[], // Steps to repeat
{
maxIterations?: number; // Maximum iterations
throwOnMaxIterations?: boolean; // Throw error on max iterations
}
): ResearchStep
parallel(options)
Executes multiple research tracks concurrently.
parallel({
tracks: TrackOptions[]; // Array of research tracks
mergeFunction?: MergeFunction; // Function to merge results
continueOnTrackError?: boolean; // Continue if a track fails
}): ResearchStep
track(options)
Creates an isolated research track.
track({
name: string; // Track name
steps: ResearchStep[]; // Steps to execute in this track
initialData?: any; // Initial data for this track
}): ResearchStep
orchestrate(options)
Uses AI agents to make dynamic decisions about research steps.
orchestrate({
llm: LanguageModel; // LLM model for orchestration
tools: Record<string, ResearchStep>; // Available tools for agent
customPrompt?: string; // Custom orchestration prompt
maxIterations?: number; // Maximum iterations
exitCriteria?: (state) => boolean | Promise<boolean>; // Exit condition
}): ResearchStep
transform(options?)
Ensures research output matches the expected schema structure.
transform({
llm?: LanguageModel; // LLM model to use (falls back to defaultLLM)
allowMissingWithDefaults?: boolean; // Auto-fix missing fields with defaults
useLLM?: boolean; // Use LLM for intelligent transformation
temperature?: number; // LLM temperature (0.0-1.0)
systemPrompt?: string; // Custom system prompt
transformFn?: (state) => any; // Custom transformation function
}): ResearchStep
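The allowMissingWithDefaults option implies a default-filling pass before validation. A hypothetical illustration of that idea in plain TypeScript (the field names and defaults are invented for the example):

```typescript
// Hypothetical default-filling pass: absent fields receive defaults so the
// final object can satisfy a strict output schema.
interface ReportShape {
  summary: string;
  keyFindings: string[];
  sources: string[];
}

const reportDefaults: ReportShape = { summary: '', keyFindings: [], sources: [] };

function fillDefaults(partial: Partial<ReportShape>): ReportShape {
  // Spread order matters: explicit values override the defaults.
  return { ...reportDefaults, ...partial };
}
```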
ResultMerger
Utilities for merging results from parallel research tracks.
ResultMerger.createMergeFunction({
strategy: 'mostConfident' | 'first' | 'last' | 'majority' | 'weighted' | 'custom';
weights?: Record<string, number>; // For weighted strategy
customMergeFn?: (results: any[]) => any; // For custom strategy
conflictResolution?: 'mostConfident' | 'first' | 'last' | 'average';
});
@plust/datasleuth provides detailed error types for different failure scenarios:
- ConfigurationError: invalid configuration (missing required parameters, etc.)
- ValidationError: output doesn't match the provided schema
- LLMError: error communicating with a language model
- SearchError: error executing web searches
- ContentExtractionError: error extracting content from web pages
- TimeoutError: operation exceeded the configured timeout
- PipelineError: error in pipeline execution
Each error includes a message, structured details, and remediation suggestions.
Example of handling errors:
import { research, BaseResearchError } from '@plust/datasleuth';
import { z } from 'zod';
try {
const results = await research({
query: 'Quantum computing applications',
outputSchema: z.object({
/*...*/
}),
});
} catch (error) {
if (error instanceof BaseResearchError) {
console.error(`Research error: ${error.message}`);
console.error(`Details: ${JSON.stringify(error.details)}`);
console.error(`Suggestions: ${error.suggestions.join('\n')}`);
} else {
console.error(`Unexpected error: ${error}`);
}
}
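The fields read in the catch block (message, details, suggestions) indicate the general shape these error types share. A standalone sketch of such an error class (illustrative, not the library's source):

```typescript
// Sketch of a research-error shape carrying structured details and
// actionable suggestions, matching the fields used in the catch block above.
class ResearchErrorSketch extends Error {
  constructor(
    message: string,
    public readonly details: Record<string, unknown>,
    public readonly suggestions: string[],
  ) {
    super(message);
    this.name = 'ResearchErrorSketch';
  }
}
```

Because the library's errors extend a common base class, a single instanceof BaseResearchError check is enough to distinguish research failures from unrelated runtime errors.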
For detailed troubleshooting information, see the Troubleshooting Guide.
Contributions are welcome! See CONTRIBUTING.md for details on how to contribute.
MIT