API Reference
Complete reference for all classes and methods in LLM Data Processer.
Utility Functions
read_pdf2text()
Extract text content from a PDF file.
Parameters:
pdf_path(str): Path to the PDF file
Returns:
str: Extracted text from all pages
Raises:
FileNotFoundError: If PDF file doesn't existPDFSyntaxError: If PDF is corrupted
Example:
from llm_helper import read_pdf2text
text = read_pdf2text('document.pdf')
print(f"Extracted {len(text)} characters")
InfoExtractor
Class for extracting structured information from text using custom Pydantic schemas with automatic retry logic.
Constructor
Parameters:
api_provider(str, optional): LLM provider. Currently supports'google'. Default:'google'model(str, optional): Model name. Default:'gemini-2.5-flash'path_env(str, optional): Path to .env file. Default:''
Raises:
ValueError: If unsupported API provider is specified
Example:
from llm_helper import InfoExtractor
extractor = InfoExtractor(
api_provider='google',
model='gemini-2.5-flash'
)
Methods
load_data_schema()
Dynamically creates a Pydantic model based on provided schema data.
Parameters:
schema_data(dict): Schema definition withtech_typeandfieldstech_type(str): Name for the Pydantic modelfields(dict): Field definitions withfield_typeanddescription
Returns:
BaseModel: Dynamically created Pydantic model class
Example:
schema = {
'tech_type': 'DatabaseInfo',
'fields': {
'name': {
'field_type': 'str',
'description': 'Database name'
},
'version': {
'field_type': 'str',
'description': 'Current version'
},
'features': {
'field_type': 'List[str]',
'description': 'Key features'
}
}
}
extractor.load_data_schema(schema)
Supported Field Types:
str: String fieldsint: Integer fieldsfloat: Floating point numbersbool: Boolean fieldsList[str]: List of stringsList[int]: List of integersDict[str, str]: String dictionaries
load_prompt_templates()
Load prompt templates for extraction and error fixing.
Parameters:
base_prompt_dict(dict): Initial extraction prompt withsystemandhumankeysfix_prompt_dict(dict): Retry/fix prompt withsystemandhumankeys
Template Variables:
Base prompt:
- {technology_name}: Name of item being extracted
- {info_source}: Source text
- {format_instructions}: Auto-generated Pydantic format
Fix prompt:
- {technology_name}: Name of item
- {malformed_output}: Previous failed output
- {format_instructions}: Expected format
Example:
base_prompts = {
'system': 'You are an expert at extracting structured data.',
'human': '''Extract information about {technology_name}:
Source: {info_source}
Format: {format_instructions}'''
}
fix_prompts = {
'system': 'You fix malformed JSON outputs.',
'human': '''Fix this output for {technology_name}:
{malformed_output}
Expected format: {format_instructions}'''
}
extractor.load_prompt_templates(base_prompts, fix_prompts)
load_info_source()
Load the information source and technology name for extraction.
Parameters:
technology_name(str): Name/identifier of the technology or entityinfo_source(str): Source text to extract information from
Example:
info_text = """
Redis is an in-memory data structure store...
"""
extractor.load_info_source('Redis', info_text)
validate_setup()
Validates that all required components are configured before extraction.
Returns:
bool:Trueif all components are ready
Raises:
ValueError: If any required component is missing, with detailed error messages
Validates:
- DataSchema is loaded
- Parser is initialized
- Base prompt is configured
- Fix prompt is configured
- Technology name is set
- Info source is provided
Example:
try:
extractor.validate_setup()
print("✓ Setup complete")
except ValueError as e:
print(f"Setup error: {e}")
extract_tech_info()
Attempts to extract structured information with automatic retry on parsing errors.
Parameters:
max_retries(int, optional): Maximum retry attempts for fixing malformed outputs. Default:3
Returns:
BaseModel: Pydantic model instance with extracted data
Raises:
ValueError: If setup validation failsOutputParserException: If extraction fails after all retries
Workflow:
- Validates setup
- Generates initial extraction using base prompt
- Attempts to parse JSON output
- On failure, uses fix prompt to correct output
- Retries up to
max_retriestimes - Returns validated Pydantic object
Example:
try:
result = extractor.extract_tech_info(max_retries=3)
print(f"Name: {result.name}")
print(f"Description: {result.description}")
print(f"Features: {', '.join(result.features)}")
except OutputParserException as e:
print(f"Extraction failed: {e}")
except ValueError as e:
print(f"Setup error: {e}")
Attributes
DataSchema
Dynamically created Pydantic model. Initially None, set by load_data_schema().
llm
LangChain LLM instance for generation.
parser
JSON parser with Pydantic validation. Created by load_data_schema().
technology_name
Name of the entity being extracted. Set by load_info_source().
info_source
Source text for extraction. Set by load_info_source().
base_prompt
LangChain prompt for initial extraction. Set by load_prompt_templates().
fix_prompt
LangChain prompt for fixing malformed outputs. Set by load_prompt_templates().
Complete Example
from llm_helper import InfoExtractor
# Initialize
extractor = InfoExtractor(api_provider='google')
# Define schema
schema = {
'tech_type': 'StorageTechnology',
'fields': {
'name': {'field_type': 'str', 'description': 'Technology name'},
'type': {'field_type': 'str', 'description': 'Storage type'},
'advantages': {'field_type': 'List[str]', 'description': 'Key benefits'},
'disadvantages': {'field_type': 'List[str]', 'description': 'Limitations'}
}
}
# Configure prompts
base_prompts = {
'system': 'Extract structured storage technology information.',
'human': 'Extract data for {technology_name} from: {info_source}\n{format_instructions}'
}
fix_prompts = {
'system': 'Fix malformed JSON to match the schema.',
'human': 'Fix this: {malformed_output}\nFormat: {format_instructions}'
}
# Setup
extractor.load_data_schema(schema)
extractor.load_prompt_templates(base_prompts, fix_prompts)
# Extract
info = "PostgreSQL is a powerful open-source relational database..."
extractor.load_info_source('PostgreSQL', info)
# Get structured result
result = extractor.extract_tech_info(max_retries=3)
print(f"✓ Extracted: {result.name} ({result.type})")
AIHelper
Main class for interacting with Hugging Face models.
Constructor
Parameters:
model_name(str, optional): Model to use. Options:'Llama-3.1','Mistral-7B'. Default:'Mistral-7B'display_response(bool, optional): Whether to display responses in Markdown. Default:True
Example:
Methods
ask()
Generate a response from the LLM.
ask(
prompt: str,
display_response: bool = None,
with_guideline: bool = True,
with_data: bool = True,
with_history: bool = True
) -> str
Parameters:
prompt(str): Your question or instructiondisplay_response(bool, optional): Override instance display settingwith_guideline(bool, optional): Include custom guidelines. Default:Truewith_data(bool, optional): Include attached DataFrames. Default:Truewith_history(bool, optional): Include conversation history. Default:True
Returns:
str: The model's response (ifdisplay_response=False)
Example:
# Basic usage
response = ai.ask("What is Python?")
# Without history
response = ai.ask("Explain AI", with_history=False)
# Without guidelines or data
response = ai.ask("General question", with_guideline=False, with_data=False)
add_guideline()
Add a custom guideline to influence model behavior.
Parameters:
guideline_name(str): Unique name/key for the guidelineguideline(str): A string describing desired behavior
Example:
ai.add_guideline('format', "Respond in bullet points")
ai.add_guideline('length', "Keep responses under 100 words")
ai.add_guideline('tone', "Use professional language")
attach_data()
Attach a pandas DataFrame or string data to the AI context.
Parameters:
data_name(str): Unique name/key for the attached dataattached_data(pd.DataFrame or str): Data to attach (DataFrame or string)
Example:
import pandas as pd
# Attach DataFrame
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
ai.attach_data('employees', df)
# Attach string data
ai.attach_data('context', "Company founded in 2020...")
ai.ask("Who is older in the employee data?")
chat_widget()
Launch an interactive chat interface (Jupyter notebooks only).
Example:
Features:
- Text input area
- Checkboxes for guideline/data toggling
- Ask button
- Live output display
Attributes
chat_history
List of conversation turns. Each entry is a dict with role and content keys.
Example:
guideline
Dictionary of custom guidelines with name/key mapping to guideline text.
Example:
# View guidelines
print(ai.guideline)
# Output: {'format': 'Respond in bullet points', 'length': 'Keep under 100 words'}
# Clear guidelines
ai.guideline = {}
attached_data
Dictionary of attached data with name/key mapping to DataFrame or string data.
Example:
# View attached data
print(ai.attached_data.keys())
# Output: dict_keys(['employees', 'sales_data'])
# Clear data
ai.attached_data = {}
llm_models
Dictionary mapping model names to Hugging Face model IDs.
Available Models:
{
'Llama-3.1': 'meta-llama/Llama-3.1-8B-Instruct',
'Mistral-7B': 'mistralai/Mistral-7B-Instruct-v0.2'
}
config
Configuration dictionary for generation parameters.
Default:
AIHelper_Google
Class for Google Gemini with Google Search grounding.
Constructor
AIHelper_Google(
model: str = 'gemini-2.5-flash',
path_env: str = '',
display_response: bool = True
)
Parameters:
model(str, optional): Gemini model to use. Default:'gemini-2.5-flash'path_env(str, optional): Path to .env file. Default:''display_response(bool, optional): Whether to display responses. Default:True
Example:
Methods
ask()
Generate a response using Google Gemini.
Parameters:
prompt(str): Your question or instructiondisplay_response(bool, optional): Override instance display setting
Returns:
str: The model's response (ifdisplay_response=False)
Example:
# Current events with Google Search
response = ai.ask("What are the latest AI developments in 2025?")
# Fact-checking
response = ai.ask("What is the population of Japan?")
Attributes
history
List of (prompt, response) tuples representing conversation history.
Example:
config
Google Gemini configuration object.
Default Settings:
{
'temperature': 0.3,
'top_p': 0.9,
'top_k': 40,
'candidate_count': 1,
'max_output_tokens': 2000,
'tools': [GoogleSearch()]
}
Configuration Objects
LLM Models
llm_models = {
'Llama-3.1': 'meta-llama/Llama-3.1-8B-Instruct',
'Mistral-7B': 'mistralai/Mistral-7B-Instruct-v0.2'
}
Hugging Face Config
config = {
'max_tokens': 2000, # Maximum response length
'temperature': 0.7 # Randomness (0.0-1.0)
}
Temperature Guide:
0.0-0.3: Focused, deterministic0.4-0.7: Balanced (recommended)0.8-1.0: Creative, varied
Google Gemini Config
config_google = types.GenerateContentConfig(
temperature=0.3,
top_p=0.9,
top_k=40,
candidate_count=1,
max_output_tokens=2000,
tools=[types.Tool(google_search=types.GoogleSearch())]
)
Environment Variables
Required
HF_TOKEN: Hugging Face API tokenGEMINI_API_KEY: Google Gemini API key
Optional
OPENAI_API_KEY: OpenAI API key (for future providers)
Set in Shell:
Or in Python:
Error Handling
Common Exceptions
try:
ai = AIHelper(model_name='Llama-3.1')
response = ai.ask("Test question")
except ImportError as e:
print(f"Missing dependency: {e}")
except KeyError as e:
print(f"API key not found: {e}")
except Exception as e:
print(f"Error: {e}")
Authentication Errors
If API keys are missing or invalid:
# Check if keys are set
import os
if not os.getenv('HF_TOKEN'):
raise ValueError("HF_TOKEN not set. Get one at https://huggingface.co/settings/tokens")
Type Hints
For type checking and IDE support:
from typing import List, Dict, Optional
import pandas as pd
def process_data(ai: AIHelper, df: pd.DataFrame) -> str:
ai.attach_data(df)
response: str = ai.ask("Analyze this data", display_response=False)
return response
Next Steps
- See Usage Guide for practical examples
- Check Examples for complete code samples
- Read Contributing to extend the API