🚀 快速安装

复制以下命令并运行,立即安装此 Skill:

npx skills add https://skills.sh/claude-office-skills/skills/table-extractor

💡 提示:需要 Node.js 和 NPM

表格提取器技能 (Table Extractor Skill)

概述 (Overview)

此技能支持使用 camelot 从 PDF 文档中精确提取表格——这是 PDF 表格提取的黄金标准。以高精度处理具有合并单元格、无边框表格和多页布局的复杂表格。

使用方法 (How to Use)

  1. 提供包含表格的 PDF
  2. 可选择指定页面或表格检测方法
  3. 我将以 pandas DataFrame 形式提取表格

示例提示 (Example prompts):

  • “从这份 PDF 中提取所有表格”
  • “获取此报告第 5 页的表格”
  • “从此文档中提取无边框表格”
  • “将 PDF 表格转换为 Excel 格式”

领域知识 (Domain Knowledge)

camelot 基础知识 (camelot Fundamentals)

import camelot

# 从 PDF 中提取表格 (Extract tables from PDF)
tables = camelot.read_pdf('document.pdf')

# 访问结果 (Access results)
print(f"找到 {len(tables)} 个表格 (Found {len(tables)} tables)")

# 获取第一个表格作为 DataFrame (Get first table as DataFrame)
df = tables[0].df
print(df)

提取方法 (Extraction Methods)

方法 (Method) 使用场景 (Use Case) 描述 (Description)
lattice 带边框的表格 (Bordered tables) 通过线条/边框检测表格 (Detects table by lines/borders)
stream 无边框表格 (Borderless tables) 使用文本定位 (Uses text positioning)
# Lattice 方法(默认) - 适用于有可见边框的表格 (Lattice method - default - for tables with visible borders)
tables = camelot.read_pdf('document.pdf', flavor='lattice')

# Stream 方法 - 适用于无边框表格 (Stream method - for borderless tables)
tables = camelot.read_pdf('document.pdf', flavor='stream')

页面选择 (Page Selection)

# 单页 (Single page)
tables = camelot.read_pdf('document.pdf', pages='1')

# 多页 (Multiple pages)
tables = camelot.read_pdf('document.pdf', pages='1,3,5')

# 页面范围 (Page range)
tables = camelot.read_pdf('document.pdf', pages='1-5')

# 所有页面 (All pages)
tables = camelot.read_pdf('document.pdf', pages='all')

高级选项 (Advanced Options)

Lattice 选项 (Lattice Options)

tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,              # 线条检测灵敏度 (Line detection sensitivity)
    copy_text=['h', 'v'],       # 跨合并单元格复制文本 (Copy text across merged cells)
    shift_text=['l', 't'],      # 文本对齐偏移 (Shift text alignment)
    split_text=True,            # 在换行符处拆分文本 (Split text at newlines)
    flag_size=True,             # 标记上标/下标 (Flag super/subscripts)
    strip_text='\n',            # 要去除的字符 (Characters to strip)
    process_background=False,   # 处理背景线条 (Process background lines)
)

Stream 选项 (Stream Options)

tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,               # 边缘容差 (Edge tolerance)
    row_tol=10,                 # 行容差 (Row tolerance)
    column_tol=0,               # 列容差 (Column tolerance)
    strip_text='\n',            # 要去除的字符 (Characters to strip)
)

表格区域指定 (Table Area Specification)

# 从特定区域提取 (x1, y1, x2, y2) (Extract from specific area - x1, y1, x2, y2)
# 坐标从左下角开始,单位为 PDF 点(72 点 = 1 英寸)(Coordinates from bottom-left, in PDF points - 72 points = 1 inch)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # 一个区域 (One area)
)

# 多个区域 (Multiple areas)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)

列指定 (Column Specification)

# 手动指定列位置(用于 stream 方法)(Manually specify column positions - for stream method)
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # 列分隔符的 X 坐标 (X positions of column separators)
)

处理结果 (Working with Results)

import camelot

tables = camelot.read_pdf('document.pdf')

for i, table in enumerate(tables):
    # 访问 DataFrame (Access DataFrame)
    df = table.df
    
    # 表格元数据 (Table metadata)
    print(f"表格 {i+1} (Table {i+1}):")
    print(f"  页数 (Page): {table.page}")
    print(f"  准确度 (Accuracy): {table.accuracy}")
    print(f"  空白 (Whitespace): {table.whitespace}")
    print(f"  顺序 (Order): {table.order}")
    print(f"  形状 (Shape): {df.shape}")
    
    # 解析报告 (Parsing report)
    report = table.parsing_report
    print(f"  报告 (Report): {report}")

导出选项 (Export Options)

import camelot

tables = camelot.read_pdf('document.pdf')

# 导出到 CSV (Export to CSV)
tables[0].to_csv('table.csv')

# 导出到 Excel (Export to Excel)
tables[0].to_excel('table.xlsx')

# 导出到 JSON (Export to JSON)
tables[0].to_json('table.json')

# 导出到 HTML (Export to HTML)
tables[0].to_html('table.html')

# 导出所有表格 (Export all tables)
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')

可视化调试 (Visual Debugging)

import camelot

# 启用可视化调试 (Enable visual debugging)
tables = camelot.read_pdf('document.pdf')

# 绘制检测到的表格区域 (Plot detected table areas)
camelot.plot(tables[0], kind='contour').show()

# 在表格上绘制文本 (Plot text on table)
camelot.plot(tables[0], kind='text').show()

# 绘制检测到的线条(仅 lattice)(Plot detected lines - lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()

# 保存绘图 (Save plot)
fig = camelot.plot(tables[0])
fig.savefig('debug.png')

处理跨页表格 (Handling Multi-page Tables)

import camelot
import pandas as pd

def extract_multipage_table(pdf_path, pages='all'):
    """提取并合并跨页的表格 (Extract and combine tables that span multiple pages)."""
    
    tables = camelot.read_pdf(pdf_path, pages=pages)
    
    # 按相似结构(列)对表格分组 (Group tables by similar structure - columns)
    table_groups = {}
    
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    
    # 合并相似表格 (Combine similar tables)
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # 合并并去重标题行 (Combine and deduplicate header rows)
            combined_df = pd.concat(dfs, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    
    return combined

最佳实践 (Best Practices)

  1. 尝试两种方法 (Try Both Methods):有边框表格使用 lattice,无边框使用 stream
  2. 检查准确度分数 (Check Accuracy Score):通常 90% 以上是好的
  3. 使用可视化调试 (Use Visual Debugging):理解提取结果
  4. 指定区域 (Specify Areas):对于包含多种表格类型的 PDF
  5. 处理表头 (Handle Headers):第一行通常需要特殊处理

常见模式 (Common Patterns)

批量表格提取 (Batch Table Extraction)

import camelot
from pathlib import Path
import pandas as pd

def batch_extract_tables(input_dir, output_dir):
    """从目录中的所有 PDF 提取表格 (Extract tables from all PDFs in directory)."""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    results = []
    
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            
            for i, table in enumerate(tables):
                # 跳过低准确度的表格 (Skip low accuracy tables)
                if table.accuracy < 80:
                    continue
                
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    
    return results

自动检测表格方法 (Auto-detect Table Method)

import camelot

def smart_extract_tables(pdf_path, pages='1'):
    """尝试两种方法并返回最佳结果 (Try both methods and return best results)."""
    
    # 先尝试 lattice (Try lattice first)
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    # 尝试 stream (Try stream)
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    # 比较并返回最佳结果 (Compare and return best)
    results = []
    
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    
    return results

示例 (Examples)

示例 1:财务报表提取 (Example 1: Financial Statement Extraction)

import camelot
import pandas as pd

def extract_financial_tables(pdf_path):
    """从年度报告中提取财务报表 (Extract financial tables from annual report)."""
    
    # 提取所有表格 (Extract all tables)
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        
        # 识别表格类型 (Identify table type)
        if 'revenue' in text or 'sales' in text:
            if 'operating income' in text or 'net income' in text:
                financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("找到利润表 (Income Statement found):")
    print(financials['income_statement'])

示例 2:科研数据提取 (Example 2: Scientific Data Extraction)

import camelot
import pandas as pd

def extract_research_data(pdf_path, pages='all'):
    """从研究论文中提取数据表 (Extract data tables from research paper)."""
    
    # 为有边框表格尝试 lattice (Try lattice for bordered tables)
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    if not tables or all(t.accuracy < 70 for t in tables):
        # 回退到 stream 方法处理无边框表格 (Fall back to stream for borderless)
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    extracted_data = []
    
    for table in tables:
        df = table.df
        
        # 清理 DataFrame (Clean up the DataFrame)
        # 如果第一行看起来像标题,则将其设置为标题 (Set first row as header if it looks like one)
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    print(f"表格 {i+1} (第 {item['page']} 页, 准确度: {item['accuracy']}%):")
    print(item['data'].head())

示例 3:发票行项目 (Example 3: Invoice Line Items)

import camelot

def extract_invoice_items(pdf_path):
    """从发票中提取行项目 (Extract line items from invoice)."""
    
    # 通常发票有带边框的表格 (Usually invoices have bordered tables)
    tables = camelot.read_pdf(pdf_path, flavor='lattice')
    
    line_items = []
    
    for table in tables:
        df = table.df
        
        # 查找具有典型发票列的表格 (Look for table with typical invoice columns)
        header_text = ' '.join(df.iloc[0].astype(str)).lower()
        
        if any(term in header_text for term in ['quantity', 'qty', 'amount', 'price', 'description']):
            # 这看起来像一个行项目表 (This looks like a line items table)
            df.columns = df.iloc[0]
            df = df[1:]
            
            for _, row in df.iterrows():
                item = {}
                for col in df.columns:
                    col_lower = str(col).lower()
                    value = row[col]
                    
                    if 'desc' in col_lower or 'item' in col_lower:
                        item['description'] = value
                    elif 'qty' in col_lower or 'quantity' in col_lower:
                        item['quantity'] = value
                    elif 'price' in col_lower or 'rate' in col_lower:
                        item['unit_price'] = value
                    elif 'amount' in col_lower or 'total' in col_lower:
                        item['amount'] = value
                
                if item:
                    line_items.append(item)
    
    return line_items

items = extract_invoice_items('invoice.pdf')
for item in items:
    print(item)

示例 4:表格比较 (Example 4: Table Comparison)

import camelot
import pandas as pd

def compare_pdf_tables(pdf1_path, pdf2_path):
    """比较两个 PDF 版本之间的表格 (Compare tables between two PDF versions)."""
    
    tables1 = camelot.read_pdf(pdf1_path)
    tables2 = camelot.read_pdf(pdf2_path)
    
    comparisons = []
    
    # 按形状和位置匹配表格 (Match tables by shape and position)
    for t1 in tables1:
        best_match = None
        best_score = 0
        
        for t2 in tables2:
            if t1.df.shape == t2.df.shape:
                # 计算相似度 (Calculate similarity)
                try:
                    similarity = (t1.df == t2.df).mean().mean()
                    if similarity > best_score:
                        best_score = similarity
                        best_match = t2
                except:
                    pass
        
        if best_match:
            comparisons.append({
                'page1': t1.page,
                'page2': best_match.page,
                'similarity': best_score,
                'identical': best_score == 1.0,
                'diff': pd.DataFrame(t1.df != best_match.df)
            })
    
    return comparisons

comparison = compare_pdf_tables('report_v1.pdf', 'report_v2.pdf')

局限性 (Limitations)

  • 不支持加密的 PDF (Encrypted PDFs not supported)
  • 基于图像的 PDF 需要 OCR 预处理 (Image-based PDFs need OCR preprocessing)
  • 非常复杂的合并单元格可能需要调整 (Very complex merged cells may need tuning)
  • 旋转的表格需要预处理 (Rotated tables require preprocessing)
  • 大型 PDF 可能需要逐页处理 (Large PDFs may need page-by-page processing)

安装 (Installation)

pip install camelot-py[cv]

# 额外依赖 (Additional dependencies)
# macOS
brew install ghostscript tcl-tk

# Ubuntu
apt-get install ghostscript python3-tk

资源 (Resources)

📄 原始文档

完整文档(英文):

https://skills.sh/claude-office-skills/skills/table-extractor

💡 提示:点击上方链接查看 skills.sh 原始英文文档,方便对照翻译。

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。