🚀 快速安装

复制以下命令并运行，立即安装此 Skill：

npx skills add https://skills.sh/claude-office-skills/skills/table-extractor

💡 提示：需要 Node.js 和 NPM

表格提取器技能 (Table Extractor Skill)

概述 (Overview)

此技能支持使用 camelot 从 PDF 文档中精确提取表格——这是 PDF 表格提取的黄金标准。以高精度处理具有合并单元格、无边框表格和多页布局的复杂表格。

使用方法 (How to Use)

提供包含表格的 PDF
可选择指定页面或表格检测方法
我将以 pandas DataFrame 形式提取表格

示例提示 (Example prompts):

“从这份 PDF 中提取所有表格”
“获取此报告第 5 页的表格”
“从此文档中提取无边框表格”
“将 PDF 表格转换为 Excel 格式”

领域知识 (Domain Knowledge)

camelot 基础知识 (camelot Fundamentals)

import camelot

# 从 PDF 中提取表格 (Extract tables from PDF)
tables = camelot.read_pdf('document.pdf')

# 访问结果 (Access results)
print(f"找到 {len(tables)} 个表格 (Found {len(tables)} tables)")

# 获取第一个表格作为 DataFrame (Get first table as DataFrame)
df = tables[0].df
print(df)

提取方法 (Extraction Methods)

方法 (Method)	使用场景 (Use Case)	描述 (Description)
`lattice`	带边框的表格 (Bordered tables)	通过线条/边框检测表格 (Detects table by lines/borders)
`stream`	无边框表格 (Borderless tables)	使用文本定位 (Uses text positioning)

# Lattice 方法（默认） - 适用于有可见边框的表格 (Lattice method - default - for tables with visible borders)
tables = camelot.read_pdf('document.pdf', flavor='lattice')

# Stream 方法 - 适用于无边框表格 (Stream method - for borderless tables)
tables = camelot.read_pdf('document.pdf', flavor='stream')

页面选择 (Page Selection)

# 单页 (Single page)
tables = camelot.read_pdf('document.pdf', pages='1')

# 多页 (Multiple pages)
tables = camelot.read_pdf('document.pdf', pages='1,3,5')

# 页面范围 (Page range)
tables = camelot.read_pdf('document.pdf', pages='1-5')

# 所有页面 (All pages)
tables = camelot.read_pdf('document.pdf', pages='all')

高级选项 (Advanced Options)

Lattice 选项 (Lattice Options)

tables = camelot.read_pdf(
    'document.pdf',
    flavor='lattice',
    line_scale=40,              # 线条检测灵敏度 (Line detection sensitivity)
    copy_text=['h', 'v'],       # 跨合并单元格复制文本 (Copy text across merged cells)
    shift_text=['l', 't'],      # 文本对齐偏移 (Shift text alignment)
    split_text=True,            # 在换行符处拆分文本 (Split text at newlines)
    flag_size=True,             # 标记上标/下标 (Flag super/subscripts)
    strip_text='\n',            # 要去除的字符 (Characters to strip)
    process_background=False,   # 处理背景线条 (Process background lines)
)

Stream 选项 (Stream Options)

tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    edge_tol=500,               # 边缘容差 (Edge tolerance)
    row_tol=10,                 # 行容差 (Row tolerance)
    column_tol=0,               # 列容差 (Column tolerance)
    strip_text='\n',            # 要去除的字符 (Characters to strip)
)

表格区域指定 (Table Area Specification)

# 从特定区域提取 (x1, y1, x2, y2) (Extract from specific area - x1, y1, x2, y2)
# 坐标从左下角开始，单位为 PDF 点（72 点 = 1 英寸）(Coordinates from bottom-left, in PDF points - 72 points = 1 inch)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400'],  # 一个区域 (One area)
)

# 多个区域 (Multiple areas)
tables = camelot.read_pdf(
    'document.pdf',
    table_areas=['72,720,540,400', '72,380,540,200'],
)

列指定 (Column Specification)

# 手动指定列位置（用于 stream 方法）(Manually specify column positions - for stream method)
tables = camelot.read_pdf(
    'document.pdf',
    flavor='stream',
    columns=['100,200,300,400'],  # 列分隔符的 X 坐标 (X positions of column separators)
)

处理结果 (Working with Results)

import camelot

tables = camelot.read_pdf('document.pdf')

for i, table in enumerate(tables):
    # 访问 DataFrame (Access DataFrame)
    df = table.df
    
    # 表格元数据 (Table metadata)
    print(f"表格 {i+1} (Table {i+1}):")
    print(f"  页数 (Page): {table.page}")
    print(f"  准确度 (Accuracy): {table.accuracy}")
    print(f"  空白 (Whitespace): {table.whitespace}")
    print(f"  顺序 (Order): {table.order}")
    print(f"  形状 (Shape): {df.shape}")
    
    # 解析报告 (Parsing report)
    report = table.parsing_report
    print(f"  报告 (Report): {report}")

导出选项 (Export Options)

import camelot

tables = camelot.read_pdf('document.pdf')

# 导出到 CSV (Export to CSV)
tables[0].to_csv('table.csv')

# 导出到 Excel (Export to Excel)
tables[0].to_excel('table.xlsx')

# 导出到 JSON (Export to JSON)
tables[0].to_json('table.json')

# 导出到 HTML (Export to HTML)
tables[0].to_html('table.html')

# 导出所有表格 (Export all tables)
for i, table in enumerate(tables):
    table.to_excel(f'table_{i+1}.xlsx')

可视化调试 (Visual Debugging)

import camelot

# 启用可视化调试 (Enable visual debugging)
tables = camelot.read_pdf('document.pdf')

# 绘制检测到的表格区域 (Plot detected table areas)
camelot.plot(tables[0], kind='contour').show()

# 在表格上绘制文本 (Plot text on table)
camelot.plot(tables[0], kind='text').show()

# 绘制检测到的线条（仅 lattice）(Plot detected lines - lattice only)
camelot.plot(tables[0], kind='joint').show()
camelot.plot(tables[0], kind='line').show()

# 保存绘图 (Save plot)
fig = camelot.plot(tables[0])
fig.savefig('debug.png')

处理跨页表格 (Handling Multi-page Tables)

import camelot
import pandas as pd

def extract_multipage_table(pdf_path, pages='all'):
    """提取并合并跨页的表格 (Extract and combine tables that span multiple pages)."""
    
    tables = camelot.read_pdf(pdf_path, pages=pages)
    
    # 按相似结构（列）对表格分组 (Group tables by similar structure - columns)
    table_groups = {}
    
    for table in tables:
        cols = tuple(table.df.columns)
        if cols not in table_groups:
            table_groups[cols] = []
        table_groups[cols].append(table.df)
    
    # 合并相似表格 (Combine similar tables)
    combined = []
    for cols, dfs in table_groups.items():
        if len(dfs) > 1:
            # 合并并去重标题行 (Combine and deduplicate header rows)
            combined_df = pd.concat(dfs, ignore_index=True)
            combined.append(combined_df)
        else:
            combined.append(dfs[0])
    
    return combined

最佳实践 (Best Practices)

尝试两种方法 (Try Both Methods)：有边框表格使用 lattice，无边框使用 stream
检查准确度分数 (Check Accuracy Score)：通常 90% 以上是好的
使用可视化调试 (Use Visual Debugging)：理解提取结果
指定区域 (Specify Areas)：对于包含多种表格类型的 PDF
处理表头 (Handle Headers)：第一行通常需要特殊处理

常见模式 (Common Patterns)

批量表格提取 (Batch Table Extraction)

import camelot
from pathlib import Path
import pandas as pd

def batch_extract_tables(input_dir, output_dir):
    """从目录中的所有 PDF 提取表格 (Extract tables from all PDFs in directory)."""
    
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    results = []
    
    for pdf_file in input_path.glob('*.pdf'):
        try:
            tables = camelot.read_pdf(str(pdf_file), pages='all')
            
            for i, table in enumerate(tables):
                # 跳过低准确度的表格 (Skip low accuracy tables)
                if table.accuracy < 80:
                    continue
                
                output_file = output_path / f"{pdf_file.stem}_table_{i+1}.xlsx"
                table.to_excel(str(output_file))
                
                results.append({
                    'source': str(pdf_file),
                    'table': i + 1,
                    'page': table.page,
                    'accuracy': table.accuracy,
                    'output': str(output_file)
                })
        
        except Exception as e:
            results.append({
                'source': str(pdf_file),
                'error': str(e)
            })
    
    return results

自动检测表格方法 (Auto-detect Table Method)

import camelot

def smart_extract_tables(pdf_path, pages='1'):
    """尝试两种方法并返回最佳结果 (Try both methods and return best results)."""
    
    # 先尝试 lattice (Try lattice first)
    lattice_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    # 尝试 stream (Try stream)
    stream_tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    # 比较并返回最佳结果 (Compare and return best)
    results = []
    
    if lattice_tables and lattice_tables[0].accuracy > 70:
        results.extend(lattice_tables)
    elif stream_tables:
        results.extend(stream_tables)
    
    return results

示例 (Examples)

示例 1：财务报表提取 (Example 1: Financial Statement Extraction)

import camelot
import pandas as pd

def extract_financial_tables(pdf_path):
    """从年度报告中提取财务报表 (Extract financial tables from annual report)."""
    
    # 提取所有表格 (Extract all tables)
    tables = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    
    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': []
    }
    
    for table in tables:
        df = table.df
        text = df.to_string().lower()
        
        # 识别表格类型 (Identify table type)
        if 'revenue' in text or 'sales' in text:
            if 'operating income' in text or 'net income' in text:
                financial_data['income_statement'] = df
        elif 'asset' in text and 'liabilities' in text:
            financial_data['balance_sheet'] = df
        elif 'cash flow' in text or 'operating activities' in text:
            financial_data['cash_flow'] = df
        else:
            financial_data['other_tables'].append({
                'page': table.page,
                'data': df,
                'accuracy': table.accuracy
            })
    
    return financial_data

financials = extract_financial_tables('annual_report.pdf')
if financials['income_statement'] is not None:
    print("找到利润表 (Income Statement found):")
    print(financials['income_statement'])

示例 2：科研数据提取 (Example 2: Scientific Data Extraction)

import camelot
import pandas as pd

def extract_research_data(pdf_path, pages='all'):
    """从研究论文中提取数据表 (Extract data tables from research paper)."""
    
    # 为有边框表格尝试 lattice (Try lattice for bordered tables)
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor='lattice')
    
    if not tables or all(t.accuracy < 70 for t in tables):
        # 回退到 stream 方法处理无边框表格 (Fall back to stream for borderless)
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor='stream')
    
    extracted_data = []
    
    for table in tables:
        df = table.df
        
        # 清理 DataFrame (Clean up the DataFrame)
        # 如果第一行看起来像标题，则将其设置为标题 (Set first row as header if it looks like one)
        if not df.iloc[0].str.contains(r'\d').any():
            df.columns = df.iloc[0]
            df = df[1:]
            df = df.reset_index(drop=True)
        
        extracted_data.append({
            'page': table.page,
            'accuracy': table.accuracy,
            'data': df
        })
    
    return extracted_data

data = extract_research_data('research_paper.pdf')
for i, item in enumerate(data):
    print(f"表格 {i+1} (第 {item['page']} 页, 准确度: {item['accuracy']}%):")
    print(item['data'].head())

示例 3：发票行项目 (Example 3: Invoice Line Items)

import camelot

def extract_invoice_items(pdf_path):
    """从发票中提取行项目 (Extract line items from invoice)."""
    
    # 通常发票有带边框的表格 (Usually invoices have bordered tables)
    tables = camelot.read_pdf(pdf_path, flavor='lattice')
    
    line_items = []
    
    for table in tables:
        df = table.df
        
        # 查找具有典型发票列的表格 (Look for table with typical invoice columns)
        header_text = ' '.join(df.iloc[0].astype(str)).lower()
        
        if any(term in header_text for term in ['quantity', 'qty', 'amount', 'price', 'description']):
            # 这看起来像一个行项目表 (This looks like a line items table)
            df.columns = df.iloc[0]
            df = df[1:]
            
            for _, row in df.iterrows():
                item = {}
                for col in df.columns:
                    col_lower = str(col).lower()
                    value = row[col]
                    
                    if 'desc' in col_lower or 'item' in col_lower:
                        item['description'] = value
                    elif 'qty' in col_lower or 'quantity' in col_lower:
                        item['quantity'] = value
                    elif 'price' in col_lower or 'rate' in col_lower:
                        item['unit_price'] = value
                    elif 'amount' in col_lower or 'total' in col_lower:
                        item['amount'] = value
                
                if item:
                    line_items.append(item)
    
    return line_items

items = extract_invoice_items('invoice.pdf')
for item in items:
    print(item)

示例 4：表格比较 (Example 4: Table Comparison)

import camelot
import pandas as pd

def compare_pdf_tables(pdf1_path, pdf2_path):
    """比较两个 PDF 版本之间的表格 (Compare tables between two PDF versions)."""
    
    tables1 = camelot.read_pdf(pdf1_path)
    tables2 = camelot.read_pdf(pdf2_path)
    
    comparisons = []
    
    # 按形状和位置匹配表格 (Match tables by shape and position)
    for t1 in tables1:
        best_match = None
        best_score = 0
        
        for t2 in tables2:
            if t1.df.shape == t2.df.shape:
                # 计算相似度 (Calculate similarity)
                try:
                    similarity = (t1.df == t2.df).mean().mean()
                    if similarity > best_score:
                        best_score = similarity
                        best_match = t2
                except:
                    pass
        
        if best_match:
            comparisons.append({
                'page1': t1.page,
                'page2': best_match.page,
                'similarity': best_score,
                'identical': best_score == 1.0,
                'diff': pd.DataFrame(t1.df != best_match.df)
            })
    
    return comparisons

comparison = compare_pdf_tables('report_v1.pdf', 'report_v2.pdf')

局限性 (Limitations)

不支持加密的 PDF (Encrypted PDFs not supported)
基于图像的 PDF 需要 OCR 预处理 (Image-based PDFs need OCR preprocessing)
非常复杂的合并单元格可能需要调整 (Very complex merged cells may need tuning)
旋转的表格需要预处理 (Rotated tables require preprocessing)
大型 PDF 可能需要逐页处理 (Large PDFs may need page-by-page processing)

安装 (Installation)

pip install camelot-py[cv]

# 额外依赖 (Additional dependencies)
# macOS
brew install ghostscript tcl-tk

# Ubuntu
apt-get install ghostscript python3-tk

资源 (Resources)

📄 原始文档

完整文档（英文）：

https://skills.sh/claude-office-skills/skills/table-extractor

💡 提示：点击上方链接查看 skills.sh 原始英文文档，方便对照翻译。

声明：本站所有文章，如无特殊说明或标注，均为本站原创发布。任何个人或组织，在未征得本站同意时，禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益，可联系我们进行处理。

Table Extractor Skill

🚀 快速安装

表格提取器技能 (Table Extractor Skill)

概述 (Overview)

使用方法 (How to Use)

领域知识 (Domain Knowledge)

camelot 基础知识 (camelot Fundamentals)

提取方法 (Extraction Methods)

页面选择 (Page Selection)

高级选项 (Advanced Options)

Lattice 选项 (Lattice Options)

Stream 选项 (Stream Options)

表格区域指定 (Table Area Specification)

列指定 (Column Specification)

处理结果 (Working with Results)

导出选项 (Export Options)

可视化调试 (Visual Debugging)

处理跨页表格 (Handling Multi-page Tables)

最佳实践 (Best Practices)

常见模式 (Common Patterns)

批量表格提取 (Batch Table Extraction)

自动检测表格方法 (Auto-detect Table Method)

示例 (Examples)

示例 1：财务报表提取 (Example 1: Financial Statement Extraction)

示例 2：科研数据提取 (Example 2: Scientific Data Extraction)

示例 3：发票行项目 (Example 3: Invoice Line Items)

示例 4：表格比较 (Example 4: Table Comparison)

局限性 (Limitations)

安装 (Installation)

资源 (Resources)

📄 原始文档

评论(0)

提示：请文明发言取消回复

Table Extractor Skill

🚀 快速安装

表格提取器技能 (Table Extractor Skill)

概述 (Overview)

使用方法 (How to Use)

领域知识 (Domain Knowledge)

camelot 基础知识 (camelot Fundamentals)

提取方法 (Extraction Methods)

页面选择 (Page Selection)

高级选项 (Advanced Options)

Lattice 选项 (Lattice Options)

Stream 选项 (Stream Options)

表格区域指定 (Table Area Specification)

列指定 (Column Specification)

处理结果 (Working with Results)

导出选项 (Export Options)

可视化调试 (Visual Debugging)

处理跨页表格 (Handling Multi-page Tables)

最佳实践 (Best Practices)

常见模式 (Common Patterns)

批量表格提取 (Batch Table Extraction)

自动检测表格方法 (Auto-detect Table Method)

示例 (Examples)

示例 1：财务报表提取 (Example 1: Financial Statement Extraction)

示例 2：科研数据提取 (Example 2: Scientific Data Extraction)

示例 3：发票行项目 (Example 3: Invoice Line Items)

示例 4：表格比较 (Example 4: Table Comparison)

局限性 (Limitations)

安装 (Installation)

资源 (Resources)

📄 原始文档

评论(0)

提示：请文明发言 取消回复

相关文章

geo-fundamentals – 汇易宝

mapkit-location – 汇易宝

query-address-info – 汇易宝

sardis-identity – 汇易宝

提示：请文明发言取消回复