由于 PDF 文档的复杂性，从 PDF 文件中提取表格数据可能是一项具有挑战性的任务。与简单的文本提取不同，表格需要小心处理，以保留表格结构以及行和列之间的关系。您无需从大量 PDF 表中手动提取数据，而是可以通过编程方式简化和自动化此过程。在本文中，我们将演示如何使

用于将 PDF 表格提取为文本、Excel 和 CSV 的 Python 库

要将 PDF 表中的数据提取为文本、excel 和 CSV 文件，我们可以使用 Spire.PDF for Python 和 Spire.XLS for Python 库。Spire.PDF for Python 主要用于从 PDF 中提取表格数据，Spire.XLS for Python 主要用于将提取的表格数据保存为 Excel 和 CSV 文件。

您可以在项目的终端中运行以下 pip 命令来安装 Spire.PDF for Python 和 Spire.XLS for Python：

pip install Spire.Pdf
pip install Spire.Xls

如果您已经安装了 Spire.PDF for Python 和 Spire.XLS for Python，并且想要升级到最新版本，请使用以下 pip 命令：

pip install --upgrade Spire.Pdf
pip install --upgrade Spire.Xls

在 Python 中将 PDF 表格提取为文本

Spire.PDF for Python 提供的
PdfTableExtractor.ExtractTable（pageIndex： int） 函数允许您访问 PDF 中的表。访问后，您可以使用 PdfTable.GetText（rowIndex： int， columnIndex： int） 函数轻松地从表中检索数据。然后，您可以将检索到的数据保存到文本文件中以供以后使用。

以下示例显示了如何使用 Python 和 Spire.PDF for Python 从 PDF 文件中提取表数据并将结果保存到文本文件中：

from spire.pdf import *
from spire.xls import *

# Define an extract_table_data function to extract table data from PDF
def extract_table_data(pdf_path):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()
    
    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)
        # Create a list to store the extracted table data
        table_data = []

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Get tables within each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:

                # Iterate through the tables
                for table_index, table in enumerate(tables):
                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    table_data.append(f"Table {table_index + 1} of Page {page_index + 1}:\n")

                    # Extract data from each table and append the data to the table_data list
                    for row_index in range(row_count):
                        row_data = []
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            row_data.append(data.strip())
                        table_data.append("  ".join(row_data))

                    table_data.append("\n")

        return table_data

    except Exception as e:
        print(f"Error occurred: {str(e)}")
        return None

# Define a save_table_data_to_text function to save the table data extracted from a PDF to a text file
def save_table_data_to_text(table_data, output_path):
    try:
        with open(output_path, "w", encoding="utf-8") as file:
            file.write("\n".join(table_data))
        print(f"Table data saved to '{output_path}' successfully.")
    except Exception as e:
        print(f"Error occurred while saving table data: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
output_path = "table_data.txt"

data = extract_table_data(pdf_path)
if data:
    save_table_data_to_text(data, output_path)

使用 Python 从 PDF 中提取表格

在 Python 中将 PDF 表格提取到 Excel

当您需要对表格数据执行进一步的分析、计算或可视化时，将 PDF 表格提取到 Excel 非常有用。通过将 Spire.PDF for Python 与 Spire.XLS for Python 结合使用，您可以轻松地将数据从 PDF 表格导出到 Excel 工作表。

以下示例显示了如何使用 Spire.PDF for Python 和 Spire.XLS for Python 将数据从 PDF 表导出到 Python 中的 Excel 工作表：

from spire.pdf import *
from spire.xls import *

# Define a function to extract data from PDF tables to Excel
def extract_table_data_to_excel(pdf_path, xls_path):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()

    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Create an instance of the Workbook class
        workbook = Workbook()
        # Remove the default 3 worksheets
        workbook.Worksheets.Clear()
        
        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Extract tables from each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:
                # Iterate through the extracted tables
                for table_index, table in enumerate(tables):
                    # Create a new worksheet for each table
                    worksheet = workbook.CreateEmptySheet()  
                    # Set the worksheet name
                    worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}"  
                    
                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    # Extract data from the table and populate the worksheet
                    for row_index in range(row_count):
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()
                    
                    # Auto adjust column widths of the worksheet
                    worksheet.Range.AutoFitColumns()

        # Save the workbook to the specified Excel file
        workbook.SaveToFile(xls_path, ExcelVersion.Version2013)

    except Exception as e:
        print(f"Error occurred: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
xls_path = "table_data.xlsx"
extract_table_data_to_excel(pdf_path, xls_path)

使用 Python 将 PDF 表格提取到 Excel

在 Python 中将 PDF 表提取为 CSV

CSV 是一种通用格式，可以通过电子表格软件、数据库、编程语言和数据分析工具打开和处理。将 PDF 表格提取为 CSV 格式使数据易于访问并与各种应用程序和工具兼容。

以下示例显示了如何使用 Spire.PDF for Python 和 Spire.XLS for Python 将数据从 PDF 表导出到 Python 中的 CSV 文件：

from spire.pdf import *
from spire.xls import *

# Define a function to extract data from PDF tables to CSV
def extract_table_data_to_csv(pdf_path, csv_directory):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()

    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Create an instance of the Workbook class
        workbook = Workbook()
        # Remove the default 3 worksheets
        workbook.Worksheets.Clear()
        
        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Extract tables from each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:
                # Iterate through the extracted tables
                for table_index, table in enumerate(tables):
                    # Create a new worksheet for each table
                    worksheet = workbook.CreateEmptySheet()  
 
                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    # Extract data from the table and populate the worksheet
                    for row_index in range(row_count):
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()
                    
                    csv_name = csv_directory + f"Table {table_index + 1} of Page {page_index + 1}" + ".csv"

                    # Save each worksheet to a separate CSV file
                    worksheet.SaveToFile(csv_name, ",", Encoding.get_UTF8())

    except Exception as e:
        print(f"Error occurred: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
csv_directory = "CSV/"
extract_table_data_to_csv(pdf_path, csv_directory)

使用 Python 将 PDF 表格提取为 CSV

在 Python 中将 PDF 表格提取为文本、Excel 和 CSV

用于将 PDF 表格提取为文本、Excel 和 CSV 的 Python 库

在 Python 中将 PDF 表格提取为文本

在 Python 中将 PDF 表格提取到 Excel

在 Python 中将 PDF 表提取为 CSV

相关推荐

Python实现人事自动打卡，再也不会被批评

【验证码逆向专栏】vaptcha 手势验证码逆向分析

Psutil + Flask + Pyecharts + Bootstrap 开发动态可视化系统监控

一个解决支持HTML/CSS/JS网页转PDF(高质量)的终极解决方案

再见Swagger UI 国人开源了一款超好用的 API 文档生成框架，真香

网页转成pdf文件的经验分享网页转成pdf文件的经验分享怎么弄

C++ std::vector 简介

飞牛OS入门安装遇到问题，如何解决?

系统C盘清理:微信PC端文件清理，扩大C盘可用空间步骤

10款高性能NAS丨双十一必看，轻松搞定虚拟机、Docker、软路由

在 Python 中将 PDF 表格提取为文本、Excel 和 CSV

用于将 PDF 表格提取为文本、Excel 和 CSV 的 Python 库

在 Python 中将 PDF 表格提取为文本

在 Python 中将 PDF 表格提取到 Excel

在 Python 中将 PDF 表提取为 CSV

相关推荐

Python实现人事自动打卡，再也不会被批评

【验证码逆向专栏】vaptcha 手势验证码逆向分析

Psutil + Flask + Pyecharts + Bootstrap 开发动态可视化系统监控

一个解决支持HTML/CSS/JS网页转PDF(高质量)的终极解决方案

再见Swagger UI 国人开源了一款超好用的 API 文档生成框架，真香

网页转成pdf文件的经验分享 网页转成pdf文件的经验分享怎么弄

C++ std::vector 简介

飞牛OS入门安装遇到问题，如何解决?

系统C盘清理:微信PC端文件清理，扩大C盘可用空间步骤

10款高性能NAS丨双十一必看，轻松搞定虚拟机、Docker、软路由

网页转成pdf文件的经验分享网页转成pdf文件的经验分享怎么弄