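Extracting table data from PDF files can be challenging because of the complexity of the PDF format. Unlike plain text extraction, tables must be handled carefully to preserve their structure and the relationships between rows and columns. Rather than manually copying data out of a large number of PDF tables, you can simplify and automate the process programmatically. In this article, we demonstrate how to extract PDF tables to text, Excel, and CSV files with Python.
Python Libraries for Extracting PDF Tables to Text, Excel, and CSV
To extract data from PDF tables to text, Excel, and CSV files, we can use the Spire.PDF for Python and Spire.XLS for Python libraries. Spire.PDF for Python is used to extract the table data from the PDF, and Spire.XLS for Python is used to save the extracted data as Excel and CSV files.
You can install Spire.PDF for Python and Spire.XLS for Python by running the following pip commands in your project's terminal: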
pip install Spire.Pdf
pip install Spire.Xls
If you already have Spire.PDF for Python and Spire.XLS for Python installed and want to upgrade to the latest versions, use the following pip commands:
pip install --upgrade Spire.Pdf
pip install --upgrade Spire.Xls
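Before running the examples, it can help to confirm that both packages can be found by the interpreter. The following is a minimal sketch that uses only the standard library. The helper name is_importable is introduced here for illustration, and the module names spire.pdf and spire.xls are taken from the import statements used in the examples in this article.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate module_name, otherwise False."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError is raised when a parent package (e.g. "spire") is missing
        return False


# Module names assumed from the imports used by the examples in this article
for name in ("spire.pdf", "spire.xls"):
    status = "available" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If either module is reported as missing, rerun the pip install commands above in the same environment that runs your script.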
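Extract PDF Tables to Text in Python
The PdfTableExtractor.ExtractTable(pageIndex: int) function provided by Spire.PDF for Python gives you access to the tables on a given page of a PDF. Once you have a table, you can retrieve the content of any cell with the PdfTable.GetText(rowIndex: int, columnIndex: int) function, and then save the retrieved data to a text file for later use.
The following example shows how to extract table data from a PDF file and save the result to a text file using Python and Spire.PDF for Python: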
from spire.pdf import *
from spire.xls import *

# Define an extract_table_data function to extract table data from PDF
def extract_table_data(pdf_path):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()

    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)
        # Create a list to store the extracted table data
        table_data = []

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Get tables within each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:
                # Iterate through the tables
                for table_index, table in enumerate(tables):
                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    table_data.append(f"Table {table_index + 1} of Page {page_index + 1}:\n")

                    # Extract data from each table and append the data to the table_data list
                    for row_index in range(row_count):
                        row_data = []
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            row_data.append(data.strip())
                        table_data.append(" ".join(row_data))

                    table_data.append("\n")

        return table_data

    except Exception as e:
        print(f"Error occurred: {str(e)}")
        return None

# Define a save_table_data_to_text function to save the table data extracted from a PDF to a text file
def save_table_data_to_text(table_data, output_path):
    try:
        with open(output_path, "w", encoding="utf-8") as file:
            file.write("\n".join(table_data))
        print(f"Table data saved to '{output_path}' successfully.")
    except Exception as e:
        print(f"Error occurred while saving table data: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
output_path = "table_data.txt"

data = extract_table_data(pdf_path)
if data:
    save_table_data_to_text(data, output_path)
[Image: Extract tables from PDF using Python]
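Extract PDF Tables to Excel in Python
Extracting PDF tables to Excel is useful when you need to perform further analysis, calculations, or visualization on the table data. By using Spire.PDF for Python together with Spire.XLS for Python, you can export data from PDF tables to Excel worksheets.
The following example shows how to export data from PDF tables to Excel worksheets in Python using Spire.PDF for Python and Spire.XLS for Python: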
from spire.pdf import *
from spire.xls import *

# Define a function to extract data from PDF tables to Excel
def extract_table_data_to_excel(pdf_path, xls_path):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()

    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Create an instance of the Workbook class
        workbook = Workbook()
        # Remove the default 3 worksheets
        workbook.Worksheets.Clear()

        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Extract tables from each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:
                # Iterate through the extracted tables
                for table_index, table in enumerate(tables):
                    # Create a new worksheet for each table
                    worksheet = workbook.CreateEmptySheet()
                    # Set the worksheet name
                    worksheet.Name = f"Table {table_index + 1} of Page {page_index + 1}"

                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    # Extract data from the table and populate the worksheet
                    for row_index in range(row_count):
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()

                    # Auto adjust column widths of the worksheet
                    worksheet.Range.AutoFitColumns()

        # Save the workbook to the specified Excel file
        workbook.SaveToFile(xls_path, ExcelVersion.Version2013)

    except Exception as e:
        print(f"Error occurred: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
xls_path = "table_data.xlsx"
extract_table_data_to_excel(pdf_path, xls_path)
[Image: Extract PDF tables to Excel using Python]
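Cell contents extracted from a PDF arrive as strings. If the later analysis or calculations need numeric values, one option is to convert plain numbers before the data is written out. The sketch below is an assumption-laden illustration rather than part of either library: the helper name to_number is hypothetical, and it treats only digits with an optional sign, optional thousands separators, and an optional decimal part as numbers. How the converted value is then assigned to a worksheet cell depends on the Spire.XLS API and is not shown here.

```python
import re

# Optional sign, digits (plain or with comma thousands separators), optional decimals
_NUMBER = re.compile(r"[+-]?([0-9]+|[0-9]{1,3}(,[0-9]{3})+)(\.[0-9]+)?")


def to_number(text):
    """Return an int or float for plain numeric text, else the stripped text."""
    cleaned = text.strip()
    if not _NUMBER.fullmatch(cleaned):
        return cleaned
    digits = cleaned.replace(",", "")
    return float(digits) if "." in digits else int(digits)


# Hypothetical cell values standing in for results of table.GetText(...)
for cell in [" 1,234 ", "3.50", "N/A", "2024-01-05"]:
    print(repr(cell), "->", repr(to_number(cell)))
```

Note that this conversion drops leading zeros, so columns such as postal codes or identifiers are better left as text.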
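Extract PDF Tables to CSV in Python
CSV is a general-purpose format that can be opened and processed by spreadsheet software, databases, programming languages, and data analysis tools. Extracting PDF tables to CSV makes the data easy to access and compatible with a wide range of applications and tools.
The following example shows how to export data from PDF tables to CSV files in Python using Spire.PDF for Python and Spire.XLS for Python: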
from spire.pdf import *
from spire.xls import *

# Define a function to extract data from PDF tables to CSV
def extract_table_data_to_csv(pdf_path, csv_directory):
    # Create an instance of the PdfDocument class
    doc = PdfDocument()

    try:
        # Load a PDF document
        doc.LoadFromFile(pdf_path)

        # Create an instance of the PdfTableExtractor class
        extractor = PdfTableExtractor(doc)

        # Create an instance of the Workbook class
        workbook = Workbook()
        # Remove the default 3 worksheets
        workbook.Worksheets.Clear()

        # Iterate through the pages in the PDF document
        for page_index in range(doc.Pages.Count):
            # Extract tables from each page
            tables = extractor.ExtractTable(page_index)
            if tables is not None and len(tables) > 0:
                # Iterate through the extracted tables
                for table_index, table in enumerate(tables):
                    # Create a new worksheet for each table
                    worksheet = workbook.CreateEmptySheet()

                    row_count = table.GetRowCount()
                    col_count = table.GetColumnCount()

                    # Extract data from the table and populate the worksheet
                    for row_index in range(row_count):
                        for column_index in range(col_count):
                            data = table.GetText(row_index, column_index)
                            worksheet.Range[row_index + 1, column_index + 1].Value = data.strip()

                    csv_name = csv_directory + f"Table {table_index + 1} of Page {page_index + 1}" + ".csv"

                    # Save each worksheet to a separate CSV file
                    worksheet.SaveToFile(csv_name, ",", Encoding.get_UTF8())

    except Exception as e:
        print(f"Error occurred: {str(e)}")

# Example usage
pdf_path = "Tables.pdf"
csv_directory = "CSV/"
extract_table_data_to_csv(pdf_path, csv_directory)
[Image: Extract PDF tables to CSV using Python]
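If all you need is CSV output, the rows collected from a table can also be written with Python's built-in csv module, which quotes cells that contain commas or line breaks. This is an alternative sketch and not the approach used in the example above. It assumes the rows are already available as a list of lists of strings, and the helper name write_table_csv and its parameters are hypothetical. It also creates the output directory if it does not exist yet.

```python
import csv
import tempfile
from pathlib import Path


def write_table_csv(rows, csv_directory, page_number, table_number):
    """Write rows (a list of lists of strings) to one CSV file and return its path."""
    out_dir = Path(csv_directory)
    # Create the output directory if it is missing
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"Table {table_number} of Page {page_number}.csv"
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return csv_path


# Hypothetical rows standing in for values returned by table.GetText(...)
sample_rows = [["Name", "Notes"], ["Widget", "small, blue"]]

# Write to a temporary directory so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = write_table_csv(sample_rows, Path(tmp) / "CSV", 1, 1)
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```

The file naming follows the "Table N of Page M" pattern used in the example above, so the two approaches produce files with the same names.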