Before running any Python script, you can verify if a PDF contains real Khmer text (not just images) using this simple script:
import pypdf

def verify_khmer_pdf(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    sample_text = ""
    for page in reader.pages[:2]:  # Check the first 2 pages
        sample_text += page.extract_text() or ""
    # Khmer Unicode block: U+1780 to U+17FF
    khmer_chars = [c for c in sample_text if '\u1780' <= c <= '\u17FF']
    if len(khmer_chars) > 10:
        print(f"✅ Verified: Found {len(khmer_chars)} Khmer characters.")
        return True
    else:
        print("❌ Not verified: PDF may be scanned image or missing font.")
        return False

verify_khmer_pdf("my_document.pdf")
Before deploying any script, ensure:
| Criterion | Verification Method |
|-----------|---------------------|
| Extractable text | `pypdf.PdfReader(path).pages[0].extract_text()` returns readable Khmer. |
| Correct subscripts | The word "ព្រះ" extracts as consonant + subscript ro + the sign ះ (reahmuk). |
| Copy-paste from Adobe | Paste into Notepad; character order is preserved. |
| Searchable (Ctrl+F) | Find "សាលា" highlights correctly. |
| No missing characters | All 33 modern Khmer consonants are visible. |
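
The subscript criterion can be checked on extracted text by listing its code points. The following is a minimal sketch using only the standard library; `describe` is a hypothetical helper name, not part of any library mentioned here. In Unicode, a Khmer subscript is stored as the sign COENG (U+17D2) followed by the consonant, so a correctly extracted "ព្រះ" should come back as four code points in this order.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# The word "preah", written with escapes so the stored order is explicit.
word = "\u1796\u17d2\u179a\u17c7"

for code, name in describe(word):
    print(code, name)
# U+1796 KHMER LETTER PO
# U+17D2 KHMER SIGN COENG
# U+179A KHMER LETTER RO
# U+17C7 KHMER SIGN REAHMUK
```

If the extracted word yields a different sequence (for example, the COENG is missing or follows the RO), the PDF's text layer is likely not reliable for search or copy-paste.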
$ khmer-pdf-verify check --input suspect.pdf --hash hash.txt
Output: ✅ Document is VERIFIED (Hash matches)
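
The kind of hash comparison reported above can be sketched with the standard library alone. This is not the implementation of `khmer-pdf-verify`; it assumes the hash file holds a SHA-256 hex digest as its first whitespace-separated token, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_recorded_hash(pdf_path, hash_path):
    """Return True if the PDF's SHA-256 equals the digest stored in hash_path.

    Assumption: the first token of the hash file is the hex digest
    (the layout written by tools such as `sha256sum`).
    """
    tokens = Path(hash_path).read_text(encoding="utf-8").split()
    return bool(tokens) and sha256_of_file(pdf_path) == tokens[0].lower()
```

A matching hash only shows that the file is byte-identical to the one that was hashed; it says nothing about who produced it, which is what digital signatures address.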
class KhmerPDFValidator:
    # Relies on helper functions assumed to be defined elsewhere:
    # ocr_khmer_pdf, extract_khmer_from_pdf, validate_khmer_text, segment_khmer_words.

    def __init__(self, pdf_path, use_ocr=False):
        self.pdf_path = pdf_path
        self.use_ocr = use_ocr
        self.raw_text = ""
        self.verified_text = ""

    def extract(self):
        # Choose OCR for scanned PDFs, direct extraction otherwise
        if self.use_ocr:
            self.raw_text = ocr_khmer_pdf(self.pdf_path)
        else:
            self.raw_text = extract_khmer_from_pdf(self.pdf_path)
        return self

    def verify(self):
        validation = validate_khmer_text(self.raw_text)
        if validation['has_isolated_diacritics']:
            # Attempt repair: normalize and filter
            self.verified_text = validation['normalized_text']
        else:
            self.verified_text = self.raw_text
        return self

    def segment(self):
        return segment_khmer_words(self.verified_text)

    def report(self):
        return {
            'original_length': len(self.raw_text),
            'verified_length': len(self.verified_text),
            'valid_khmer_ratio': len([c for c in self.verified_text if '\u1780' <= c <= '\u17FF']) / len(self.verified_text) if self.verified_text else 0,
        }
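
The class above calls `validate_khmer_text`, which is not defined in this section. The following is a hypothetical sketch of what such a helper might look like, using only the standard library. It assumes that an "isolated diacritic" means a Khmer combining mark (vowel sign, COENG, or other sign) that does not follow a Khmer letter or another Khmer mark, and that "repair" means dropping those marks after NFC normalization. The extra `isolated_count` key is an addition for illustration.

```python
import unicodedata

def is_khmer(ch):
    """True if ch lies in the Khmer Unicode block (U+1780 to U+17FF)."""
    return '\u1780' <= ch <= '\u17ff'

def validate_khmer_text(text):
    """Flag and drop Khmer combining marks that have no base to attach to."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    isolated = 0
    for ch in text:
        is_mark = is_khmer(ch) and unicodedata.category(ch) in ("Mn", "Mc")
        # A mark is acceptable only after a Khmer letter or another Khmer mark.
        prev_ok = (
            bool(kept)
            and is_khmer(kept[-1])
            and unicodedata.category(kept[-1]) in ("Lo", "Mn", "Mc")
        )
        if is_mark and not prev_ok:
            isolated += 1
            continue  # drop the stray mark
        kept.append(ch)
    return {
        "has_isolated_diacritics": isolated > 0,
        "isolated_count": isolated,
        "normalized_text": "".join(kept),
    }
```

This check is deliberately coarse: it does not validate full Khmer syllable structure, so a production validator would likely need stricter rules.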
Example (using reportlab + reportlab.pdfbase.ttfonts):
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
font_path = "NotoSansKhmer-Regular.ttf"
pdfmetrics.registerFont(TTFont("NotoKhmer", font_path))
c = canvas.Canvas("khmer_sample.pdf")
c.setFont("NotoKhmer", 14)
c.drawString(72, 750, "សួស្តី ពិភពលោក") # "Hello world" in Khmer
c.save()
Alternative: fpdf2 supports TTF embedding similarly.
Verification status: ✅ Verified (requires Khmer trained data)
If your PDF is a scanned image of Khmer text, you need OCR. The verified combination is pdf2image + pytesseract with the Khmer language pack.
Installation:
sudo apt-get install tesseract-ocr tesseract-ocr-khm poppler-utils  # poppler-utils is required by pdf2image
pip install pdf2image pytesseract
Verified code:
from pdf2image import convert_from_path
import pytesseract
# Render each PDF page to an image at 300 DPI
pages = convert_from_path('scanned_khmer_document.pdf', dpi=300)
for i, page in enumerate(pages):
    # Use 'khm' to select the Khmer language model
    text = pytesseract.image_to_string(page, lang='khm')
    print(f"Page {i + 1} verified text:\n{text}")
Working with Khmer script in PDF files using Python presents unique challenges due to complex Unicode shaping and font rendering. This holds whether you are building an automated verification system or an OCR pipeline.

### 1. The Core Challenge: Khmer Script in PDFs
Khmer is a "complex" script. Unlike Latin characters, Khmer involves vowel signs and subscripts that must be "shaped" (repositioned and reordered) by a rendering engine to look correct. Standard PDF libraries often fail to read or write these characters properly because they treat them as individual, static glyphs rather than a cohesive linguistic unit.

### 2. Best Tools for Extracting Khmer Text
To verify the content of a Khmer PDF, you first need to reliably extract it. Depending on whether the PDF is "searchable" (digital) or "scanned" (images), you have two main paths:

#### For Searchable Digital PDFs
* **fpdf2**: A PDF generation library rather than an extraction tool, recommended for writing Khmer because it supports a shaping engine (HarfBuzz). To ensure subscripts and vowels are handled correctly, you must explicitly enable shaping and set the script and language:

```python
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
```

* **PyMuPDF**: Widely considered one of the fastest and most accurate open-source libraries for text extraction. It preserves document structure better than many alternatives.

#### For Scanned PDFs (OCR)
If the PDF contains images of text, you must use Optical Character Recognition (OCR):

* **Tesseract OCR**: You can install the Khmer-specific language pack (tesseract-ocr-khm) and use the pytesseract wrapper to extract text.
* **NextOCR**: A specialized tool by Khmer font expert Danh Hong that offers high-accuracy extraction in just a few lines of code.

### 3. Verifying Document Integrity
"Verification" typically refers to two things: ensuring the file is a valid PDF and checking digital signatures.

#### Checking File Validity
Before processing, verify that the file is not corrupted or merely a renamed extension. You can use the file command via subprocess to check the MIME type:
```python
from subprocess import Popen, PIPE

# Pipe the first 1 KiB of the file to `file` and read back its MIME type
with open("file.pdf", "rb") as f:
    header = f.read(1024)
filetype = Popen("/usr/bin/file -b --mime -", shell=True, stdout=PIPE, stdin=PIPE).communicate(header)[0]
```

#### Verifying Digital Signatures

To verify that a signed Khmer document hasn't been altered:

* **[pyHanko](https://pyhanko.readthedocs.io/en/latest/cli-guide/validation.html)**: A robust library for validating PDF signatures. It can provide a "pretty-print" status report of a signature's validity.
* **[pypdf](https://github.com/py-pdf/pypdf/discussions/2678)**: Useful for quickly detecting if a PDF has been digitally signed at all by checking the `/Root` and `/AcroForm` flags.

### 4. Advanced NLP Verification

If your goal is to verify the *linguistic* correctness of extracted Khmer text (e.g., checking for typos or proper word breaks), you should integrate:

* **[khmer-nltk](https://medium.com/data-science/khmer-natural-language-processing-in-python-c770afb84784)**: Excellent for word segmentation and part-of-speech tagging.
* **[PyKhmerNLP](https://pypi.org/project/pykhmernlp/)**: Provides modules for dictionary lookups and address processing to help validate the actual data you've extracted.
To produce a verified PDF with Khmer text using Python, you should use libraries that support Unicode and TrueType Fonts (TTF), as standard PDF generators often fail to render Khmer script correctly without specific font embedding.

#### Recommended Approach

* **Library**: Use ReportLab or FPDF2. These are industry standards for document generation.
* **Font**: You must use a Khmer-compatible font (e.g., Khmer OS, Hanuman, or Kantumruy).
* **Verification**: Digital signing for "verified" status can be handled by libraries like pyHanko or Endesive.

#### Sample Code (FPDF2)
This script demonstrates how to embed a Khmer font to ensure the text renders correctly:
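
A minimal sketch of such a script, assuming a recent fpdf2 release with text shaping support, the optional `uharfbuzz` package, and a local copy of a Khmer TTF file. The function name `build_khmer_pdf`, the font family label, and the file paths are illustrative assumptions.

```python
def build_khmer_pdf(text, font_path, out_path):
    """Write `text` to a one-page PDF using an embedded Khmer TTF font."""
    # Imported here so the module still loads when fpdf2 is not installed.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    # Register and select the TTF font; "KhmerFont" is an arbitrary family label.
    pdf.add_font("KhmerFont", fname=font_path)
    pdf.set_font("KhmerFont", size=14)
    # Enable HarfBuzz shaping so subscripts and vowel signs are positioned correctly.
    pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
    pdf.cell(0, 10, text)
    pdf.output(out_path)

if __name__ == "__main__":
    # Assumed file names; adjust to the font actually available on your system.
    build_khmer_pdf("សួស្តី ពិភពលោក", "NotoSansKhmer-Regular.ttf", "khmer_fpdf2.pdf")
```

Running the extraction check from earlier in this document on the output file is a reasonable way to confirm that the generated text layer contains Khmer code points.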
Verification status: ✅ Verified (preserves Khmer text layer)
pypdf (formerly PyPDF2) is excellent for merging, splitting, and rotating PDFs without breaking the Khmer text layer.
Verified merging example:
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for khmer_pdf in ["cover.pdf", "content_khmer.pdf", "back.pdf"]:
    reader = PdfReader(khmer_pdf)
    # Copy every page unchanged so the text layer is carried over
    for page in reader.pages:
        writer.add_page(page)

with open("merged_verified_khmer.pdf", "wb") as out_file:
    writer.write(out_file)