OCR with Phi-3-Vision: Revolutionizing Document Processing

Discover how Phi-3-Vision-128K-Instruct transforms document extraction, OCR, and image understanding with AI-driven precision.

Bhavik Jikadara
7 min read · Oct 13, 2024

--

In today’s rapidly advancing world of artificial intelligence (AI), multimodal models are setting a new standard for processing visual and textual data together. Among these cutting-edge models, one of the most noteworthy breakthroughs is Phi-3-Vision-128K-Instruct. This state-of-the-art model is reshaping what AI can do in document processing, especially Optical Character Recognition (OCR) and broader image understanding.

As AI-powered document extraction becomes increasingly crucial for businesses, research, and automation, let’s delve into how Phi-3-Vision-128K-Instruct is set to transform the field.

What Is Phi-3-Vision-128K-Instruct?

Phi-3-Vision-128K-Instruct is part of the Phi-3 model family, known for its strong ability to process multimodal data. The model supports a context window of up to 128,000 tokens and excels at understanding visual and textual data at the same time.

With 4.2 billion parameters, the architecture combines an image encoder, a projector, and the Phi-3 Mini language model, making it lightweight yet remarkably capable. It was trained on 500 billion tokens drawn from high-quality synthetic data and carefully curated real-world datasets, and it was further refined through supervised fine-tuning and optimization to deliver accurate, reliable results across a wide range of document processing needs.

Why OCR and Document Extraction Matter

Document extraction is an important process for businesses and organizations that need to convert physical or scanned documents into machine-readable formats. Whether it’s invoice processing, PDF parsing, or legal document analysis, OCR plays a vital role in digitizing information, saving time, and eliminating manual data entry.

This is where models like Phi-3-Vision-128K come into play. With its ability to handle complex layouts such as tables, charts, and diagrams, it can drastically improve productivity in document-intensive industries.

Example Use Cases:

  1. Digitizing Physical Documents — Converting physical records, such as scanned contracts or forms, into digital formats through automation.
  2. Data Extraction from PDFs — Extracting structured data from semi-structured PDFs such as financial statements, legal forms, or research papers (see the sketch after this list).
  3. AI-Powered Analysis — Leveraging the model to not only extract text but also to perform intelligent data analysis on the information extracted from these documents.
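
As a rough sketch of the second use case, the snippet below renders the first page of a PDF to an image that can then be fed to the model introduced later in this post. It assumes the pdf2image package (with poppler installed) and a local financial_statement.pdf, neither of which is part of this article's setup.

# Hypothetical example: turn a PDF page into an image the vision model can read.
# Requires the pdf2image package (and poppler), which are not in the package list below.
from pdf2image import convert_from_path

pages = convert_from_path("financial_statement.pdf", dpi=300)  # placeholder file name
pages[0].save("statement_page_1.png")                          # first page saved as a PNG

# A prompt like this can then be sent to Phi-3-Vision using the class defined later:
prompt = "Extract every line item and its amount as a JSON array."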

Key Features of Phi-3-Vision-128K

Here’s what makes the Phi-3-Vision-128K-Instruct model stand out:

  • Extended Context Length: With a maximum context of 128,000 tokens, this model can process and understand large documents, including entire PDFs, in a single go.
  • Multimodal Understanding: The model seamlessly integrates both image and text data, making it well-suited for tasks that involve images, graphs, tables, and charts.
  • Optimized for Performance: It’s designed to run efficiently in memory-constrained environments, making it practical for real-world applications, even where computational resources are limited.
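
To illustrate the memory-constrained point, one common option is to load the weights in 4-bit precision. This is only a sketch: it assumes the optional bitsandbytes package, which is not mentioned in this article, and that 4-bit loading works for this particular model.

# Optional sketch: 4-bit loading to reduce GPU memory use (assumes bitsandbytes is installed)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    quantization_config=quant_config,  # quantize the weights to 4-bit at load time
    device_map="auto",
    trust_remote_code=True,
)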

Setting Up Phi-3-Vision-128K-Instruct

If you’re interested in using this powerful model, here’s how you can set up your environment:

# Required Packages
torch
torchvision
torchaudio
flash_attn
numpy
Pillow
Requests
transformers

Once you’ve installed the required packages, you can initialize the model using Python. Here’s a simple code snippet to load the model and run inferences:

from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor


class Phi3VisionModel:
    def __init__(self, model_id="microsoft/Phi-3-vision-128k-instruct", device="cuda"):
        """
        Initialize the Phi3VisionModel with the specified model ID and device.

        Args:
            model_id (str): The identifier of the pre-trained model on Hugging Face's model hub.
            device (str): The device to place input tensors on ("cuda" for GPU or "cpu").
        """
        self.model_id = model_id
        self.device = device
        self.model = self.load_model()          # Load the model during initialization
        self.processor = self.load_processor()  # Load the processor during initialization

    def load_model(self):
        """
        Load the pre-trained language model with causal language modeling capabilities.

        Returns:
            model (AutoModelForCausalLM): The loaded model.
        """
        print("Loading model...")
        # device_map="auto" already places the weights, so no extra .to() call is needed
        return AutoModelForCausalLM.from_pretrained(
            self.model_id,
            device_map="auto",                         # Automatically map the model to the available device(s)
            torch_dtype="auto",                        # Pick an appropriate torch data type for the device
            trust_remote_code=True,                    # Allow execution of the model's custom loading code
            _attn_implementation="flash_attention_2"   # Use the optimized attention implementation
        )

    def load_processor(self):
        """
        Load the processor associated with the model for preparing inputs and decoding outputs.

        Returns:
            processor (AutoProcessor): The loaded processor for handling text and images.
        """
        print("Loading processor...")
        # trust_remote_code=True is required for the model's custom processing logic
        return AutoProcessor.from_pretrained(self.model_id, trust_remote_code=True)

    def predict(self, image_url, prompt):
        """
        Perform a prediction using the model given an image and a prompt.

        Args:
            image_url (str): The URL of the image to be processed.
            prompt (str): The textual prompt that guides the model's generation.

        Returns:
            response (str): The generated response from the model.
        """
        # Load the image from the provided URL
        image = Image.open(requests.get(image_url, stream=True).raw)

        # Format the input using the model's chat template
        prompt_template = f"<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n"

        # Process the inputs, converting the prompt and image into tensors
        inputs = self.processor(prompt_template, [image], return_tensors="pt").to(self.device)

        # Set generation arguments for the model's response
        generation_args = {
            "max_new_tokens": 500,   # Maximum number of tokens to generate
            "temperature": 0.7,      # Sampling temperature (ignored while do_sample is False)
            "do_sample": False       # Greedy decoding for deterministic output
        }
        print("Generating response...")
        # Generate the output IDs, then drop the input prompt tokens from the result
        output_ids = self.model.generate(**inputs, **generation_args)
        output_ids = output_ids[:, inputs["input_ids"].shape[1]:]

        # Decode the generated tokens to obtain the response text
        response = self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        return response

# Initialize the model
phi_model = Phi3VisionModel()

# Example prediction
image_url = "https://example.com/sample_image.png"  # URL of the sample image
prompt = "Extract the data in json format."          # Prompt guiding the model's output

# Get the response from the model
response = phi_model.predict(image_url, prompt)

print("Response:", response)  # Print the generated response

Performance on Real-World Tasks

In tests conducted with scanned ID cards, passports, and other semi-structured documents, Phi-3-Vision-128K-Instruct showcased impressive OCR capabilities. It was able to accurately extract not only the text but also structured data such as names, dates, and passport numbers.
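
The exact prompts used for the examples below are not shown in the article, but a prompt that explicitly names the desired fields (purely illustrative, with a placeholder image URL) tends to steer the model toward structured output:

# Illustrative only: explicitly name the fields you want back as JSON
prompt = (
    "Extract the following fields from this document and return them as JSON: "
    "surname, given names, nationality, date of birth, passport number, "
    "date of issue, date of expiry."
)
response = phi_model.predict("https://example.com/passport_scan.png", prompt)  # placeholder URL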

Input (Image 1): A fictional Utopian passport with detailed text, including personal information such as name, nationality, place of birth, date of issue, and expiration date. The text is slightly stylized, and there is a machine-readable zone at the bottom. The image quality is high, with no significant background noise.

Output:


{
  "Type/Type": "PP",
  "Issuing Country/Pays émetteur": "CAN",
  "Passport No./N° de passeport": "P123456AA",
  "Surname/Nom": "MARTIN",
  "Given names/Prénoms": "SARAH",
  "Nationality/Nationalité": "CANADIAN/CANADIENNE",
  "Date of birth/Date de naissance": "01 AUGUAUT 1990",
  "Place of birth/Lieu de naissance": "OTTAWA CAN",
  "Sex/Sexe": "F",
  "Date of issue/Date de délivrance": "14 JANJAN 2023",
  "Date of expiry/Date d'expiration": "14 JANJAN 2033",
  "Authority/Autorité": "GATINEAU",
}

Input (Image 2): A Dutch passport with a clear image of the holder and neatly formatted text. Fields include the passport number, name, date of birth, nationality, and expiration date. The document is presented with high contrast, making text extraction relatively straightforward. The machine-readable zone (MRZ) at the bottom offers a structured data format that can help validate the accuracy of extracted information.

Output:

{
  "passport": {
    "issuingCountry": "Netherlands",
    "issuingAuthority": "Koninkrijk der Nederlanden",
    "passportNumber": "SPEC12014",
    "issuingDate": "09 MAR 2014",
    "expiryDate": "09 MAR 2024",
    "holder": {
      "gender": "F",
      "nationality": "Netherlands",
      "placeOfBirth": "SPECIMEN",
      "sex": "WF",
      "firstNames": [
        "Willem",
        "Lieselotte"
      ]
    },
    "physicalDescription": {
      "height": "1.75 m",
      "hairColor": "gray",
      "hairLength": "short"
    },
    "issuingOffice": "Burg. van Stad en Dorp",
    "issuingDateAsInt": "14032014",
    "expiryDateAsInt": "14032024",
    "fieldsExtracted": [
      {
        "code": "NL",
        "dateOfBirth": "10 MAR 1965",
        "dateOfIssue": "09 MAR 2014",
        "dateOfExpiry": "09 MAR 2024",
        "firstNames": [
          "Willem",
          "Lieselotte"
        ],
        "nationality": "Netherlands",
        "passportNumber": "SPEC12014",
        "placeOfBirth": "SPECIMEN",
        "sex": "WF"
      }
    ]
  }
}

OCR Performance Benchmarks

Phi-3-Vision-128K has been evaluated against multiple benchmarks, including AI2D, ChartQA, and ScienceQA, achieving stellar results:

  • 81.4% accuracy on ChartQA (interpreting chart data)
  • 76.7% on AI2D (answering questions about science diagrams)

These numbers demonstrate its potential to handle complex visual and textual data simultaneously, outperforming many existing models in multimodal comprehension.

Responsible AI and Future Development

While Phi-3-Vision-128K-Instruct is a powerful model, developers need to be mindful of potential biases and limitations. Models like this may sometimes reinforce stereotypes or generate inaccurate data, especially when handling sensitive or high-stakes tasks, such as in legal or medical document processing.

For these applications, it is recommended to implement additional verification layers and content filtering to ensure the safety and accuracy of the generated outputs.
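
One lightweight way to add such a verification layer is to sanity-check required fields and date formats before trusting an extraction. The sketch below is not from the original article, and the field names are purely illustrative:

import re

REQUIRED_FIELDS = ["passportNumber", "expiryDate"]    # illustrative field names
DATE_PATTERN = re.compile(r"^\d{2} [A-Z]{3} \d{4}$")  # e.g. "09 MAR 2014"

def verify_extraction(record):
    """Return a list of problems found in an extracted record; an empty list means it passed."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in record]
    for field in ("issuingDate", "expiryDate"):
        value = record.get(field)
        if value is not None and not DATE_PATTERN.match(value):
            problems.append(f"suspicious date in {field}: {value}")
    return problems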

Conclusion

Phi-3-Vision-128K-Instruct represents a significant advancement in the AI-powered OCR and document processing field. Its capability to handle multimodal data at scale, along with its extensive training and state-of-the-art architecture, makes it a promising tool for businesses, researchers, and developers.

The future of document extraction and analysis is here, and it’s driven by AI innovations, such as Phi-3-Vision-128K.

If you are interested in further exploring this model, you can directly try Phi-3-Vision-128K-Instruct via Azure AI.

For additional reading on large language models and AI, check out this detailed guide on Mastering LLMs.

Colab Link:
