
InternVL3: The Strongest Open-Source OCR Large Model, Handling Handwriting, Blurry PDFs, and Table Recognition with Ease!

InternVL3 is an advanced series of multimodal large language models (MLLMs) that demonstrates excellent overall performance. Compared with InternVL 2.5, InternVL3 shows stronger multimodal perception and reasoning, and further extends its multimodal capabilities to tool use, GUI agents, industrial image analysis, 3D vision perception, and more. In addition, InternVL3 is compared with the Qwen2.5 Chat models, whose corresponding pre-trained base models are used to initialize the language component of InternVL3. Thanks to native multimodal pre-training, the InternVL3 series even outperforms the Qwen2.5 series in overall text performance.

(image)

Overview of the InternVL3 series:

| Model Name | Vision Part | Language Part | HF Link |
| --- | --- | --- | --- |
| InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | link |
| InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | link |
| InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | link |
| InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | link |
| InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | link |
| InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | link |

(image)

Model Architecture:

InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this version, a newly and incrementally pre-trained InternViT is combined with various pre-trained language models, including InternLM 3 and Qwen 2.5, through a randomly initialized MLP projector.

(image)
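To make the "ViT-MLP-LLM" wiring concrete, below is a minimal, self-contained PyTorch sketch using toy modules: a stand-in vision encoder, an MLP projector, and a small transformer acting as the LLM. The module sizes and the simple prepending of visual tokens to text tokens are illustrative assumptions; the actual InternVL3 implementation ships as remote code with the checkpoints.

```python
# Toy "ViT-MLP-LLM" sketch: not the official InternVL3 code, just the idea.
import torch
import torch.nn as nn

class ToyViTMLPLLM(nn.Module):
    def __init__(self, c_vit=64, c_llm=128, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 16 * 16, c_vit)   # stand-in for InternViT patch features
        self.projector = nn.Sequential(nn.Linear(c_vit, c_llm), nn.GELU(), nn.Linear(c_llm, c_llm))
        self.text_embed = nn.Embedding(vocab, c_llm)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(c_llm, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(c_llm, vocab)

    def forward(self, patches, text_ids):
        vis = self.projector(self.vision_encoder(patches))   # visual tokens projected into LLM space
        txt = self.text_embed(text_ids)                       # text token embeddings
        seq = torch.cat([vis, txt], dim=1)                    # visual tokens prepended to the text sequence
        return self.lm_head(self.llm(seq))

model = ToyViTMLPLLM()
logits = model(torch.randn(1, 256, 3 * 16 * 16), torch.randint(0, 1000, (1, 32)))
print(logits.shape)  # torch.Size([1, 288, 1000])
```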

As in the previous version, a pixel unshuffle operation is applied to reduce the number of visual tokens to one quarter. In addition, a dynamic resolution strategy similar to InternVL 1.5 is adopted, dividing images into tiles of 448×448 pixels. A key difference, introduced in InternVL 2.0, is the added support for multi-image and video data.

Notably, InternVL3 integrates Variable Visual Position Encoding (V2PE), which uses smaller, more flexible position increments for visual tokens. Thanks to V2PE, InternVL3 exhibits better long-context understanding than its predecessors.
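To illustrate the token-reduction step, the sketch below applies a generic pixel unshuffle to a grid of ViT patch features, folding each 2×2 neighbourhood into the channel dimension so the visual token count drops to one quarter. The shapes are made up for the example; the real InternVL code differs in detail.

```python
# Generic pixel-unshuffle sketch: 32x32 patch tokens -> 16x16 tokens with 4x channels.
import torch
import torch.nn.functional as F

vit_tokens = torch.randn(1, 1024, 1024)                     # (batch, 32*32 patch tokens, channels)
h = w = 32
x = vit_tokens.reshape(1, h, w, -1).permute(0, 3, 1, 2)     # (B, C, H, W)
x = F.pixel_unshuffle(x, downscale_factor=2)                # (B, 4C, H/2, W/2)
merged = x.flatten(2).permute(0, 2, 1)                      # (B, 256 tokens, 4C)
print(merged.shape)  # torch.Size([1, 256, 4096]) -> 1/4 of the original token count
```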

Training Strategy:

A native multimodal pre-training approach is proposed that consolidates language and vision learning into a single pre-training stage. Instead of the standard paradigm of first training a language-only model and then adapting it to handle other modalities, multimodal data (e.g., image-text, video-text, or interleaved image-text sequences) is interleaved with large-scale text corpora. This unified training scheme allows the model to learn linguistic and multimodal representations simultaneously, ultimately enhancing its ability to handle vision-language tasks without the need for separate alignment or bridging modules.
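As a toy illustration of such interleaving, the sketch below draws image-text, video-text, and pure-text samples from one mixed stream so that all of them are trained with the same next-token objective. The sampling weights and sample formats are invented for the example and are not InternVL3's actual data mixture.

```python
# Toy interleaved data stream for native multimodal pre-training (illustrative only).
import random

def mixed_stream(image_text_ds, video_text_ds, text_ds, weights=(0.4, 0.1, 0.5)):
    sources = [image_text_ds, video_text_ds, text_ds]
    while True:
        ds = random.choices(sources, weights=weights, k=1)[0]
        yield random.choice(ds)   # every sample is trained with next-token prediction

# Example usage with placeholder samples:
stream = mixed_stream(
    image_text_ds=[{"image": "img.jpg", "text": "an image caption"}],
    video_text_ds=[{"video": "clip.mp4", "text": "a video description"}],
    text_ds=[{"text": "a plain text paragraph"}],
)
print(next(stream))
```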

Supervised Fine-Tuning:

In this stage, the random JPEG compression, square loss re-weighting, and multimodal data packing techniques proposed for InternVL 2.5 are also applied to the InternVL3 series. Compared with InternVL 2.5, the main advance of the SFT stage in InternVL3 is the use of higher-quality and more diverse training data. Specifically, training samples for tool use, 3D scene understanding, GUI operation, long-context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning are further expanded.
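One of the reused techniques, random JPEG compression, can be sketched as a simple data-loading augmentation. The probability and quality range below are assumptions chosen for illustration rather than the exact values used in training.

```python
# Random JPEG compression augmentation sketch (parameters are illustrative assumptions).
import io
import random
from PIL import Image

def random_jpeg_compress(img: Image.Image, p: float = 0.5, quality_range=(75, 100)) -> Image.Image:
    if random.random() > p:
        return img
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(*quality_range))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```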

Mixed Preference Optimization:

During pre-training and fine-tuning, the model is trained to predict the next token conditioned on previous ground-truth tokens. During inference, however, it predicts each token conditioned on its own previous outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift that can impair the model's chain-of-thought (CoT) reasoning ability. To mitigate this, MPO (Mixed Preference Optimization) is adopted, which introduces additional supervision from positive and negative samples to align the model's response distribution with the ground-truth distribution, thereby improving reasoning performance.
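A hedged sketch of what such an objective can look like is given below: a DPO-style preference term over chosen/rejected response pairs, a BCO-style quality term on individual responses, and a plain generation (SFT) term, combined with weights. The weights and the exact reward parameterization are illustrative assumptions, not the published MPO configuration.

```python
# Illustrative MPO-style loss sketch (weights and reward parameterization are assumptions).
import torch
import torch.nn.functional as F

def mpo_loss(logp_chosen, logp_rejected,          # policy log-probs of chosen/rejected responses
             ref_logp_chosen, ref_logp_rejected,  # reference-model log-probs
             sft_nll,                             # negative log-likelihood of the chosen response
             beta=0.1, w_pref=0.8, w_qual=0.2, w_gen=1.0):
    # Preference term (DPO-style): prefer the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    l_pref = -F.logsigmoid(margin)
    # Quality term (BCO-style): push chosen rewards up and rejected rewards down individually.
    r_chosen = beta * (logp_chosen - ref_logp_chosen)
    r_rejected = beta * (logp_rejected - ref_logp_rejected)
    l_qual = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)
    # Generation term: ordinary next-token NLL on the chosen response.
    return w_pref * l_pref + w_qual * l_qual + w_gen * sft_nll

loss = mpo_loss(torch.tensor(-5.0), torch.tensor(-9.0),
                torch.tensor(-6.0), torch.tensor(-8.0), sft_nll=torch.tensor(2.3))
print(loss)
```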

Test-Time Scaling:

Test-time scaling has been shown to be an effective way to enhance the reasoning abilities of LLMs and MLLMs. In this work, a Best-of-N evaluation strategy is used, with VisualPRM-8B serving as the critic model to select the best response for reasoning and mathematics evaluation.
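Best-of-N itself is easy to express: sample N candidate responses and keep the one the critic scores highest. In the sketch below the generate and score callables are placeholders; in InternVL3's evaluation the scorer role is played by VisualPRM-8B.

```python
# Minimal Best-of-N selection sketch with placeholder callables.
import random

def best_of_n(generate, score, n=8):
    # generate() returns one sampled candidate response;
    # score(candidate) returns a scalar judgement (e.g. from a critic such as VisualPRM-8B).
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with stand-in callables; a real setup would sample InternVL3 outputs
# with do_sample=True and score them with the critic model.
print(best_of_n(lambda: random.random(), lambda c: c, n=4))
```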

Multimodal Capability Evaluation - Multimodal Reasoning and Mathematics:

(image)

OCR, Chart, and Document Understanding:

(image)

Test Cases:

Handwriting:

(image)

Receipt content extraction:

(image)

Extraction result:

(image)

Blurry PDF recognition:

(image)
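For reference, test cases like these can be reproduced with the chat API introduced in the next section. The sketch below assumes that `model`, `tokenizer`, and `load_image` have already been set up as in the code that follows; the file name and prompt are placeholders.

```python
# Hypothetical OCR-style query; `model`, `tokenizer` and `load_image` come from the
# "Using the model" section below, and './examples/receipt.jpg' is a placeholder path.
pixel_values = load_image('./examples/receipt.jpg', max_num=12).to(torch.bfloat16).cuda()
question = ('<image>\nExtract all printed and handwritten text from this image, '
            'and return any tables as a markdown table.')
response = model.chat(tokenizer, pixel_values, question, dict(max_new_tokens=1024, do_sample=False))
print(response)
```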

Using the model:

Model loading:

1. 16-bit (bf16 / fp16)

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL3-14B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
```

2. BNB 8-bit quantization

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "OpenGVLab/InternVL3-14B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()
```

Multiple GPUs:

```python
import math
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModel

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "OpenGVLab/InternVL3-14B"
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
```

Inference with Transformers:

```python
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoConfig, AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

# InternVL3-14B fits on a single 80GB GPU in bf16; the device_map from split_model()
# is mainly useful for sharding larger checkpoints (e.g. 38B/78B) across multiple GPUs.
path = 'OpenGVLab/InternVL3-14B'
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=False,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())
    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```

Streaming output:

```python
from transformers import TextIteratorStreamer
from threading import Thread

# Initialize the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
# Define the generation configuration
generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
# Start the model chat in a separate thread
thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config,
))
thread.start()

# Initialize an empty string to store the generated text
generated_text = ''
# Loop through the streamer to get the new text as it is generated
for new_text in streamer:
    if new_text == model.conv_template.sep:
        break
    generated_text += new_text
    print(new_text, end='', flush=True)  # Print each new chunk of generated text on the same line
```

Multi-image inference:

When working with multiple images, you can put them all into one list. Note that multiple images increase the number of input tokens, so the context window (session_len) usually needs to be enlarged. The example below serves the model with LMDeploy's pipeline API.

```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL3-14B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=16384, tp=1),
                chat_template_config=ChatTemplateConfig(model_name='internvl2_5'))

image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]

images = [load_image(img_url) for img_url in image_urls]
# Numbering images improves multi-image conversations
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```

Open-source model page: https://huggingface.co/OpenGVLab/InternVL3-14B#ocr-chart-and-document-understanding
