Fine-tuning LLaMA 7B

Technical documentation

I. The Self-Instruct approach

Method overview:

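Stanford Alpaca is a method proposed by a Stanford team for fine-tuning the LLaMA 7B model on instruction data produced with the Self-Instruct approach; the variant described here combines Chinese and English corpora. By collecting and constructing a high-quality instruction-response dataset, the fine-tuned LLaMA 7B performs roughly on par with GPT-3.5 (text-davinci-003) on single-turn instruction-following tasks in the Stanford team's preliminary evaluation.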

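Dataset construction workflow

Data source: mainly the 52K English instruction records released by the Stanford team (the Alpaca dataset), with part of the data translated into Chinese and reworded.

Data format: each record is a dictionary with the following fields:

instruction: the task description, stating what the model should do (each instruction is unique).

input: the task context or input content (optional).

output: the reference answer generated by the GPT-3.5 API (text-davinci-003).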
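As a small illustration of the record format described above, the following sketch checks that each record carries the three fields as strings and reports repeated instructions. The sample records and the helper names validate_records and find_duplicate_instructions are hypothetical, written for this example; they are not part of the Alpaca repository.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_records(records):
    """Return the records that have all three string fields and a non-empty instruction."""
    valid = []
    for record in records:
        if not all(isinstance(record.get(field), str) for field in REQUIRED_FIELDS):
            continue
        if not record["instruction"].strip():
            continue
        valid.append(record)
    return valid


def find_duplicate_instructions(records):
    """Return instructions that occur more than once (they are expected to be unique)."""
    seen, duplicates = set(), set()
    for record in records:
        key = record["instruction"].strip()
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)


# Hypothetical sample records, written for illustration only.
sample = json.loads("""
[
  {"instruction": "Translate the sentence into French.", "input": "Good morning.", "output": "Bonjour."},
  {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."},
  {"instruction": "", "input": "", "output": "missing instruction"}
]
""")

valid_records = validate_records(sample)
print(len(valid_records))
print(find_duplicate_instructions(valid_records))
```

An empty input string is treated as valid here, since the input field is optional; only a missing field or an empty instruction causes a record to be dropped.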

Dataset URL:

https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json

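Implementation steps:

1. Manually design 175 tasks, each with a corresponding {instruction, input, output} instance or {instruction, output} instance, and use these 175 tasks as the seed set.
2. Call a model API or a local model to generate instructions, using the seed set as in-context examples to produce more instructions.
3. Determine whether each generated instruction is a classification task.
4. Use the model to generate instances.
5. Filter the model-generated {instruction, input, output} data, dropping items that are low quality or too similar to existing ones.
6. Add the filtered and post-processed data to the seed pool, then repeat steps 2-6 until the pool holds enough data.

The main data-generation program is as follows: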
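The loop in steps 2, 5, and 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Alpaca implementation: the similarity score is a simplified pure-Python LCS-based F1 standing in for a ROUGE-L scorer, the 0.7 threshold is an assumed value, and fake_generate is a hypothetical placeholder for the model call. Steps 3 and 4 (classification check and instance generation) are left out.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 over whitespace tokens (a simplified stand-in for ROUGE-L)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(candidates, pool, threshold=0.7):
    """Keep candidates whose similarity to every known instruction is <= threshold."""
    kept = []
    for cand in candidates:
        existing = pool + kept
        if any(rouge_l_f1(cand, old) > threshold for old in existing):
            continue  # too similar to something already in the pool
        kept.append(cand)
    return kept


def bootstrap(seed_pool, generate_fn, rounds=2, num_examples=3, rng=None):
    """Sample in-context examples, generate, filter, and grow the pool."""
    rng = rng or random.Random(0)
    pool = list(seed_pool)
    for _ in range(rounds):
        examples = rng.sample(pool, min(num_examples, len(pool)))
        candidates = generate_fn(examples)
        pool.extend(filter_new_instructions(candidates, pool))
    return pool


def fake_generate(examples):
    """Hypothetical stand-in for the model call; returns canned instructions."""
    return [
        "Translate the sentence into French.",  # duplicates a seed instruction
        "Write a haiku about autumn rain.",
    ]


seeds = ["Translate the sentence into French.", "Name three primary colors."]
pool = bootstrap(seeds, fake_generate)
print(pool)
```

Comparing each candidate against both the existing pool and the candidates already kept in the same batch prevents near-duplicates from entering together in one round.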

"""
batch_selfinstruct_generate.py

run:
python -m generate_instruction generate_instruction_following_data \
  --output_dir ./ \
  --num_instructions_to_generate 10 \
  --model_name="text-davinci-003" \
"""
import time
import json
import os
import random
import re
import string
from functools import partial  # partial function application
from multiprocessing import Pool  # multiprocessing pool

import numpy as np
import tqdm
from rouge_score import rouge_scorer  # ROUGE scorer, used to compute text similarity
import utils
import fire  # command-line argument parsing


# Encode multiple prompt instructions into a single string
def encode_prompt(prompt_instructions):
    """Encode multiple prompt instructions into a single string."""
    prompt = open("./prompt.txt").read() + "\n"

    for idx, task_dict in enumerate(prompt_instructions):
        (instruction, input, output) = task_dict["instruction"], task_dict["input"], task_dict["output"]
        instruction = re.sub(r"\s+", " ", instruction).strip().rstrip(":")
        # the upstream Alpaca script uses the "<noinput>" placeholder for empty inputs
        input = "<noinput>" if input.lower() == "" else input
        prompt += f"###\n"
        prompt += f"{idx + 1}. Instruction: {instruction}\n"
        prompt += f"{idx + 1}. Input:\n{input}\n"
        prompt += f"{idx + 1}. Output:\n{output}\n"
    prompt += f"###\n"
    prompt += f"{idx + 2}. Instruction:"
    return prompt


# Post-process the GPT-3 response and extract the newly generated instructions
def post_process_gpt3_response(num_prompt_instructions, response):
    if response is None:
        return []
    raw_instructions = f"{num_prompt_instructions+1}. Instruction:" + response["text"]
    raw_instructions = re.split("###", raw_instructions)
    instructions = []
    for idx, inst in enumerate(raw_instructions):
        # if the decoding stops due to length, the last example is likely truncated so we discard it
        if idx == len(raw_instructions) - 1 and response["finish_reason"] == "length":
            continue
        idx += num_prompt_instructions + 1
        splitted_data = re.split(f"{idx}\.\s+(Instruction|Input|Output):", in
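The listing above is cut off at the re.split call. The sketch below does not reconstruct the missing lines; it only illustrates, on a hypothetical completion string, what a split on that pattern returns. Because the pattern has a capturing group, re.split keeps the matched field names in the result, so one well-formed task yields seven pieces. The helper parse_task and its length check are assumptions made for this example.

```python
import re


def parse_task(text, idx):
    """Split one numbered task into its fields; return None if it is malformed."""
    parts = re.split(rf"{idx}\.\s+(Instruction|Input|Output):", text)
    # A well-formed task gives: leading text, then three (field name, value) pairs.
    if len(parts) != 7:
        return None
    return {parts[i].lower(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


# Hypothetical model completion containing one generated task, numbered 4.
sample_text = (
    "4. Instruction: Summarize the paragraph in one sentence.\n"
    "4. Input:\nThe quick brown fox jumps over the lazy dog.\n"
    "4. Output:\nA fox jumps over a dog.\n"
)

task = parse_task(sample_text, 4)
print(task)

# A task with a missing Output field does not split into seven pieces.
print(parse_task("4. Instruction: Say hi.\n4. Input:\n<noinput>\n", 4))
```

Using a raw f-string (rf"...") keeps the backslashes in the pattern literal and avoids invalid-escape warnings in recent Python versions.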
