Summary of Errors in Distributed Large-Model Training
I. Background
This is a summary of the problems encountered while doing reinforcement learning (DPO) on qwen72b with llama-factory.
II. Problems
1. Supported sequence lengths
(1) With batch_size = 1, ZeRO-3, and no offload, the rejected and chosen responses must each stay within 1k tokens; the prompt is supported up to at most 8k tokens.
(2) Later, after switching to a newer version of llama-factory and adding the following parameters:
--gradient_checkpointing True \
--enable_liger_kernel True \
--use_unsloth_gc True \
the prompt can be extended to 10k tokens.
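For reference, here is a minimal sketch of how these flags might be combined in a multi-node launch script. Only the entry point src/train.py, the torchrun launcher, and the three flags above come from these notes; every other flag name, path, and value is an illustrative assumption and may differ across llama-factory versions.
# start.sh -- illustrative sketch, not the exact script used here
NNODES=2                 # assumed number of nodes
NODE_RANK=${1:-0}        # pass each node's index when launching
MASTER_ADDR=10.0.0.1     # assumed address of the rank-0 node
torchrun --nnodes $NNODES --node_rank $NODE_RANK \
  --master_addr $MASTER_ADDR --master_port 29500 --nproc_per_node 8 \
  src/train.py \
  --stage dpo \
  --model_name_or_path /path/to/qwen72b \
  --dataset my_dpo_dataset \
  --per_device_train_batch_size 1 \
  --deepspeed ds_z3.json \
  --gradient_checkpointing True \
  --enable_liger_kernel True \
  --use_unsloth_gc True \
  --output_dir /path/to/output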
(3) With offload enabled (offloading to host memory), and additionally setting:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
the prompt can reach 12.5k tokens; even longer prompts should also be possible, but this still needs to be verified experimentally.
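A sketch of what the offload setup might look like. The expandable_segments export is taken from these notes; the DeepSpeed JSON below is a generic ZeRO-3 CPU-offload configuration, and the file name ds_z3_offload.json is an assumption.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # reduce CUDA allocator fragmentation
# Generic ZeRO-3 config with optimizer and parameter offload to CPU memory
cat > ds_z3_offload.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF
Note that stage3_gather_16bit_weights_on_model_save makes rank 0 gather the full weights at save time, which is a plausible contributor to the save-time out-of-memory failure described in problem 2 below.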
2. Out of memory when saving the model
Under the conditions above, if even a small number of the rejected or chosen samples are around 2k tokens long, the process runs out of memory while saving the model.
Symptom: the model is saved normally in the first epoch, but saving fails in the second epoch.
Error traceback from the log:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
src/train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-26_14:14:44
host : fortuneanna-kmaker-033148041009
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 263138)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 263138
Solutions:
a. Shuffling the data (shuf) partially alleviates the problem.
b. Restricting both rejected and chosen to within 1k tokens resolves the problem. A data-preprocessing sketch is shown below.
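A sketch of the data-side mitigation. The file and field names are assumptions, and the character-length cutoff is only a rough stand-in for a 1k-token limit that would need to be tuned against the actual tokenizer.
# Shuffle the DPO samples so that long ones are not clustered together
shuf dpo_data.jsonl -o dpo_data_shuf.jsonl
# Drop samples whose chosen or rejected text is overly long
# (jq's length counts characters, used here as a crude proxy for tokens)
jq -c 'select((.chosen | length) < 3000 and (.rejected | length) < 3000)' \
  dpo_data_shuf.jsonl > dpo_data_filtered.jsonl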
3. Backgrounding the distributed job with nohup: the program fails once the terminal window is closed
Without a tool such as pdsh, the training command has to be run manually on each node:
nohup sh start.sh > log.log 2>&1 &
which leads to the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 883, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
The command that had been executed was:
nohup sh start.sh > log.log 2>&1 &
exit
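One common workaround, offered here as a general suggestion rather than something taken from these notes, is to detach the job from the terminal completely, e.g. by redirecting stdin and disowning the job, or by launching inside a persistent tmux session so the elastic agent does not die with the login shell:
# Option 1: detach fully before closing the window
nohup sh start.sh < /dev/null > log.log 2>&1 &
disown
# Option 2: keep the job inside a persistent tmux session named "train"
tmux new-session -d -s train 'sh start.sh > log.log 2>&1'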
III. Main reasons why DPO had no effect
1. The previous approach to constructing rejected samples
Using chatgpt to fabricate the rejected answers was the wrong direction.
When the distribution of the rejected data deviates strongly from that of the chosen data, it appears to be a net negative for the model,
especially when the rejected data is not the model's own output.
2. What actually worked
The rejected answers must be generated by the model itself.
The chosen answers are edited versions of the model's own outputs. A sketch of assembling such pairs is shown below.
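A minimal sketch of assembling such preference pairs, assuming one sample per line with no embedded tabs; the file names, field names, and output schema are hypothetical and not necessarily the exact format llama-factory expects.
# prompts.txt        : one prompt per line
# model_outputs.txt  : the model's own answer to each prompt (used as rejected)
# edited_outputs.txt : a corrected edit of that same answer (used as chosen)
paste -d '\t' prompts.txt model_outputs.txt edited_outputs.txt \
  | jq -R -c 'split("\t") | {prompt: .[0], rejected: .[1], chosen: .[2]}' \
  > dpo_pairs.jsonl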