December 2024

I. Background

A summary of the problems encountered while running reinforcement learning (DPO) on Qwen-72B with LLaMA-Factory.

II. Problems

1. Supported sequence lengths

(1) With batch_size 1, ZeRO-3 and no offload, and with rejected and chosen kept within 1k tokens, the prompt can be up to 8k tokens.

(2) Later, after switching to a newer version of LLaMA-Factory and adding the following arguments:

--gradient_checkpointing True \
--enable_liger_kernel True \
--use_unsloth_gc True \

the prompt can be extended to 10k tokens.

(3) With offload to CPU memory enabled, plus:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
the prompt can reach 12.5k tokens; even longer prompts should be possible but still need to be verified experimentally (see the combined launch sketch below).
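
For reference, a minimal per-node launch sketch that combines the settings above. It assumes torchrun with src/train.py (as in the traceback further below), 8 GPUs per node and a ZeRO-3 offload DeepSpeed config; the model path, dataset name, config file and output directory are placeholders, and the exact argument set depends on your LLaMA-Factory version.

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # reduce CUDA memory fragmentation

torchrun --nproc_per_node 8 src/train.py \
--stage dpo \
--model_name_or_path /path/to/Qwen-72B \
--dataset my_dpo_dataset \
--per_device_train_batch_size 1 \
--deepspeed ds_zero3_offload.json \
--gradient_checkpointing True \
--enable_liger_kernel True \
--use_unsloth_gc True \
--output_dir /path/to/output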

2. Out of memory when saving the model

Under the configuration above, if even a small number of samples have rejected or chosen responses around 2k tokens long, the host runs out of memory while saving the model.
Symptom: the checkpoint is saved normally after the first epoch, but saving fails during the second epoch.
Error log:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
src/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-26_14:14:44
  host      : fortuneanna-kmaker-033148041009
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 263138)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 263138

Workarounds:

  a. Shuffling the data (e.g. with shuf) partially mitigates the problem.
  b. Limiting both rejected and chosen to within 1k tokens resolves it (a filtering sketch follows).
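
A rough sketch of that fix, assuming the preference data is a JSONL file with prompt, chosen and rejected string fields; adjust the field names to your dataset, and note that character count is only a crude stand-in for token count.

MAX_CHARS=1000   # placeholder: pick a value that keeps responses under ~1k tokens for your tokenizer
jq -c --argjson m "$MAX_CHARS" \
  'select((.chosen | length) <= $m and (.rejected | length) <= $m)' dpo_data.jsonl \
  | shuf > dpo_data_filtered.jsonl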

3. Distributed background launch with nohup: the job fails after the terminal window is closed

Without a tool such as pdsh, the training command has to be started on every node by hand:
nohup sh start.sh > log.log 2>&1 &

This caused the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):

Fix: add exit right after the nohup command so the session detaches cleanly before the window is closed. Reference: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/13

The command on each node then becomes (see also the per-node start.sh sketch below):
nohup sh start.sh > log.log 2>&1 &
exit
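
For context, start.sh on each node would contain something like the following; the node count, rank, master address, port and GPU count are placeholders that must be set per node and per cluster, and the training arguments are the ones shown in the earlier sketch.

# start.sh (sketch): the same script runs on every node; only NODE_RANK differs
# (0 on the node at MASTER_ADDR, 1..N-1 on the other nodes)
NNODES=2
NODE_RANK=0
MASTER_ADDR=10.0.0.1
TRAIN_ARGS="--stage dpo --gradient_checkpointing True --enable_liger_kernel True --use_unsloth_gc True"   # plus the remaining training arguments
torchrun --nnodes $NNODES --node_rank $NODE_RANK \
--master_addr $MASTER_ADDR --master_port 29500 --nproc_per_node 8 \
src/train.py $TRAIN_ARGS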

III. Main reasons why DPO showed no gains

1. The previous approach to building rejected samples
The rejected answers were generated by prompting ChatGPT toward a wrong answer, and this approach did not work.
When the distribution of the rejected data deviates strongly from that of the chosen data, the effect on the model appears to be negative,
especially because the rejected data is not the model's own output.
2. What actually works
The rejected answer must be the model's own output.
The chosen answer is an edited version of the model's own output (see the collection sketch below).
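
A minimal sketch of the first step of that recipe: sampling the model's own answers to use as rejected. It assumes the current Qwen checkpoint is served behind an OpenAI-compatible endpoint (for example with vLLM) and that prompts are stored one per line in prompts.txt; the URL, model name and file names are placeholders. Producing chosen, i.e. lightly editing each sampled answer, is a separate manual or LLM-assisted step and is not shown.

# For every prompt, sample an answer from the policy model itself; this answer
# becomes the "rejected" side of the preference pair.
while IFS= read -r prompt; do
  body=$(jq -n --arg p "$prompt" \
    '{model: "qwen-72b", messages: [{role: "user", content: $p}], temperature: 0.8}')
  answer=$(curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' -d "$body" \
    | jq -r '.choices[0].message.content')
  # chosen is left empty here and filled in later by editing this same answer
  jq -n --arg p "$prompt" --arg r "$answer" \
    '{prompt: $p, rejected: $r, chosen: ""}'
done < prompts.txt > dpo_pairs_raw.jsonl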