Q: What should I check when I encounter an error with the message "Error: exit status 1" or "Failed to deploy basic model" when deploying a multi-process compute engine?
A: There are many possible reasons for the engine failing to start. Common ones are listed below:
1. Run `ps -ef | grep redis` on the command line to check whether another redis service is already running in the system and causing a port conflict. MindSpore Pandas' redis listens on port 6379 by default. If you need to change it, edit the `redis_port` field in `mindpandas/dist_executor/modules/config/config.xml` under the MindSpore Pandas installation directory and choose a non-conflicting port.
2. Run `netstat -tunpl | grep -E "32379|32380"` on the command line to check whether the etcd ports are occupied. If there is a conflict, please release the corresponding ports.
Q: How to solve the error "**ERROR** memory for function instances deployment is less than 0" when deploying a multi-process compute engine?
A: The problem is caused by insufficient running memory. Please try decreasing the value of the `--datamem` parameter or increasing the value of the `--mem` parameter when deploying.
Q: What should I do if the error message "Failed to request, code:1001, message: invalid resource parameter, request resource is greater than each node's max resource." is reported when running a Python script using a multi-process backend?
A: This error is caused by insufficient resources configured when starting the distributed compute engine. Please use larger `--cpu` and `--mem` parameter values when deploying the cluster.
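Before choosing larger `--cpu` and `--mem` values, it can help to check what the host actually has, so the requested resources stay within the node's capacity. A minimal standard-library sketch (the helper name and MB conversion are illustrative, not MindSpore Pandas API):

```python
import os

def host_resources():
    """Return (cpu_count, total_memory_mb) for the current host."""
    cpus = os.cpu_count() or 1
    # Total physical memory = page size * number of physical pages (POSIX).
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, total_bytes // (1024 * 1024)

cpus, mem_mb = host_resources()
print(f"host has {cpus} CPUs and {mem_mb} MB memory")
```

Pick deployment values at or below what this reports.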
Q: When using a multi-process backend, what should I do if "Client number upper to the limit" is reported while running a Python script?
A: Please try to redeploy the cluster with a smaller `--cpu` parameter value.
Q: What should I do if the error message "health check failed, please check port: <port>" is reported during the deployment of a multi-process compute engine?
A: The MindSpore Pandas compute engine starts multiple processes, and each process has a corresponding port. If the ports conflict, this error is reported. Run `netstat -tunpl | grep <port>` to check the port occupancy. If there is a conflict, there are two solutions:
1. Open `dist_executor/modules/config/config.xml` in the MindSpore Pandas installation directory, search for the conflicting port number, and change it to another available port.
2. Run `ps -ef | grep mindpandas/dist_executor` to find the PID of the residual process, then manually clean it up.
Q: What should I do when the error message "failed to request, code:3003, put object failed, id:<id>,requestID:<id>,errr: code:[Out of memory]" is reported when using a multi-process backend?
A: It may be due to insufficient shared memory space in the compute engine. Please try stopping the engine, then redeploy with a larger `--datamem` parameter value.
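If out-of-memory errors persist, it can help to check how much shared memory is actually free before picking a `--datamem` value. This sketch assumes the engine's object store lives in shared memory mounted at `/dev/shm`, which is typical on Linux but not stated by the error message itself:

```python
import shutil

def shm_free_mb(path="/dev/shm"):
    """Return free space in MB at the given mount point (shared memory by default)."""
    usage = shutil.disk_usage(path)
    return usage.free // (1024 * 1024)

print(f"free shared memory: {shm_free_mb()} MB")
```

Keep `--datamem` comfortably below the reported free space.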
Q: How to solve the "RuntimeError: system not initialized" error when running on a machine with large specifications (such as more than 100 CPU cores)?
A: Data transfer in the compute engine relies on file descriptors. The number of available file descriptors must be at least four times the number of CPU cores in the cluster. You can view and raise the file descriptor limit on the current machine with the `ulimit` command:
$ ulimit -a  # "open files" is the file descriptor limit; raise it if the value is too small
open files                      (-n) 1024
$ ulimit -n 4096
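The rule of thumb above (file descriptors ≥ 4 × CPU cores) can also be checked from Python with the standard `resource` module; the 4x multiplier is taken from the answer above, not from any MindSpore Pandas API:

```python
import os
import resource

def fd_limit_ok(multiplier=4):
    """Compare the soft open-file limit against multiplier * CPU core count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = multiplier * (os.cpu_count() or 1)
    return soft >= needed, soft, needed

ok, soft, needed = fd_limit_ok()
print(f"soft limit {soft}, needed {needed}, ok={ok}")
```

If the check fails, raise the limit with `ulimit -n` as shown above before deploying.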
Q: How to solve "ImportError: /lib/libc.so.6: version 'GLIBC_2.25' not found" when using a multi-process backend?
A: Please upgrade the glibc version in the environment to 2.25 or above.
Q: How to solve the error "TypeError: cannot unpack non-iterable <class 'yr.exception.YRInvokeError'> object" when I use the `pytest` command to execute a script on a multi-process backend?
A: Due to the execution mechanism of `pytest`, if you use a user-defined function, please make sure that any other functions it calls are Python closures (i.e., defined inside it).
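A small illustration of the closure requirement described above: the helper is nested inside the user-defined function so the function is self-contained when shipped to the backend, rather than referring to a separate module-level helper. The function names here are invented for the example:

```python
def process_row(row):
    # The helper is defined as a closure inside the user function,
    # so the function carries everything it needs with it.
    def double(x):
        return 2 * x

    return [double(v) for v in row]

print(process_row([1, 2, 3]))  # [2, 4, 6]
```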
Q: How to solve the "yr.exception.YRequestError: failed to request, code:3003, message: retry etcd operation Put exceed the max times" message when running with a multi-process backend?
A: The compute engine uses etcd to maintain the consistency of internal data. This error may be caused by etcd not working properly. You can use the following command to check whether the etcd process exists. If the process does not exist, you need to redeploy the compute engine.
ps -ef |grep dist_executor/modules/basic/bin/etcd/etcd
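Besides checking the process list, you can probe whether anything is still answering on etcd's default ports (32379/32380, per the deployment Q&A above). A minimal TCP probe; a refused connection suggests etcd is down and the engine should be redeployed:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for port in (32379, 32380):
    print(port, "open" if port_open("127.0.0.1", port) else "closed")
```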
Q: How to solve "RuntimeError: code: [RPC unavailable], msg [ Thread ID && RPC unavailable. Disconnected from worker . Line of code : 117 File : object_client_impl.cpp]" when running?
A: A module that uses RPC communication in the compute engine may be abnormal. Please use the following command to check the corresponding processes. If fewer than 3 processes are found, the compute engine needs to be redeployed.
ps -ef |grep dist_executor/modules/datasystem
Q: How to solve "xmllint: command not found" when deploying a multi-process compute engine?
A: Install the libxml2-utils package (for example, `apt-get install libxml2-utils` on Debian/Ubuntu) to solve this problem.
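You can confirm whether `xmllint` (or any other required command) is on the PATH before deploying, using `shutil.which` from the standard library. The helper name here is invented for the example:

```python
import shutil

def require_command(name):
    """Return the full path of a command, or raise if it is missing from PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found; please install it (e.g. libxml2-utils provides xmllint)"
        )
    return path

print(require_command("ls"))  # ls is present on any Linux system
```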