在"Too many"上运行(Linux)BCP的任何特定问题都将在运行中出现.线程? [英] Any specific problems running (linux) BCP on "too many" threads?

查看:105
本文介绍了在"Too many"上运行(Linux)BCP的任何特定问题都将在运行中出现.线程?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

运行Microsoft的BCP实用程序是否存在任何特定问题(在CentOS 7上,

Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling could not find much, but am looking at a problem that seems to be related to just that.

使用某种形式的代码将一组大型TSV文件从HDFS复制到远程MSSQL Server

Copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form

bcpexport() {
    filename=$1
    TO_SERVER_ODBCDSN=$2
    DB=$3 
    TABLE=$4 
    USER=$5
    PASSWORD=$6
    RECOMMEDED_IMPORT_MODE=$7 
    DELIMITER=$8 

    echo -e "\nRemoving header from TSV file $filename"
    echo -e "Current head:\n"
    echo $(head -n 1 $filename)
    echo "$(tail -n +2 $filename)" > $filename
    echo "First line of file is now..."    
    echo $(head -n 1 $filename)

    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below, throws error if timeout reached 
    timeout 30 sleep 5

    echo -e "\nReplacing null literal values with empty chars"
    NULL_WITH_TAB="null\t" # WARN: assumes the first field is prime-key so never null
    TAB="\t"
    sed -i -e "s/$NULL_WITH_TAB/$TAB/g" $filename
    echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" $filename)"
    
    # temp. workaround safeguard for NFS latency
    #sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below 
    timeout 30 sleep 5

    /opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
        $TO_SERVER_ODBCDSN \
        -U $USER -P $PASSWORD \
        -d $DB \
        $RECOMMEDED_IMPORT_MODE \
        -t "\t" \
        -e ${filename}.bcperror.log
}

export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
    ::: $DATAFILES/$TARGET_GLOB 

其中$DATAFILES/$TARGET_GLOB构造一个glob,该glob列出目录中的一组文件.

where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.

在为一组TSV文件运行此代码时,发现有时一些(但不是全部)并行BCP线程失败.一些文件成功复制到MSSQL Server

When running this code for a set of TSV files, finding that sometimes some (but not all) of the parallel BCP threads fail, ie. some files successfully copy to MSSQL Server

开始复制...

Starting copy...

5397376行已复制.

5397376 rows copied.

网络数据包大小(字节):4096

Network packet size (bytes): 4096

时钟时间(毫秒)总计:154902平均:(34843.8行/秒)

Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)

而其他人输出错误消息

开始复制...

Starting copy...

BCP复制失败

通常情况下,请参见此模式:在返回的前几个线程中执行了一些成功的BCP复制操作,然后一堆失败的线程返回其输出,直到用完文件为止(GNU Parallel仅在完成显示整个线程时才返回输出就像顺序的一样.

Usually, see this pattern: a few successful BCP copy-in operations in the first few threads returned, then a bunch of failing threads return their output until run out of files (GNU Parallel returns output only when whole thread done to appear same as if sequential).

在代码中请注意,有-e选项可以为每个BCP复制操作生成错误文件(请参阅 https://docs.microsoft.com/zh-cn/sql/tools/bcp-utility?view=sql-server- 2017#e ).在观察到这些失败行为后检查文件时,所有文件均为空白,没有错误消息.

Notice in the code there is the -e option to produce an error file for each BCP copy-in operation (see https://docs.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining the files after observing these failing behaviors, all are blank, no error messages.

仅在线程数> = 10时才看到这种情况(并且仅对于某些数据集(假定文件总数与文件大小有关,但是...)),没有看到错误到目前为止,使用〜7个线程时,这进一步使我怀疑这与多线程有关.

Only have seen this with the number of threads >= 10 (and only for certain sets of data (assuming has something to do with total number of files are files sizes, and yet...)), no errors seen so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.

(通过free -mh)监视系统资源表明,通常〜13GB或RAM总是可用的.

Monitoring system resources (via free -mh) shows that generally ~13GB or RAM is always available.

可能需要注意的是,尝试复制的数据bcp的长度可能约为500000-1000000个记录,每个记录的上限约为100列.

May be helpful to note that the data bcp is trying to copy-in may be ~500000-1000000 records long with an upper limit of ~100 columns per record.

有人知道这里会发生什么吗?请注意,使用BCP以及GNU并行和多线程是一个相当新的东西.

Does anyone have any idea what could be going on here? Note, am pretty new to using BCP as well as GNU Parallel and multi-threading.

推荐答案

TLDR :添加更多线程以同时运行以具有数据的bcp副本文件似乎具有使用写指令使端点MSSQL Server压倒,导致bcp线程失败(也许超时?).当线程数过多时,似乎取决于bcp复制的文件大小(即文件中的记录数以及每条记录的宽度(即数列)).

TLDR: Adding more threads to run concurrently to have bcp copy-in files of data seems to have the affect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timeing out?). When the number of threads becomes too many seems to depend on the size of the files getting copy-in'ed by bcp (ie. both the number of records in the file as well as the width of each record (ie. number of columns)).

长版(我的理论有更多原因):

1. 当运行大量bcp线程并查看计算机上启动的进程时(

1. When running a larger number of bcp threads and looking at the processes started on the machine (https://clustershell.readthedocs.io/en/latest/tools/clush.html)

ps -aux | grep bcp

看到一堆睡眠过程(请注意S,请参见 https://askubuntu.com/a/360253/760862 ),如下所示(添加了换行符以提高可读性)

seeing a bunch of sleeping processes (notice the S, see https://askubuntu.com/a/360253/760862) as shown below (added newlines for readability)

我135296 14.5 0.0 77596 6940吗? S 00:32 0:01

me 135296 14.5 0.0 77596 6940 ? S 00:32 0:01

/opt/mssql-tools/bin/bcp TABLENAME.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \ t -e/path/to/logfile

/opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile

这些线程似乎睡眠了很长时间.进一步调试这些线程为何处于休眠状态表明它们可能实际上正在完成其预期的工作(这将进一步暗示问题可能来自BCP本身(请参阅https://stackoverflow.com/a/52748660/8236733 )).来自 https://unix.stackexchange.com/a/47259/260742 https://unix.stackexchange.com/a/36200/260742 )

These threads appear to sleep for very long time. Further debugging into why these threads are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself (see https://stackoverflow.com/a/52748660/8236733)). From https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742)

处于S状态的进程通常处于阻塞的系统调用中,例如读取或写入文件或网络,或等待另一个被调用程序完成.

(例如,写入ODBCDSN中指定给bcp的MSSQL Server端点目标)

(eg. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN)

您的进程在进行读取(可能是阻塞的写入)时将处于状态.在等待信号量或其他同步原语时也可能会发生...这都是正常现象,是正常现象,通常不成问题...您不希望它在等待用户输入时浪费CPU.

Your process will be in S state when it is doing reads and possibly writes that are blocking. Can also happen while waiting on semaphores or other synchronization primitives... This is all normal and expected, and not usually a problem... you don't want it to waste CPU while it's waiting for user input.

2..当运行不同组的文件时,每个文件的记录量不同(例如,范围为500000-1000000行/文件)和每个文件的记录宽度(〜10) -100列/行),发现在数据宽度或数据量非常大的情况下,运行一组固定的bcp线程会失败.

2. When running different sets of files of varying record-amount-per-file (eg. ranges of 500000 - 1000000 rows/file) and record-width-per-file (~10 - 100 columns/row), found that in cases with either very large data width or amounts, running a fixed set of bcp threads would fail.

例如.对于一组约33个TSV,每组约500,000行,每行约100列宽,一组30个线程将写入前几个OK,但随后所有其余线程将开始返回失败消息.结合@jamie的回答,返回的失败消息是"BCP copy in failed"错误这一事实并不一定意味着它与所讨论的数据内容有关. @jamie的帖子没有将实际内容从我的进程写入-e错误日志文件中,

Eg. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few OK, but then all the rest would start returning failure messages. Incorporating a bit from @jamie's answer, the fact the the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has do do with the content of the data in question. Having no actual content being written into the -e errorlog files from my process, @jamie's post says this

关于"-e"选项,它旨在输出数据错误. 登录错误,错误的表名...使用-e选项创建的文件中未报告许多其他错误.当使用"-e"选项获得输出时,您会看到诸如值被截断"之类的信息,诸如此类...将为您提供行号和有争议的样本数据.

Regarding the "-e" option, it is meant to output data errors. login errors, bad table names... many other errros are not reported in the file created with the -e option. When you get output using the "-e" option, you'll see info like "value truncated" and such... will give you line numbers and sample data that was at issue.

同时,一组约33个TSV,每组约500000行,每行约100宽,仍使用30个bcp线程将快速完成而不会出错(减少线程数或减少线程数也将更快).文件集). 唯一的区别是bcp复制到MSSQL Server的数据的整体大小.

Meanwhile, a set of ~33 TSVs with ~500000 rows each, each row being ~100 wide, and still using 30 bcp threads would complete quickly and without error (also would be faster when reducing the number of threads or file set). The only difference here being the overall size of the data being bcp copy-in'ed to the MSSQL Server.

这一切

free -mh 

仍然显示在每种情况下,运行线程的计算机仍具有〜15GB的可用RAM(这又是为什么我怀疑问题与远程MSSQL Server端点有关,而不是与代码或本地计算机本身.

still showed that the machine running the threads still had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or local machine itself).

3..运行(2)中的某些测试时,发现手动杀死了parallel进程(通过CTL+C),然后尝试远程截断要写入的测试表.在本地计算机上使用/opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable"将花费很长时间(与手动登录MSSQL Server并在数据库中执行truncate mytable相对).再次,这使我认为这与MSSQL Server的连接过多有关/只是不知所措.

3. When running some of the tests from (2), found that manually killing the parallel process (via CTL+C) and then trying to remotely truncate the testing table being written to with /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" on the local machine would take a very long time (as opposed to manually logging into the MSSQL Server and executing a truncate mytable in the DB). Again this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.

**具有任何MSSQL Mgmt Studio经验的任何人都读过这篇文章(我基本上没有),如果您在此处看到任何使您认为我的理论不正确的信息,请告诉我您的想法.

** Anyone with any MSSQL Mgmt Studio experience reading this (I have basically none), if you see anything here that makes you think that my theory is incorrect please let me know your thoughts.

这篇关于在"Too many"上运行(Linux)BCP的任何特定问题都将在运行中出现.线程?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆