public void informFailure(T failedObject) {
//If there is no backoff this method is a no-op.
if (!shouldBackOff) {
return;
}
//将该主机暂时移除可用主机列表
...
所以解决办法:配置max back off
问题3:Flume Log4j失败重连策略异常
问题体现在,设置了max back off,重连时间居然一直是2000ms,看了一下它的算法,指数退避算法。在OrderSelector.java的informFailure函数中。
1234567891011121314151617181920212223
public void informFailure(T failedObject) {
//If there is no backoff this method is a no-op.
if (!shouldBackOff) {
return;
}
FailureState state = stateMap.get(failedObject);
long now = System.currentTimeMillis();
long delta = now - state.lastFail;
long lastBackoffLength = Math.min(maxTimeout, 1000 * (1 << state.sequentialFails));
long allowableDiff = lastBackoffLength + CONSIDER_SEQUENTIAL_RANGE;
if (allowableDiff > delta) {
if (state.sequentialFails < EXP_BACKOFF_COUNTER_LIMIT) {
state.sequentialFails++;
}
} else {
state.sequentialFails = 1;
}
state.lastFail = now;
//Depending on the number of sequential failures this component had, delay
//its restore time. Each time it fails, delay the restore by 1000 ms,
//until the maxTimeOut is reached.
state.restoreTime = now + Math.min(maxTimeout, 1000 * (1 << state.sequentialFails));
}
最后生成的restoreTime即下一次进行重试的时间。我没有去设置avro connect time out 和request time out,默认都是20s,应该算是偏长了。根据他的算法,delta永远是大于40s,但是allowableDiff却一直是3s,4s.所以我直接改了判定条件,allowableDiff < delta,之后就正常。但是还存在一个问题,sequentialFails并不会在一段时间后reset.
问题4:Log4j异步加载器丢失日志数据
AsyncAppender默认缓冲区大小128,满了之后会丢失数据。调大缓冲区,avro connect time out 和request time out也得适当调一下