Hystrix User Manual | A Translation of the Official Documentation - 白红宇

Published: 2019-03-06

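I ran short of time, so the translation may not be complete, but the important parts are essentially all here.

My own skill is limited. If you find anything translated badly, please point it out and I will fix it. Thank you. I deliberately skipped some of the material on HystrixObservableCommand, because HystrixCommand is the one used most often. Every example still explains both kinds of command.

The best way to learn is still to read the official text. The link is below.

Official documentation: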

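Overview

What is Hystrix?

In a distributed environment, some of the services you depend on will inevitably fail. Hystrix helps you control the interactions between these distributed services by adding latency tolerance and fault tolerance. It places a layer between services that isolates the points of access, stops failures from cascading across them, and provides a fallback strategy on failure, all of which improves the overall reliability and resilience of your system.

What does Hystrix do?

Hystrix is designed to do the following:

Give protection from, and control over, latency and failure in dependencies that are accessed through third-party client libraries (most typically over the network)

Stop cascading failures in a complex distributed system

Fail fast and recover rapidly

Provide a fallback and degrade gracefully when possible

Provide near real-time monitoring, alerting, and operational control

What problem does Hystrix solve?

In a complex distributed architecture, services depend on one another, and any of them will inevitably fail at some point. If the host application is not isolated from these failing dependencies, it risks being taken down with them. For example, if an application depends on 30 services and each service guarantees 99.99% uptime, you can expect:

0.9999^30 ≈ 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.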
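The arithmetic above can be checked with a few lines of Java. This is only a sketch: it assumes the 30 dependencies fail independently and that a month has 30 days, and the class and method names are made up for illustration.

```java
final class UptimeEstimate {

    /** Combined availability of n dependencies that fail independently (an assumption). */
    static double combinedAvailability(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double combined = combinedAvailability(0.9999, 30);
        double failureRate = 1.0 - combined;

        // Expected failures out of one billion requests.
        double failedOfBillion = failureRate * 1_000_000_000L;

        // Expected downtime in a 30-day month, in hours.
        double downtimeHours = failureRate * 30 * 24;

        System.out.printf("uptime: %.2f%%%n", combined * 100);
        System.out.printf("failures per billion requests: %.0f%n", failedOfBillion);
        System.out.printf("downtime per month: %.2f hours%n", downtimeHours);
    }
}
```

Running it gives an uptime of roughly 99.7%, about three million failed requests per billion, and a little over two hours of downtime per month, which matches the figures quoted from the official text.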

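Reality is usually worse.

When everything is healthy, the request flow looks like this:

When one of the dependencies fails, it can block the entire user request:

At peak traffic, a single failing or slow dependency can quickly saturate the resources of every server. Every point in the application that may reach another service over the network is a potential source of failure. Worse, a slow dependency increases latency between services and ties up queues, threads, and other system resources, which leads to cascading failures across systems.

The problem gets harder when the network call is made through a third-party "black box" client: its implementation details are hidden, it may change frequently, and it is hard to monitor or modify when something goes wrong. It is harder still when that client arrives as a transitive dependency and the host application never calls it explicitly.

Network connections fail or degrade, services fail or slow down, and client libraries have bugs.

All of these failures and latencies need to be isolated and managed so that no single failing dependency can bring down the whole application or system.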

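What are Hystrix's design principles?

Hystrix works according to the following principles:

Prevent any single dependency from using up all the threads in the container

Fail fast instead of waiting in a queue

Provide a fallback for when a failure occurs

Use isolation techniques (such as the bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any single dependency

Provide near real-time monitoring and alerting

Allow configuration changes to take effect with low latency

Protect against failures in the whole execution of the dependency client, not only in the network call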

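How does Hystrix achieve its goals?

As follows:

Wrap each remote request (or plain method call) in a HystrixCommand or HystrixObservableCommand object, which typically runs in a separate thread.

Time out calls that take longer than a threshold you define. You can keep the default or set your own value, usually slightly above the measured 99.5th-percentile latency.

Give each dependency its own thread pool. When that pool is full, new requests are rejected immediately, so one misbehaving dependency cannot block the rest of the system.

Record successes, failures, timeouts, and thread-pool rejections.

Trip a circuit breaker to stop all requests to a particular service for a period of time. The breaker can be tripped manually, or automatically when the error percentage for that service passes a threshold.

Monitor metrics and configuration changes in near real time.

Once you wrap the original remote calls with Hystrix, the overall flow changes to the one shown in the figure below.

Now let's learn how to use it.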

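Getting Started

Getting the binaries

Maven

    <dependency>
        <groupId>com.netflix.hystrix</groupId>
        <artifactId>hystrix-core</artifactId>
        <version>x.y.z</version>
    </dependency>

Ivy

    <dependency org="com.netflix.hystrix" name="hystrix-core" rev="x.y.z" />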

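If you want to download the jars instead of building them into a project, create a download-hystrix-pom.xml like this:

    <?xml version="1.0"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.netflix.hystrix.download</groupId>
        <artifactId>hystrix-download</artifactId>
        <version>1.0-SNAPSHOT</version>
        <name>Simple POM to download hystrix-core and dependencies</name>
        <url>http://github.com/Netflix/Hystrix</url>
        <dependencies>
            <dependency>
                <groupId>com.netflix.hystrix</groupId>
                <artifactId>hystrix-core</artifactId>
                <version>x.y.z</version>
            </dependency>
        </dependencies>
    </project>

Then run

    mvn -f download-hystrix-pom.xml dependency:copy-dependencies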

Hello World

The simplest possible example:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }

This command class can be used in any of the following ways:

    String s = new CommandHelloWorld("Bob").execute();
    Future<String> fs = new CommandHelloWorld("Bob").queue();
    Observable<String> os = new CommandHelloWorld("Bob").observe();

For more detailed usage, see the "How to Use" section.

Building

Check out the source and build it:

    $ git clone git@github.com:Netflix/Hystrix.git
    $ cd Hystrix/
    $ ./gradlew build

Or do a clean build:

    $ ./gradlew clean build

The build output looks roughly like this:

    $ ./gradlew build
    :hystrix-core:compileJava
    :hystrix-core:processResources UP-TO-DATE
    :hystrix-core:classes
    :hystrix-core:jar
    :hystrix-core:sourcesJar
    :hystrix-core:signArchives SKIPPED
    :hystrix-core:assemble
    :hystrix-core:licenseMain UP-TO-DATE
    :hystrix-core:licenseTest UP-TO-DATE
    :hystrix-core:compileTestJava
    :hystrix-core:processTestResources UP-TO-DATE
    :hystrix-core:testClasses
    :hystrix-core:test
    :hystrix-core:check
    :hystrix-core:build
    :hystrix-examples:compileJava
    :hystrix-examples:processResources UP-TO-DATE
    :hystrix-examples:classes
    :hystrix-examples:jar
    :hystrix-examples:sourcesJar
    :hystrix-examples:signArchives SKIPPED
    :hystrix-examples:assemble
    :hystrix-examples:licenseMain UP-TO-DATE
    :hystrix-examples:licenseTest UP-TO-DATE
    :hystrix-examples:compileTestJava
    :hystrix-examples:processTestResources UP-TO-DATE
    :hystrix-examples:testClasses
    :hystrix-examples:test
    :hystrix-examples:check
    :hystrix-examples:build

    BUILD SUCCESSFUL

    Total time: 30.758 secs

With a clean build you will also see the tests being run:

    > Building > :hystrix-core:test > 147 tests completed

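How It Works

Flow chart

The figure below shows what happens when you call a service through a client wrapped with Hystrix.

1. Construct a HystrixCommand or HystrixObservableCommand object

You issue a request by constructing one of these two objects. Pass the constructor whatever arguments the request needs.

If the dependency is expected to return a single response, use a HystrixCommand object:

    HystrixCommand command = new HystrixCommand(arg1, arg2);

If the dependency is expected to return an Observable that emits multiple responses, use a HystrixObservableCommand object:

    HystrixObservableCommand command = new HystrixObservableCommand(arg1, arg2);

2. Execute the command

There are four ways to execute a command. The first two apply only to plain HystrixCommand objects.

execute() — blocks, then returns the single response from the dependency or throws an exception

queue() — returns a Future, from which you can get the response at a suitable moment

observe() — returns a hot Observable: the command starts executing immediately

toObservable() — returns a cold Observable: the command executes only once you subscribe

    K             value   = command.execute();
    Future<K>     fValue  = command.queue();
    Observable<K> ohValue = command.observe();         // hot observable
    Observable<K> ocValue = command.toObservable();    // cold observable

Under the hood, the synchronous execute() calls queue().get(), and queue() in turn calls toObservable().toBlocking().toFuture(). In other words, every HystrixCommand is ultimately backed by an Observable implementation.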

3. Is the response cached?

If request caching is enabled and the response is available in the cache, the cached response is returned immediately and the remote call is skipped.

4. Is the circuit open?

When a command is executed, Hystrix checks whether the circuit is open. If it is, Hystrix goes to (8) and runs the fallback. If it is closed, the flow continues to (5) to check whether there is capacity to run the command.

5. Is the thread pool / queue / semaphore full?

If the thread pool and queue (or the semaphore, when not running in a thread) associated with the command have reached their limit, Hystrix goes straight to (8) and runs the fallback.

6. HystrixObservableCommand.construct() or HystrixCommand.run()

I found this passage hard to translate, so the original text is quoted below. I will add a translation once I understand it better.

Here, Hystrix invokes the request to the dependency by means of the method you have written for this purpose, one of the following:

HystrixCommand.run() — returns a single response or throws an exception

HystrixObservableCommand.construct() — returns an Observable that emits the response(s) or sends an onError notification

If the run() or construct() method exceeds the command’s timeout value, the thread will throw a TimeoutException (or a separate timer thread will, if the command itself is not running in its own thread). In that case Hystrix routes the response through 8. Get the Fallback, and it discards the eventual return value of the run() or construct() method if that method does not cancel/interrupt.

Please note that there's no way to force the latent thread to stop work - the best Hystrix can do on the JVM is to throw it an InterruptedException. If the work wrapped by Hystrix does not respect InterruptedExceptions, the thread in the Hystrix thread pool will continue its work, though the client already received a TimeoutException. This behavior can saturate the Hystrix thread pool, though the load is 'correctly shed'. Most Java HTTP client libraries do not interpret InterruptedExceptions. So make sure to correctly configure connection and read/write timeouts on the HTTP clients.

If the command did not throw any exceptions and it returned a response, Hystrix returns this response after it performs some logging and metrics reporting. In the case of run(), Hystrix returns an Observable that emits the single response and then makes an onCompleted notification; in the case of construct() Hystrix returns the same Observable returned by construct().
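The timeout behaviour described in the quoted passage can be imitated with plain JDK classes. The sketch below is not Hystrix code and does not use the Hystrix API. TimeoutSketch and callWithTimeout are hypothetical names, and the daemon thread pool is an assumption made so that the example exits cleanly. It shows the same idea: the caller gets a fallback once the deadline passes, while the worker thread only receives an interrupt, which it may or may not honour.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {

    // Daemon threads, so a stuck task cannot keep the JVM alive (an assumption of this sketch).
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    /** Runs the work on a separate thread and returns the fallback if it fails or is too slow. */
    static String callWithTimeout(Callable<String> work, long timeoutMillis, String fallback) {
        Future<String> future = POOL.submit(work);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // This only delivers an interrupt. Work that ignores interrupts keeps running
            // and keeps occupying its thread, exactly the risk the passage warns about.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the work itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> "fast response", 1_000, "fallback"));
        System.out.println(callWithTimeout(() -> {
            Thread.sleep(5_000); // honours the interrupt, so the thread is freed early
            return "slow response";
        }, 50, "fallback"));
    }
}
```

Because cancel(true) can do no more than interrupt, the advice in the passage still applies: configure connection and read/write timeouts on the HTTP client itself.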

Sequence diagram

Circuit breaker

Isolation

Threads and thread pools

Request collapsing

Request caching

How to Use

Hello World

HystrixCommand implementation:

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            // a real example would do work like a network call here
            return "Hello " + name + "!";
        }
    }

HystrixObservableCommand implementation:

    public class CommandHelloWorld extends HystrixObservableCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected Observable<String> construct() {
            return Observable.create(new Observable.OnSubscribe<String>() {
                @Override
                public void call(Subscriber<? super String> observer) {
                    try {
                        if (!observer.isUnsubscribed()) {
                            // a real example would do work like a network call here
                            observer.onNext("Hello");
                            observer.onNext(name + "!");
                            observer.onCompleted();
                        }
                    } catch (Exception e) {
                        observer.onError(e);
                    }
                }
            }).subscribeOn(Schedulers.io());
        }
    }

Synchronous execution

Just call the execute() method:

    String s = new CommandHelloWorld("World").execute();

The test case looks like this:

    @Test
    public void testSynchronous() {
        assertEquals("Hello World!", new CommandHelloWorld("World").execute());
        assertEquals("Hello Bob!", new CommandHelloWorld("Bob").execute());
    }

HystrixObservableCommand has no execute() method. You can get the same behaviour by applying .toBlocking().toFuture().get() to the Observable returned by the command.

Asynchronous execution

Just call the queue() method:

    Future<String> fs = new CommandHelloWorld("World").queue();

The return value can be retrieved like this:

    String s = fs.get();

The test cases look like this:

    @Test
    public void testAsynchronous1() throws Exception {
        assertEquals("Hello World!", new CommandHelloWorld("World").queue().get());
        assertEquals("Hello Bob!", new CommandHelloWorld("Bob").queue().get());
    }

    @Test
    public void testAsynchronous2() throws Exception {
        Future<String> fWorld = new CommandHelloWorld("World").queue();
        Future<String> fBob = new CommandHelloWorld("Bob").queue();

        assertEquals("Hello World!", fWorld.get());
        assertEquals("Hello Bob!", fBob.get());
    }

The following two statements are equivalent:

    String s1 = new CommandHelloWorld("World").execute();
    String s2 = new CommandHelloWorld("World").queue().get();

HystrixObservableCommand has no queue() method. You can get the same behaviour by applying .toBlocking().toFuture() to the Observable returned by the command.

Reactive Execution

Not yet translated.

Reactive Commands

Not yet translated.

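Fallback

You can give Hystrix a degradation strategy by providing a fallback method. When the call fails, Hystrix runs your fallback and returns its result. You should implement a fallback for most commands, with the following exceptions:

Write operations: if a command only performs a write and does not return a value (in other words, a method whose return type is void), there is usually no point in implementing a fallback. If the write fails, you want the error to propagate back to the caller.

Batch or offline computation: if a command's job is to fill a cache, generate a report, or do any large offline computation, it is usually better not to set a fallback. Let the error propagate so that the caller can retry, instead of silently substituting a degraded response.

To enable a fallback for a HystrixCommand, implement its getFallback() method. Hystrix runs this method for every kind of failure: an exception thrown by run(), a timeout, a thread-pool or semaphore rejection, and an open circuit breaker.

    public class CommandHelloFailure extends HystrixCommand<String> {

        private final String name;

        public CommandHelloFailure(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            throw new RuntimeException("this command always fails");
        }

        @Override
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }

This run() method always throws, so the fallback always runs. The test case is below:

    @Test
    public void testSynchronous() {
        assertEquals("Hello Failure World!", new CommandHelloFailure("World").execute());
        assertEquals("Hello Failure Bob!", new CommandHelloFailure("Bob").execute());
    }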

Error Propagation

Every exception thrown by run(), except HystrixBadRequestException, counts as a failure and triggers getFallback() and the circuit-breaker logic. A HystrixBadRequestException does not count toward the failure metrics and does not trigger the fallback.

Failure Type	Exception class	Exception.cause	Subject to fallback
FAILURE	`HystrixRuntimeException`	underlying exception (user-controlled)	YES
TIMEOUT	`HystrixRuntimeException`	`j.u.c.TimeoutException`	YES
SHORT_CIRCUITED	`HystrixRuntimeException`	`j.l.RuntimeException`	YES
THREAD_POOL_REJECTED	`HystrixRuntimeException`	`j.u.c.RejectedExecutionException`	YES
SEMAPHORE_REJECTED	`HystrixRuntimeException`	`j.l.RuntimeException`	YES
BAD_REQUEST	`HystrixBadRequestException`	underlying exception (user-controlled)	NO

Command Name

By default, a command's name is derived from its class name:

    getClass().getSimpleName();

To define the command name explicitly, pass it in through the constructor:

    public CommandHelloWorld(String name) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("HelloWorld")));
        this.name = name;
    }

You can also cache the Setter, so that the same instance is passed in on every construction:

    private static final Setter cachedSetter =
            Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("HelloWorld"));

    public CommandHelloWorld(String name) {
        super(cachedSetter);
        this.name = name;
    }

Command Group

Hystrix uses the command group key to group commands together for reporting, alerting, dashboards, and other statistics. The code above already sets it.

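Command Thread-Pool

The thread-pool key identifies a HystrixThreadPool. Every command belongs to exactly one HystrixThreadPool, which is identified by a HystrixThreadPoolKey. If you do not specify one, it defaults to the command's HystrixCommandGroupKey.

The reason for having this separate key, instead of simply using different group keys, is that you sometimes want the group key for aggregating statistics while still isolating the commands inside that group from one another.

Here is an example:

Two commands are used to fetch the metadata of a video

The group name is "VideoMetadata"

Command A depends on resource a

Command B depends on resource b

If command A runs into trouble, such as high latency, and saturates its thread pool, that should not affect command B, because the two commands call different back-end resources. So we want the two commands to be one logical group, but isolated from each other at execution time. That is what this key is for.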

Request Cache

You enable request caching by implementing the getCacheKey() method:

    public class CommandUsingRequestCache extends HystrixCommand<Boolean> {

        private final int value;

        protected CommandUsingRequestCache(int value) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.value = value;
        }

        @Override
        protected Boolean run() {
            return value == 0 || value % 2 == 0;
        }

        @Override
        protected String getCacheKey() {
            return String.valueOf(value);
        }
    }

Because the cache depends on the request context, we must initialize a HystrixRequestContext first. The test code is below:

    @Test
    public void testWithoutCacheHits() {
        HystrixRequestContext context = HystrixRequestContext.initializeContext();
        try {
            assertTrue(new CommandUsingRequestCache(2).execute());
            assertFalse(new CommandUsingRequestCache(1).execute());
            assertTrue(new CommandUsingRequestCache(0).execute());
            assertTrue(new CommandUsingRequestCache(58672).execute());
        } finally {
            context.shutdown();
        }
    }

Normally the context is initialized and shut down by a ServletFilter. The example below shows how the context affects caching, both the value returned and whether it came from the cache:

    @Test
    public void testWithCacheHits() {
        HystrixRequestContext context = HystrixRequestContext.initializeContext();
        try {
            CommandUsingRequestCache command2a = new CommandUsingRequestCache(2);
            CommandUsingRequestCache command2b = new CommandUsingRequestCache(2);

            assertTrue(command2a.execute());
            // this is the first time we've executed this command with
            // the value of "2" so it should not be from cache
            assertFalse(command2a.isResponseFromCache());

            assertTrue(command2b.execute());
            // this is the second time we've executed this command with
            // the same value so it should return from cache
            assertTrue(command2b.isResponseFromCache());
        } finally {
            context.shutdown();
        }

        // start a new request context
        context = HystrixRequestContext.initializeContext();
        try {
            CommandUsingRequestCache command3b = new CommandUsingRequestCache(2);
            assertTrue(command3b.execute());
            // this is a new request context so this
            // should not come from cache
            assertFalse(command3b.isResponseFromCache());
        } finally {
            context.shutdown();
        }
    }

Request Collapsing

Request collapsing lets multiple requests be batched into a single command execution.

Operations

Configuration

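Command Properties

Execution: these properties control how a command is executed.

execution.isolation.strategy: the isolation strategy, either THREAD or SEMAPHORE. The default, which is also the recommended setting, is THREAD for HystrixCommand and SEMAPHORE for HystrixObservableCommand. A HystrixCommand should use SEMAPHORE only when the call volume is so high that the overhead of separate threads is too great, which normally applies only to non-network calls. Otherwise use THREAD.

execution.isolation.thread.timeoutInMilliseconds: by default a HystrixCommand enforces a timeout (execution.timeout.enabled = true) and runs the fallback when the timeout fires. This property sets the timeout. The default is 1000 ms.

execution.timeout.enabled: as mentioned above, whether the timeout is enforced. The default is true.

execution.isolation.thread.interruptOnTimeout: whether the executing thread is interrupted on timeout. The default is true.

execution.isolation.thread.interruptOnCancel: whether the executing thread is interrupted when the command is cancelled. The default is false.

execution.isolation.semaphore.maxConcurrentRequests: if the isolation strategy is ExecutionIsolationStrategy.SEMAPHORE, this is the semaphore size, which is the maximum number of concurrent requests. Once it is reached, further requests are rejected. The default is 10. You should size the semaphore with the same reasoning you use to size a thread pool, but a semaphore has far less overhead and is meant for executions that are very fast. If your calls are not like that, choose thread-pool isolation (THREAD) instead.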
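To make the semaphore strategy concrete, here is a minimal sketch of the idea using java.util.concurrent.Semaphore. It is an illustration of the technique, not Hystrix's implementation. SemaphoreIsolationSketch and its execute method are hypothetical names. The work runs on the calling thread, and a request that cannot obtain a permit is rejected at once and served by the fallback.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreIsolationSketch {

    private final Semaphore permits;

    SemaphoreIsolationSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the work on the caller's thread, or the fallback if no permit is free. */
    String execute(Supplier<String> work, Supplier<String> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // rejected immediately, never queued
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    public static void main(String[] args) {
        SemaphoreIsolationSketch isolation = new SemaphoreIsolationSketch(1);

        // The outer call holds the only permit, so the nested call is rejected.
        String nested = isolation.execute(
                () -> isolation.execute(() -> "inner ran", () -> "inner rejected"),
                () -> "outer rejected");
        System.out.println(nested);

        // The permit was released, so a later call succeeds.
        System.out.println(isolation.execute(() -> "ran", () -> "rejected"));
    }
}
```

Since nothing is handed to another thread, this approach cannot time out a slow call on its own, which is one reason it suits only very fast, non-network work.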

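Fallback: these properties control how HystrixCommand.getFallback() executes. They apply to both THREAD and SEMAPHORE isolation.

fallback.isolation.semaphore.maxConcurrentRequests: the maximum number of concurrent calls to the fallback. The default is 10. Once this limit is reached, further requests are rejected and an exception is thrown, because no fallback could be retrieved.

fallback.enabled: whether the fallback is attempted when a failure or timeout occurs. The default is true.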

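Circuit Breaker

circuitBreaker.enabled: whether the circuit breaker is used. The default is true.

circuitBreaker.requestVolumeThreshold: the minimum number of requests in a rolling window that is needed before the circuit can trip. The default is 20. For example, if only 19 requests fall inside the window, the circuit will not trip even if all 19 fail.

circuitBreaker.sleepWindowInMilliseconds: how long, after the circuit trips, requests are rejected before one is let through again to test whether the circuit should stay open. The default is 5000 ms.

circuitBreaker.errorThresholdPercentage: the error percentage at or above which the circuit trips. The default is 50%.

circuitBreaker.forceOpen: if true, forces the circuit breaker into the open (tripped) state, in which it rejects all requests. Original text: "This property, if true, forces the circuit breaker into an open (tripped) state in which it will reject all requests."

circuitBreaker.forceClosed: if true, forces the circuit breaker into the closed state, in which it lets requests through no matter what the error percentage is. forceOpen takes precedence, so when forceOpen is true this property has no effect. Original text: "This property, if true, forces the circuit breaker into a closed state in which it will allow requests regardless of the error percentage. The circuitBreaker.forceOpen property takes precedence so if it is set to true this property does nothing."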
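The way the circuit-breaker properties interact can be written down as two small checks. This is a simplified sketch of the decision logic described above, not Hystrix's source. CircuitBreakerSketch and both method names are hypothetical, and the counts are assumed to come from the current rolling window.

```java
final class CircuitBreakerSketch {

    /** True if the window's statistics should trip (open) the circuit. */
    static boolean shouldTrip(int totalRequests, int failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalRequests < requestVolumeThreshold) {
            return false; // too little traffic to judge, whatever the failure rate
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    /** True if an open circuit has slept long enough to let one trial request through. */
    static boolean allowsTrialRequest(long openedAtMillis, long nowMillis, long sleepWindowMillis) {
        return nowMillis > openedAtMillis + sleepWindowMillis;
    }

    public static void main(String[] args) {
        // Defaults from the text: threshold 20 requests, 50% errors, 5000 ms sleep window.
        System.out.println(shouldTrip(19, 19, 20, 50)); // 19 requests, all failed
        System.out.println(shouldTrip(20, 10, 20, 50)); // 20 requests, half failed
        System.out.println(allowsTrialRequest(0, 4_999, 5_000));
        System.out.println(allowsTrialRequest(0, 5_001, 5_000));
    }
}
```

With the default values, 19 failures out of 19 requests leave the circuit closed, while 10 failures out of 20 trip it. Once open, no trial request is allowed until more than 5000 ms have passed.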

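Metrics: these properties control how metrics are collected while commands run.

metrics.rollingStats.timeInMilliseconds: the length of the rolling statistics window, meaning how long Hystrix keeps metrics for the circuit breaker to use and for publishing. The default is 10000 ms. The figure below illustrates it.

metrics.rollingStats.numBuckets: used together with the property above. It is the number of buckets the rolling window is divided into for finer-grained collection. The default is 10. timeInMilliseconds must be evenly divisible by numBuckets.

metrics.rollingPercentile.enabled: whether execution latencies are tracked and computed as percentiles and means. The default is true. If it is disabled, all such summary statistics return -1.

metrics.rollingPercentile.timeInMilliseconds: the length of the rolling window used for percentile and mean calculation. The default is 60000 ms.

metrics.rollingPercentile.numBuckets: likewise, the number of buckets the percentile window is divided into (the official default is 6). timeInMilliseconds must be evenly divisible by this value.

metrics.rollingPercentile.bucketSize: the maximum number of executions kept per bucket for percentile calculation. The default is 100. For example, with a value of 100, if 500 executions fall into one bucket, only the last 100 are kept and used for the calculation.

metrics.healthSnapshot.intervalInMilliseconds: the interval between health snapshots, which compute the success and error percentages that drive the circuit breaker. The default is 500 ms. Calculating these figures is itself CPU-intensive, so on a system under heavy CPU load this value lets you control how often the calculation runs.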
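The relationship between a window length and its bucket count is easy to get wrong, so here is a small sketch of the divisibility rule. RollingWindowSketch and bucketSizeMillis are hypothetical names used only for illustration, and this is not code from Hystrix.

```java
final class RollingWindowSketch {

    /** Length of one bucket, or an exception if the window cannot be split evenly. */
    static int bucketSizeMillis(int timeInMilliseconds, int numBuckets) {
        if (numBuckets <= 0 || timeInMilliseconds % numBuckets != 0) {
            throw new IllegalArgumentException(
                    timeInMilliseconds + " ms cannot be divided evenly into " + numBuckets + " buckets");
        }
        return timeInMilliseconds / numBuckets;
    }

    public static void main(String[] args) {
        // rollingStats defaults from the text: 10000 ms split into 10 buckets.
        System.out.println(bucketSizeMillis(10_000, 10) + " ms per bucket");
        // rollingPercentile window from the text: 60000 ms, here split into 6 buckets.
        System.out.println(bucketSizeMillis(60_000, 6) + " ms per bucket");
    }
}
```

A window of 10000 ms with 10 buckets gives buckets of 1000 ms each, whereas a combination such as 10000 ms with 7 buckets is rejected.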

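Request Context: these properties concern the HystrixRequestContext.

requestCache.enabled: whether HystrixCommand.getCacheKey() is used for request caching. The default is true.

requestLog.enabled: whether command execution and events are logged to HystrixRequestLog.

Collapser Properties

Not important; skipped.

Thread Pool Properties

The following properties control the thread pool that commands run on. They correspond to the parameters of Java's ThreadPoolExecutor. In most cases 10 threads are enough, and often fewer will do. To work out how many threads are appropriate, use this rule of thumb:

requests per second at peak × 99th-percentile latency in seconds + some breathing room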
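The rule of thumb can be turned into a tiny calculation. The sketch below is only an illustration. ThreadPoolSizing and suggestedCoreSize are hypothetical names, latency is taken in milliseconds to keep the arithmetic in integers, and the amount of breathing room is left to the caller because the text gives no fixed value for it.

```java
final class ThreadPoolSizing {

    /**
     * Threads kept busy at peak (requests per second times p99 latency in seconds,
     * rounded up) plus a caller-chosen amount of breathing room.
     */
    static int suggestedCoreSize(int peakRequestsPerSecond, int p99LatencyMillis, int breathingRoom) {
        long busyMillisPerSecond = (long) peakRequestsPerSecond * p99LatencyMillis;
        int busyThreads = (int) ((busyMillisPerSecond + 999) / 1000); // ceiling division
        return busyThreads + breathingRoom;
    }

    public static void main(String[] args) {
        // 30 requests per second at a p99 latency of 200 ms keep 6 threads busy.
        System.out.println(suggestedCoreSize(30, 200, 4));
    }
}
```

For example, 30 requests per second at a 99th-percentile latency of 0.2 seconds keep 6 threads busy. Adding 4 threads of breathing room gives 10, which is the default coreSize.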

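coreSize: the core thread-pool size. The default is 10.

maximumSize: the maximum thread-pool size. It takes effect only when allowMaximumSizeToDivergeFromCoreSize is true. The default is 10, the same as the core size.

maxQueueSize: the maximum size of the BlockingQueue implementation. The default is -1. A value of -1 means a SynchronousQueue is used; any positive value means a LinkedBlockingQueue of that fixed size. When all core threads are busy, new requests wait in this queue, and requests beyond its capacity are rejected.

allowMaximumSizeToDivergeFromCoreSize: whether the thread pool is allowed to grow up to maximumSize. The default is false.

queueSizeRejectionThreshold: once the number of queued requests reaches this value, further requests are rejected, even if maxQueueSize has not been reached. The property exists because maxQueueSize is fixed when the queue is created and cannot be changed afterwards, whereas this threshold can be adjusted dynamically. It is also the reason maxQueueSize sometimes seems to have no effect. The default is 5.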