RLB: Reordering-Robust Load Balancing in Lossless Datacenter Networks | Proceedings of the 52nd International Conference on Parallel Processing

research-article

Authors: Jinbin Hu, Yi He, Jin Wang, Wangqing Luo, and Jiawei Huang

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

August 2023

Pages 576-584

Published: 13 September 2023

    Abstract

    Many existing load balancing mechanisms work effectively in lossy datacenter networks (DCNs), but they suffer from serious packet reordering in lossless Ethernet DCNs that deploy hop-by-hop Priority-based Flow Control (PFC). The key reason is that prior solutions cannot perceive PFC triggering accurately and in time when making load balancing decisions. Once a forwarding path is paused by PFC, the packets already assigned to it are blocked, which inevitably leads to out-of-order packets and retransmissions. In this paper, we present RLB, a Reordering-robust Load Balancing scheme with PFC prediction for lossless DCNs. At its heart, RLB uses the derivative of the ingress queue length to predict PFC triggering and proactively notifies upstream switches to choose an appropriate rerouting path or to recirculate packets, thereby avoiding reordering. As a building block for existing load balancing mechanisms, we have integrated RLB into Presto, LetFlow, Hermes and DRILL. Test results show that the RLB-enhanced solutions deliver significant performance gains by avoiding packet reordering. For example, they reduce the 99th percentile flow completion time (FCT) by up to 58%, 67%, 72% and 54% compared with the original Presto, LetFlow, Hermes and DRILL, respectively.
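    The abstract says RLB predicts PFC triggering from the derivative of the ingress queue length and then lets upstream switches reroute or recirculate. The paper's exact predictor and path-selection rules are not given on this page, so the Python below is only a minimal sketch under stated assumptions: the derivative is a finite difference between successive queue-length samples, the prediction is a linear extrapolation over a fixed look-ahead horizon, and the names `PfcPredictor`, `choose_action`, `xoff_bytes` and `horizon_s` are hypothetical. The upstream rule (keep the preferred path, else take the first path with no warning, else recirculate) is a simplification rather than RLB's actual policy, and the notification message itself is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class PfcPredictor:
    """Hypothetical derivative-based PFC predictor for one ingress queue.

    xoff_bytes and horizon_s are illustrative assumptions, not values from RLB.
    """
    xoff_bytes: float                  # assumed PFC pause (XOFF) threshold in bytes
    horizon_s: float                   # assumed look-ahead window in seconds
    last_len: Optional[float] = None   # previous queue-length sample (bytes)
    last_time: Optional[float] = None  # timestamp of the previous sample (s)

    def update(self, queue_len: float, now: float) -> bool:
        """Record a sample; return True if PFC is predicted within the horizon."""
        if self.last_len is None or self.last_time is None or now <= self.last_time:
            rate = 0.0  # no usable history yet: treat the queue as flat
        else:
            # Finite-difference estimate of d(queue_len)/dt in bytes per second.
            rate = (queue_len - self.last_len) / (now - self.last_time)
        self.last_len, self.last_time = queue_len, now
        # Linear extrapolation; only a growing queue raises the prediction.
        predicted = queue_len + max(rate, 0.0) * self.horizon_s
        return predicted >= self.xoff_bytes


def choose_action(preferred: int,
                  pause_predicted: Sequence[bool]) -> Tuple[str, Optional[int]]:
    """Hypothetical upstream rule applied after per-path PFC warnings arrive.

    Keep the preferred path if no pause is predicted on it, otherwise reroute
    to the first unwarned path, and recirculate if every path is warned.
    """
    if not pause_predicted[preferred]:
        return ("forward", preferred)
    for path, warned in enumerate(pause_predicted):
        if not warned:
            return ("reroute", path)
    return ("recirculate", None)


if __name__ == "__main__":
    pred = PfcPredictor(xoff_bytes=100_000, horizon_s=10e-6)
    print(pred.update(20_000, 0.0))    # False: no rate estimate yet
    print(pred.update(60_000, 5e-6))   # True: 60 KB + 8e9 B/s * 10 us = 140 KB
    print(pred.update(60_000, 15e-6))  # False: the queue stopped growing
    print(choose_action(0, [True, False, True]))  # ('reroute', 1)
    print(choose_action(0, [True, True]))         # ('recirculate', None)
```

    In this example a queue that grows from 20,000 to 60,000 bytes in 5 µs triggers a prediction while it is still well below the assumed 100,000-byte threshold. That early warning is what a derivative-based predictor adds over a plain threshold check, and the warning clears as soon as the queue stops growing.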

    Supplementary Material

    Appendix (apdx175s2-filled_template_compiled_pdf_outfn.pdf, 277.10 KB)

    References

    [1] M. Alizadeh, A. Greenberg, D. A. Maltz, et al. Data Center TCP (DCTCP). In Proc. ACM SIGCOMM, 2010.

    [2] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang. Congestion Control for Large-Scale RDMA Deployments. In Proc. ACM SIGCOMM, 2015.

    [3] W. Cheng, K. Qian, W. Jiang, T. Zhang, and F. Ren. Re-architecting Congestion Management in Lossless Ethernet. In Proc. USENIX NSDI, 2020.

    [4] K. Qian, W. Cheng, T. Zhang, and F. Ren. Gentle Flow Control: Avoiding Deadlock in Lossless Networks. In Proc. ACM SIGCOMM, 2019.

    [5] Y. Zhu, M. Ghobadi, V. Misra, and J. Padhye. ECN or Delay: Lessons Learnt from Analysis of DCQCN and TIMELY. In Proc. ACM CoNEXT, 2016.

    [6] C. Guo, H. Wu, Z. Deng, G. Soni, J. Ye, J. Padhye, and M. Lipshteyn. RDMA over Commodity Ethernet at Scale. In Proc. ACM SIGCOMM, 2016.

    [7] W. Bai, A. Agrawal, A. Bhagat, M. Elhaddad, N. John, J. Padhye, et al. Empowering Azure Storage with 100 × 100 RDMA. In Proc. USENIX NSDI, 2023.

    [8] J. Xue, M. U. Chaudhry, B. Vamanan, T. N. Vijaykumar, and M. Thottethodi. Dart: Divide and Specialize for Fast Response to Congestion in RDMA-based Datacenter Networks. IEEE/ACM Transactions on Networking, 28(1):322-335, 2020.

    [9] Y. Lu, G. Chen, B. Li, K. Tan, Y. Xiong, P. Cheng, J. Zhang, E. Chen, and T. Moscibroda. Multipath Transport for RDMA in Datacenters. In Proc. USENIX NSDI, 2018.

    [10] C. Tian, B. Li, L. Qin, J. Zheng, J. Yang, W. Wang, G. Chen, and W. Dou. P-PFC: Reducing Tail Latency with Predictive PFC in Lossless Data Center Networks. IEEE Transactions on Parallel and Distributed Systems, 31(6):1447-1459, 2020.

    [11] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats. TIMELY: RTT-based Congestion Control for the Datacenter. In Proc. ACM SIGCOMM, 2015.

    [12] G. Kumar, N. Dukkipati, K. Jang, et al. Swift: Delay is Simple and Effective for Congestion Control in the Datacenter. In Proc. ACM SIGCOMM, 2020.

    [13] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In Proc. ACM SIGCOMM, 2014.

    [14] K. He, E. Rozner, K. Agarwal, W. Felter, J. Carter, and A. Akella. Presto: Edge-based Load Balancing for Fast Datacenter Networks. In Proc. ACM SIGCOMM, 2015.

    [15] E. Vanini, R. Pan, M. Alizadeh, P. Taheri, and T. Edsall. Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching. In Proc. USENIX NSDI, 2017.

    [16] H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury. Resilient Datacenter Load Balancing in the Wild. In Proc. ACM SIGCOMM, 2017.

    [17] S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, and A. Firoozshahian. DRILL: Micro Load Balancing for Low-Latency Data Center Networks. In Proc. ACM SIGCOMM, 2017.

    [18] The P4.org Architecture Working Group. P4₁₆ Portable Switch Architecture (PSA). https://p4.org/p4-spec/docs/PSA-v0.9.0-draft.html#sec-recirculate.

    [19] Z. Liu, K. Chen, H. Wu, S. Hu, Y. Hu, Y. Wang, and G. Zhang. Enabling Work-Conserving Bandwidth Guarantees for Multi-Tenant Datacenters via Dynamic Tenant-Queue Binding. In Proc. IEEE INFOCOM, 2018.

    [20] W. Bai, S. Hu, K. Chen, K. Tan, and Y. Xiong. One More Config is Enough: Saving (DC)TCP for High-Speed Extremely Shallow-Buffered Datacenters. IEEE/ACM Transactions on Networking, 29(2):489-502, 2020.

    [21] IEEE 802.1Qbb: Priority-based Flow Control. https://1.ieee802.org/dcb/802-1qbb/.

    [22] C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. Maltz, Z. Liu, V. Wang, B. Pang, H. Chen, Z. Lin, and V. Kurien. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In Proc. ACM SIGCOMM, 2015.

    [23] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford. HULA: Scalable Load Balancing Using Programmable Data Planes. In Proc. ACM Symposium on SDN Research, 2016.

    [24] N. Katta, A. Ghag, M. Hira, I. Keslassy, A. Bergman, C. Kim, and J. Rexford. Clove: Congestion-Aware Load Balancing at the Virtual Edges. In Proc. ACM CoNEXT, 2017.

    [25] W. Bai, S. Hu, K. Chen, K. Tan, and Y. Xiong. One More Config is Enough: Saving (DC)TCP for High-Speed Extremely Shallow-Buffered Datacenters. In Proc. IEEE INFOCOM, 2020.

    [26] J. Hu, J. Huang, Z. Li, Y. Li, W. Jiang, K. Chen, J. Wang, and T. He. RPO: Receiver-driven Transport Protocol Using Opportunistic Transmission in Data Center. In Proc. IEEE ICNP, 2021.

    [27] J. Hu, J. Huang, Z. Li, J. Wang, and T. He. A Receiver-Driven Transport Protocol with High Link Utilization Using Anti-ECN Marking in Data Center Networks. IEEE Transactions on Network and Service Management, 2022.

    [28] J. Zhang, W. Bai, and K. Chen. Enabling ECN for Datacenter Networks with RTT Variations. In Proc. ACM CoNEXT, 2019.

    [29] IEEE 802.1Qau: Congestion Notification. http://www.ieee802.org/1/pages/802.1au.html.

    [30] A. Saeed, V. Gupta, P. Goyal, M. Sharif, R. Pan, M. Ammar, E. Zegura, K. Jang, M. Alizadeh, A. Kabbani, and A. Vahdat. Annulus: A Dual Congestion Control Loop for Datacenter and WAN Traffic Aggregates. In Proc. ACM SIGCOMM, 2020.

    [31] A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella. On the Impact of Packet Spraying in Data Center Networks. In Proc. IEEE INFOCOM, 2013.

    [32] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In Proc. USENIX NSDI, 2010.

    [33] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine Grained Traffic Engineering for Data Centers. In Proc. ACM CoNEXT, 2011.

    [34] J. Hu, J. Huang, W. Lv, W. Li, J. Wang, and T. He. TLB: Traffic-aware Load Balancing with Adaptive Granularity in Data Center Networks. In Proc. ACM ICPP, 2019.

    [35] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving Datacenter Performance and Robustness with Multipath TCP. In Proc. ACM SIGCOMM, 2011.

    [36] R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Ratnasamy, and S. Shenker. Revisiting Network Support for RDMA. In Proc. ACM SIGCOMM, 2018.

    Index Terms

    1. RLB: Reordering-Robust Load Balancing in Lossless Datacenter Networks

      1. Networks

        1. Network architectures

          1. Network protocols

            1. Network layer protocols

              1. Routing protocols

            2. Network types

              1. Data center networks

          Information & Contributors

          Information

          Published In

          ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

          August 2023

          858 pages

          ISBN:9798400708435

          DOI:10.1145/3605573

          Copyright © 2023 ACM.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 13 September 2023

          Author Tags

          1. Data Center
          2. Load Balancing
          3. Lossless Networks
          4. Reordering

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Funding Sources

          • The Scientific Research Fund of Hunan Provincial Education Department
          • The Natural Science Foundation of Hunan Province
          • The National Natural Science Foundation of China

          Conference

          ICPP 2023

          ICPP 2023: 52nd International Conference on Parallel Processing

          August 7 - 10, 2023

          Salt Lake City, UT, USA

          Acceptance Rates

          Overall Acceptance Rate 91 of 313 submissions, 29%
