Fabric Network

The Experts below are selected from a list of 123 Experts worldwide, ranked by the ideXlab platform.

Justin Meza - One of the best experts on this subject based on the ideXlab platform.

  • Internet Measurement Conference - A Large Scale Study of Data Center Network Reliability
    Proceedings of the Internet Measurement Conference 2018, 2018
    Co-Authors: Justin Meza, Kaushik Veeraraghavan, Onur Mutlu
    Abstract:

    The ability to tolerate, remediate, and recover from Network incidents (caused by device failures and fiber cuts, for example) is critical for building and operating highly-available web services. Achieving fault tolerance and failure preparedness requires system architects, software developers, and site operators to have a deep understanding of Network reliability at scale, along with its implications for the software systems that run in data centers. Unfortunately, little has been reported on the reliability characteristics of large scale data center Network infrastructure, let alone its impact on the availability of services powered by software running on that Network infrastructure. This paper fills the gap by presenting a large scale, longitudinal study of data center Network reliability based on operational data collected from the production Network infrastructure at Facebook, one of the largest web service providers in the world. Our study covers reliability characteristics of both intra and inter data center Networks. For intra data center Networks, we study seven years of operation data comprising thousands of Network incidents across two different data center Network designs, a cluster Network design and a state-of-the-art Fabric Network design. For inter data center Networks, we study eighteen months of recent repair tickets from the field to understand the reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of Network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of Network reliability for the design, implementation, and operation of large scale data center systems and how it affects highly-available web services. We hope our study forms a foundation for understanding the reliability of large scale Network infrastructure, and inspires new reliability solutions to Network incidents.
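
    To make the kind of analysis described above concrete, the minimal sketch below computes an annualized incident rate per data center Network design (cluster versus Fabric) from a toy incident log. The record schema, observation windows, and values are assumptions for illustration only, not the paper's data or tooling.

    # Hypothetical sketch: per-design incident rates from an incident log.
    # All records and exposure windows are made up for illustration.
    from collections import defaultdict
    from datetime import datetime

    incidents = [
        # (timestamp, network_design, severity) -- assumed schema
        (datetime(2014, 3, 1), "cluster", "high"),
        (datetime(2016, 7, 9), "fabric", "medium"),
        (datetime(2017, 1, 20), "fabric", "low"),
    ]

    # Assumed years of observed operation for each design.
    observation_years = {"cluster": 7.0, "fabric": 4.0}

    counts = defaultdict(int)
    for _timestamp, design, _severity in incidents:
        counts[design] += 1

    for design, years in observation_years.items():
        rate = counts[design] / years  # incidents per year of operation
        print(f"{design}: {rate:.2f} incidents/year")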

  • Large Scale Studies of Memory, Storage, and Network Failures in a Modern Data Center
    2018
    Co-Authors: Justin Meza
    Abstract:

    The workloads running in the modern data centers of large scale Internet service providers (such as Alibaba, Amazon, Baidu, Facebook, Google, and Microsoft) support billions of users and span globally distributed infrastructure. Yet, the devices used in modern data centers fail due to a variety of causes, from faulty components to bugs to misconfiguration. Faulty devices make operating large scale data centers challenging because the workloads running in modern data centers consist of interdependent programs distributed across many servers, so failures that are isolated to a single device can still have a widespread effect on a workload.

    In this dissertation, we measure and model the device failures in a large scale Internet service company, Facebook. We focus on three device types that form the foundation of Internet service data center infrastructure: DRAM for main memory, SSDs for persistent storage, and switches and backbone links for Network connectivity. For each of these device types, we analyze long term device failure data broken down by important device attributes and operating conditions, such as age, vendor, and workload. We also build and release statistical models of the failure trends for the devices we analyze.

    For DRAM devices, we analyze the memory errors in the entire fleet of servers at Facebook over the course of fourteen months, representing billions of device days of operation. The systems we examine cover a wide range of devices commonly used in modern servers, with DIMMs that use the modern DDR3 communication protocol, manufactured by 4 vendors in capacities ranging from 2GB to 24GB. We observe several new reliability trends for memory systems that have not been discussed before in the literature, develop a model for memory reliability, and show how system design choices such as using lower density DIMMs and fewer cores per chip can reduce failure rates of a baseline server by up to 57.7%. We perform the first implementation and real-system analysis of page offlining at scale, on a cluster of thousands of servers, identify several real-world impediments to the technique, and show that it can reduce the memory error rate by 67%. We also examine the efficacy of a new technique to reduce DRAM faults, physical page randomization, and evaluate its potential for improving reliability and its overheads.

    For SSD devices, we perform a large scale study of flash-based SSD reliability at Facebook. We analyze data collected across a majority of flash-based solid state drives over nearly four years and many millions of operational hours in order to understand the failure properties and trends of flash-based SSDs. Our study considers a variety of SSD characteristics, including: the amount of data written to and read from flash chips; how data is mapped within the SSD address space; the amount of data copied, erased, and discarded by the flash controller; and flash board temperature and bus power. Based on our field analysis of how flash memory errors manifest when running modern workloads on modern SSDs, we make several major observations and find that SSD failure rates do not increase monotonically with flash chip wear, but instead go through several distinct periods corresponding to how failures emerge and are subsequently detected.

    For Network devices, we perform a large scale, longitudinal study of data center Network reliability based on operational data collected from the production Network infrastructure at Facebook. Our study covers reliability characteristics of both intra and inter data center Networks. For intra data center Networks, we study seven years of operation data comprising thousands of Network incidents across two different data center Network designs, a cluster Network design and a state-of-the-art Fabric Network design. For inter data center Networks, we study eighteen months of recent repair tickets from the field to understand the reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of Network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of Network reliability for the design, implementation, and operation of large scale data center systems and how the Network affects highly-available web services.

    Our key conclusion in this dissertation is that we can gain a deep understanding of why devices fail (and how to predict their failure) using measurement and modeling. We hope that the analysis, techniques, and models we present in this dissertation will enable the community to better measure, understand, and prepare for the hardware reliability challenges we face in the future.
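
    As a hedged illustration of the attribute breakdown described above, the sketch below groups a toy DRAM failure log by vendor and capacity and reports a failure rate per group. The schema and records are assumptions for illustration, not Facebook's data or the dissertation's released models.

    # Hypothetical sketch: failure rates broken down by device attributes.
    from collections import defaultdict

    # (vendor, capacity_gb, failed) -- assumed schema, made-up records
    dimms = [
        ("A", 2, False), ("A", 16, True), ("B", 8, False),
        ("B", 24, True), ("C", 4, False), ("C", 16, False),
    ]

    totals = defaultdict(int)
    failures = defaultdict(int)
    for vendor, capacity_gb, failed in dimms:
        totals[(vendor, capacity_gb)] += 1
        failures[(vendor, capacity_gb)] += int(failed)

    for vendor, capacity_gb in sorted(totals):
        rate = failures[(vendor, capacity_gb)] / totals[(vendor, capacity_gb)]
        print(f"vendor={vendor} capacity={capacity_gb}GB failure_rate={rate:.2%}")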

Onur Mutlu - One of the best experts on this subject based on the ideXlab platform.

  • Internet Measurement Conference - A Large Scale Study of Data Center Network Reliability
    Proceedings of the Internet Measurement Conference 2018, 2018
    Co-Authors: Justin Meza, Kaushik Veeraraghavan, Onur Mutlu
    Abstract:

    The ability to tolerate, remediate, and recover from Network incidents (caused by device failures and fiber cuts, for example) is critical for building and operating highly-available web services. Achieving fault tolerance and failure preparedness requires system architects, software developers, and site operators to have a deep understanding of Network reliability at scale, along with its implications for the software systems that run in data centers. Unfortunately, little has been reported on the reliability characteristics of large scale data center Network infrastructure, let alone its impact on the availability of services powered by software running on that Network infrastructure. This paper fills the gap by presenting a large scale, longitudinal study of data center Network reliability based on operational data collected from the production Network infrastructure at Facebook, one of the largest web service providers in the world. Our study covers reliability characteristics of both intra and inter data center Networks. For intra data center Networks, we study seven years of operation data comprising thousands of Network incidents across two different data center Network designs, a cluster Network design and a state-of-the-art Fabric Network design. For inter data center Networks, we study eighteen months of recent repair tickets from the field to understand the reliability of Wide Area Network (WAN) backbones. In contrast to prior work, we study the effects of Network reliability on software systems, and how these reliability characteristics evolve over time. We discuss the implications of Network reliability for the design, implementation, and operation of large scale data center systems and how it affects highly-available web services. We hope our study forms a foundation for understanding the reliability of large scale Network infrastructure, and inspires new reliability solutions to Network incidents.

Chandra, Shekhar S. - One of the best experts on this subject based on the ideXlab platform.

  • Fabric Image Representation Encoding Networks for Large-scale 3D Medical Image Analysis
    2020
    Co-Authors: Liu Siyu, Dai Wei, Engstrom Craig, Fripp Jurgen, Greer, Peter B., Crozier Stuart, Dowling, Jason A., Chandra, Shekhar S.
    Abstract:

    Deep neural Networks are parameterised by weights that encode feature representations, whose performance is dictated through generalisation by using large-scale feature-rich datasets. The lack of large-scale labelled 3D medical imaging datasets restricts the construction of such generalised Networks. In this work, a novel 3D segmentation Network, Fabric Image Representation Networks (FIRENet), is proposed to extract and encode generalisable feature representations from multiple medical image datasets in a large-scale manner. FIRENet learns image-specific feature representations by way of a 3D Fabric Network architecture that contains an exponential number of sub-architectures to handle various protocols and coverage of anatomical regions and structures. The Fabric Network uses Atrous Spatial Pyramid Pooling (ASPP) extended to 3D to extract local and image-level features at a fine selection of scales. The Fabric is constructed with weighted edges, allowing the learnt features to dynamically adapt to the training data at an architecture level. Conditional padding modules, which are integrated into the Network to reinsert voxels discarded by feature pooling, allow the Network to inherently process different-size images at their original resolutions. FIRENet was trained for feature learning via automated semantic segmentation of pelvic structures and obtained a state-of-the-art median DSC score of 0.867. FIRENet was also simultaneously trained on MR (Magnetic Resonance) images acquired from 3D examinations of musculoskeletal elements in the (hip, knee, shoulder) joints and a public OAI knee dataset to perform automated segmentation of bone across anatomy. Transfer learning was used to show that the features learnt through the pelvic segmentation helped achieve improved mean DSC scores of 0.962, 0.963, 0.945 and 0.986 for automated segmentation of bone across datasets. Comment: 12 pages, 10 figures
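
    As a hedged sketch of the multi-scale feature extraction the abstract describes, the PyTorch module below implements a 3D Atrous Spatial Pyramid Pooling (ASPP) block: parallel dilated 3D convolutions at several rates, fused by a 1x1x1 convolution. The channel counts and dilation rates are assumptions for illustration, not the paper's FIRENet configuration.

    # Hypothetical 3D ASPP sketch; rates and channel sizes are assumptions.
    import torch
    import torch.nn as nn

    class ASPP3D(nn.Module):
        def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
            super().__init__()
            # One dilated 3D convolution per rate; larger rates see wider context.
            self.branches = nn.ModuleList([
                nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
                for r in rates
            ])
            # A 1x1x1 convolution fuses the concatenated multi-scale features.
            self.fuse = nn.Conv3d(out_ch * len(rates), out_ch, kernel_size=1)

        def forward(self, x):
            feats = [torch.relu(branch(x)) for branch in self.branches]
            return self.fuse(torch.cat(feats, dim=1))

    # Usage: a single-channel 3D volume shaped (batch, channel, depth, height, width).
    volume = torch.randn(1, 1, 32, 64, 64)
    print(ASPP3D(1, 16)(volume).shape)  # torch.Size([1, 16, 32, 64, 64])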

Shekhar S. Chandra - One of the best experts on this subject based on the ideXlab platform.

  • Fabric Image Representation Encoding Networks for Large-scale 3D Medical Image Analysis.
    arXiv: Image and Video Processing, 2020
    Co-Authors: Siyu Liu, Wei Dai, Craig Engstrom, Jurgen Fripp, Peter B. Greer, Stuart Crozier, Jason A. Dowling, Shekhar S. Chandra
    Abstract:

    Deep neural Networks are parameterised by weights that encode feature representations, whose performance is dictated through generalisation by using large-scale feature-rich datasets. The lack of large-scale labelled 3D medical imaging datasets restricts the construction of such generalised Networks. In this work, a novel 3D segmentation Network, Fabric Image Representation Networks (FIRENet), is proposed to extract and encode generalisable feature representations from multiple medical image datasets in a large-scale manner. FIRENet learns image-specific feature representations by way of a 3D Fabric Network architecture that contains an exponential number of sub-architectures to handle various protocols and coverage of anatomical regions and structures. The Fabric Network uses Atrous Spatial Pyramid Pooling (ASPP) extended to 3D to extract local and image-level features at a fine selection of scales. The Fabric is constructed with weighted edges, allowing the learnt features to dynamically adapt to the training data at an architecture level. Conditional padding modules, which are integrated into the Network to reinsert voxels discarded by feature pooling, allow the Network to inherently process different-size images at their original resolutions. FIRENet was trained for feature learning via automated semantic segmentation of pelvic structures and obtained a state-of-the-art median DSC score of 0.867. FIRENet was also simultaneously trained on MR (Magnetic Resonance) images acquired from 3D examinations of musculoskeletal elements in the (hip, knee, shoulder) joints and a public OAI knee dataset to perform automated segmentation of bone across anatomy. Transfer learning was used to show that the features learnt through the pelvic segmentation helped achieve improved mean DSC scores of 0.962, 0.963, 0.945 and 0.986 for automated segmentation of bone across datasets.

Robert G. Reynolds - One of the best experts on this subject based on the ideXlab platform.

  • IEEE Congress on Evolutionary Computation - A social metrics based process model on complex social system
    2014 IEEE Congress on Evolutionary Computation (CEC), 2014
    Co-Authors: Xiangdong Che, Robert G. Reynolds
    Abstract:

    In previous work, we investigated the performance of Cultural Algorithms (CA) over the complete range of system complexities in a benchmarked environment. In this paper the goal is to discover whether there is a similar internal process going on in CA problem solving, regardless of the complexity of the problem. We monitor the “vital signs” of a cultural system during the problem solving process to determine whether it is on track, and to infer the complexity class of a social system from those “vital signs”. We first demonstrate how the learning curve for a Cultural System is supported by the interaction of the knowledge sources. Next, a circulatory system metaphor is used to describe how the exploratory knowledge sources generate new information that is distributed to the agents via the Social Fabric Network. We then conclude that the Social Metrics are able to indicate the progress of problem solving in terms of the system's ability to periodically lower the innovation cost for the performance of a knowledge source, which allows the influenced population to expand and explore new solution possibilities, as seen in the dispersion metric. Hence we present the possibility of assessing the complexity of a system's environment by looking at the Social Metrics.
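
    As a hedged illustration of one of the “vital signs” mentioned above, the sketch below computes a simple dispersion metric over the positions of agents grouped by the knowledge source influencing them (mean distance from the group's centroid). The agent records, knowledge source names, and the exact metric are assumptions for illustration, not the authors' implementation.

    # Hypothetical dispersion-metric sketch; all data below is made up.
    import math
    from collections import defaultdict

    # (influencing_knowledge_source, position_in_2d_search_space)
    agents = [
        ("topographic", (0.1, 0.2)), ("topographic", (0.4, 0.1)),
        ("situational", (0.9, 0.8)), ("situational", (0.7, 0.9)),
    ]

    groups = defaultdict(list)
    for source, position in agents:
        groups[source].append(position)

    for source, positions in groups.items():
        # Dispersion as the mean distance of the group's agents from their centroid.
        cx = sum(p[0] for p in positions) / len(positions)
        cy = sum(p[1] for p in positions) / len(positions)
        dispersion = sum(math.dist(p, (cx, cy)) for p in positions) / len(positions)
        print(f"{source}: dispersion={dispersion:.3f}")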