Underlying File

The experts below are selected from a list of 13,068 experts worldwide ranked by the ideXlab platform.

Youngjae Kim - One of the best experts on this subject based on the ideXlab platform.

  • An Integrated Indexing and Search Service for Distributed File Systems
    IEEE Transactions on Parallel and Distributed Systems, 2020
    Co-Authors: Hyogi Sim, Awais Khan, Sudharshan S. Vazhkudai, Seung-hwan Lim, Ali R. Butt, Youngjae Kim
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this article, we present TagIt, a scalable data management service framework aimed at scientific datasets, which can be integrated into prevalent distributed file system architectures. A key feature of TagIt is a scalable, distributed metadata indexing framework, which facilitates a flexible tagging capability to support data discovery. Furthermore, the tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. We have integrated TagIt into two popular distributed file systems, i.e., GlusterFS and CephFS. Our evaluation demonstrates that TagIt can expedite data search operation by up to 10× over the extant decoupled approach.

  • SC - TagIt: an integrated indexing and search service for file systems
    Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis, 2017
    Co-Authors: Hyogi Sim, Sudharshan S. Vazhkudai, Seung-hwan Lim, Youngjae Kim, Geoffroy Vallée, Ali R. Butt
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this paper, we present TagIt, a scalable data management service framework aimed at scientific datasets, which is tightly integrated into a shared-nothing distributed file system. A key feature of TagIt is a scalable, distributed metadata indexing framework, using which we implement a flexible tagging capability to support data discovery. The tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. Our evaluation shows that TagIt can expedite data search by up to 10× over the extant decoupled approach.

Ali R. Butt - One of the best experts on this subject based on the ideXlab platform.

  • An Integrated Indexing and Search Service for Distributed File Systems
    IEEE Transactions on Parallel and Distributed Systems, 2020
    Co-Authors: Hyogi Sim, Awais Khan, Sudharshan S. Vazhkudai, Seung-hwan Lim, Ali R. Butt, Youngjae Kim
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this article, we present TagIt, a scalable data management service framework aimed at scientific datasets, which can be integrated into prevalent distributed file system architectures. A key feature of TagIt is a scalable, distributed metadata indexing framework, which facilitates a flexible tagging capability to support data discovery. Furthermore, the tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. We have integrated TagIt into two popular distributed file systems, i.e., GlusterFS and CephFS. Our evaluation demonstrates that TagIt can expedite data search operation by up to 10× over the extant decoupled approach.

  • SC - TagIt: an integrated indexing and search service for file systems
    Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis, 2017
    Co-Authors: Hyogi Sim, Sudharshan S. Vazhkudai, Seung-hwan Lim, Youngjae Kim, Geoffroy Vallée, Ali R. Butt
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this paper, we present TagIt, a scalable data management service framework aimed at scientific datasets, which is tightly integrated into a shared-nothing distributed file system. A key feature of TagIt is a scalable, distributed metadata indexing framework, using which we implement a flexible tagging capability to support data discovery. The tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. Our evaluation shows that TagIt can expedite data search by up to 10× over the extant decoupled approach.

Ajay Mohindra - One of the best experts on this subject based on the ideXlab platform.

  • Server recovery using naturally replicated state: a case study
    International Conference on Distributed Computing Systems, 1995
    Co-Authors: Murthy V Devarakonda, B Kish, Ajay Mohindra
    Abstract:

    This paper describes the design and preliminary measurements of a file server recovery scheme that uses naturally replicated state among clients. This scheme, implemented in the Calypso file system, is truly transparent to the user and avoids the overhead of explicit replication. A three-phase protocol reconstructs the server state either on a backup node (if disks are multi-ported) or on the rebooted server node. Measurements show that the recovery time is about 21 seconds for a busy 10-node cluster. However, the time to rebuild the distributed state is only about 1.5 seconds, and most of the recovery time is spent in replaying the write-ahead log of the underlying file system. Fortunately, the log redo time is bounded by the log size.

  • ICDCS - Server recovery using naturally replicated state: a case study
    Proceedings of 15th International Conference on Distributed Computing Systems, 1
    Co-Authors: Murthy V Devarakonda, B Kish, Ajay Mohindra
    Abstract:

    This paper describes the design and preliminary measurements of a file server recovery scheme that uses naturally replicated state among clients. This scheme, implemented in the Calypso file system, is truly transparent to the user and avoids the overhead of explicit replication. A three-phase protocol reconstructs the server state either on a backup node (if disks are multi-ported) or on the rebooted server node. Measurements show that the recovery time is about 21 seconds for a busy 10-node cluster. However, the time to rebuild the distributed state is only about 1.5 seconds, and most of the recovery time is spent in replaying the write-ahead log of the underlying file system. Fortunately, the log redo time is bounded by the log size.

Hyogi Sim - One of the best experts on this subject based on the ideXlab platform.

  • An Integrated Indexing and Search Service for Distributed File Systems
    IEEE Transactions on Parallel and Distributed Systems, 2020
    Co-Authors: Hyogi Sim, Awais Khan, Sudharshan S. Vazhkudai, Seung-hwan Lim, Ali R. Butt, Youngjae Kim
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this article, we present TagIt, a scalable data management service framework aimed at scientific datasets, which can be integrated into prevalent distributed file system architectures. A key feature of TagIt is a scalable, distributed metadata indexing framework, which facilitates a flexible tagging capability to support data discovery. Furthermore, the tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. We have integrated TagIt into two popular distributed file systems, i.e., GlusterFS and CephFS. Our evaluation demonstrates that TagIt can expedite data search operation by up to 10× over the extant decoupled approach.

  • SC - TagIt: an integrated indexing and search service for file systems
    Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis, 2017
    Co-Authors: Hyogi Sim, Sudharshan S. Vazhkudai, Seung-hwan Lim, Youngjae Kim, Geoffroy Vallée, Ali R. Butt
    Abstract:

    Data services such as search, discovery, and management in scalable distributed environments have traditionally been decoupled from the underlying file systems, and are often deployed using external databases and indexing services. However, modern data production rates, looming data movement costs, and the lack of metadata, entail revisiting the decoupled file system-data services design philosophy. In this paper, we present TagIt, a scalable data management service framework aimed at scientific datasets, which is tightly integrated into a shared-nothing distributed file system. A key feature of TagIt is a scalable, distributed metadata indexing framework, using which we implement a flexible tagging capability to support data discovery. The tags can also be associated with an active operator, for pre-processing, filtering, or automatic metadata extraction, which we seamlessly offload to file servers in a load-aware fashion. Our evaluation shows that TagIt can expedite data search by up to 10× over the extant decoupled approach.

Robert Ross - One of the best experts on this subject based on the ideXlab platform.

  • Optimizing I/O forwarding techniques for extreme-scale event tracing
    Cluster Computing, 2014
    Co-Authors: Thomas Ilsche, Robert Ross, Joseph Schuchart, Jason Cope, Dries Kimpe, Terry Jones, Andreas Knüpfer, Kamil Iskra, Wolfgang E. Nagel, Stephen Poole
    Abstract:

    Programming development tools are a vital component for understanding the behavior of parallel applications. Event tracing is a principal ingredient to these tools, but new and serious challenges place event tracing at risk on extreme-scale machines. As the quantity of captured events increases with concurrency, the additional data can overload the parallel file system and perturb the application being observed. In this work we present a solution for event tracing on extreme-scale machines. We enhance an I/O forwarding software layer to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system. Furthermore, we introduce a sophisticated write buffering capability to limit the impact. To validate the approach, we employ the Vampir tracing toolset using these new capabilities. Our results demonstrate that the approach increases the maximum traced application size by a factor of 5× to more than 200,000 processes.

  • SC - Characterization and modeling of PIDX parallel I/O for performance optimization
    Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis on - SC '13, 2013
    Co-Authors: Sidharth Kumar, Robert Latham, Avishek Saha, Venkatram Vishwanath, Philip Carns, John A. Schmidt, Giorgio Scorzelli, Hemanth Kolla, Ray W. Grout, Robert Ross
    Abstract:

    Parallel I/O library performance can vary greatly in response to user-tunable parameter values such as aggregator count, file count, and aggregation strategy. Unfortunately, manual selection of these values is time consuming and dependent on characteristics of the target machine, the underlying file system, and the dataset itself. Some characteristics, such as the amount of memory per core, can also impose hard constraints on the range of viable parameter values. In this work we address these problems by using machine learning techniques to model the performance of the PIDX parallel I/O library and select appropriate tunable parameter values. We characterize both the network and I/O phases of PIDX on a Cray XE6 as well as an IBM Blue Gene/P system. We use the results of this study to develop a machine learning model for parameter space exploration and performance prediction.

  • HPDC - Enabling event tracing at leadership-class scale through I/O forwarding middleware
    Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing - HPDC '12, 2012
    Co-Authors: Thomas Ilsche, Robert Ross, Joseph Schuchart, Jason Cope, Dries Kimpe, Terry Jones, Andreas Knüpfer, Kamil Iskra, Wolfgang E. Nagel, Stephen W. Poole
    Abstract:

    Event tracing is an important tool for understanding the performance of parallel applications. As concurrency increases in leadership-class computing systems, the quantity of performance log data can overload the parallel file system, perturbing the application being observed. In this work we present a solution for event tracing at leadership scales. We enhance the I/O forwarding system software to aggregate and reorganize log data prior to writing to the storage system, significantly reducing the burden on the underlying file system for this type of traffic. Furthermore, we augment the I/O forwarding system with a write buffering capability to limit the impact of artificial perturbations from log data accesses on traced applications. To validate the approach, we modify the Vampir tracing toolset to take advantage of this new capability and show that the approach increases the maximum traced application size by a factor of 5× to more than 200,000 processes.

  • On the duality of data-intensive file system design: reconciling HDFS and PVFS
    IEEE International Conference on High Performance Computing Data and Analytics, 2011
    Co-Authors: Wittawat Tantisiriroj, Swapnil Patil, Samuel Lang, Garth A Gibson, Robert Ross
    Abstract:

    Data-intensive applications fall into two computing styles: Internet services (cloud computing) or high-performance computing (HPC). In both categories, the underlying file system is a key component for scalable application performance. In this paper, we explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. We integrate PVFS into Hadoop and compare its performance to HDFS using a set of data-intensive computing benchmarks. We study how HDFS-specific optimizations can be matched using PVFS and how consistency, durability, and persistence tradeoffs made by these file systems affect application performance. We show how to embed multiple replicas into a PVFS file, including a mapping with a complete copy local to the writing client, to emulate HDFS's file layout policies. We also highlight implementation issues with HDFS's dependence on disk bandwidth and benefits from pipelined replication.

  • PVM/MPI - Implementing MPI-IO shared file pointers without file system support
    Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005
    Co-Authors: Robert Latham, Robert Ross, Rajeev Thakur, Brian Toonen
    Abstract:

    The ROMIO implementation of the MPI-IO standard provides a portable infrastructure for use on top of any number of different underlying storage targets. These targets vary widely in their capabilities, and in some cases additional effort is needed within ROMIO to support all MPI-IO semantics. The MPI-2 standard defines a class of file access routines that use a shared file pointer. These routines require communication internal to the MPI-IO implementation in order to allow processes to atomically update this shared value. We discuss a technique that leverages MPI-2 one-sided operations and can be used to implement this concept without requiring any features from the underlying file system. We then demonstrate through a simulation that our algorithm adds reasonable overhead for independent accesses and very small overhead for collective accesses.