Scene Description

The experts below are selected from a list of 273 experts worldwide, ranked by the ideXlab platform.

Volker Tresp - One of the best experts on this subject based on the ideXlab platform.

  • Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions
    arXiv: Computation and Language, 2018
    Co-Authors: Stephan Baier, Volker Tresp
    Abstract:

    Structured Scene Descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a semantic and a visual statistical model can improve on the task of mapping images to their associated Scene Description. In this paper we consider Scene Descriptions which are represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship between them (e.g. man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can also generalize to triples, which have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real world dataset, show that the integration of a semantic model using link prediction methods can significantly improve the results for visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.
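
    To make the combined scoring concrete, the sketch below fuses hypothetical detector confidences with a DistMult-style link-prediction score for candidate (subject, predicate, object) triples. The random embeddings, the weighting factor alpha, and all class names are illustrative assumptions, not the trained parameters or exact formulation of the paper.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy vocabularies standing in for detected object classes and predicates.
        objects = ["man", "elephant", "hat"]
        predicates = ["riding", "wearing"]
        dim = 16

        # Random embeddings stand in for parameters a link-prediction model
        # (e.g. DistMult) would learn from scene-graph triples.
        ent_emb = {o: rng.normal(size=dim) for o in objects}
        rel_emb = {p: rng.normal(size=dim) for p in predicates}

        def semantic_score(subj, pred, obj):
            """DistMult-style triple score: tri-linear product <e_s, w_p, e_o>."""
            return float(np.sum(ent_emb[subj] * rel_emb[pred] * ent_emb[obj]))

        def combined_score(det_subj, det_obj, subj, pred, obj, alpha=0.5):
            """Fuse detector confidences with the semantic triple score.

            det_subj / det_obj are the CNN detector's class confidences for the
            two boxes; alpha is a hypothetical weighting of the semantic model.
            """
            visual = np.log(det_subj) + np.log(det_obj)
            return (1 - alpha) * visual + alpha * semantic_score(subj, pred, obj)

        # Rank candidate predicates for a detected (man, elephant) pair.
        ranked = sorted(predicates,
                        key=lambda p: combined_score(0.9, 0.8, "man", p, "elephant"),
                        reverse=True)
        print(ranked)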

  • Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions
    International Semantic Web Conference, 2017
    Co-Authors: Stephan Baier, Volker Tresp
    Abstract:

    Structured Scene Descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a statistical semantic model and a visual model can improve on the task of mapping images to their associated Scene Description. In this paper we consider Scene Descriptions which are represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship between them (e.g. man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can also generalize to triples which have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real world dataset, show that the integration of a statistical semantic model using link prediction methods can significantly improve visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.

Yakoub Bazi - One of the best experts on this subject based on the ideXlab platform.

  • Assisting the Visually Impaired in Multi-object Scene Description Using OWA-Based Fusion of CNN Models
    Arabian Journal for Science and Engineering, 2020
    Co-Authors: Haikel Alhichri, Yakoub Bazi, Naif Alajlan
    Abstract:

    Advances in technology can provide substantial support for visually impaired (VI) persons. In particular, computer vision and machine learning can provide solutions for object detection and recognition. In this work, we propose a multi-label image classification solution for assisting a VI person in recognizing the presence of multiple objects in a Scene. The solution is based on the fusion of two deep CNN models using the induced ordered weighted averaging (OWA) approach. Namely, in this work, we fuse the outputs of two pre-trained CNN models, VGG16 and SqueezeNet. To use the induced OWA approach, we need to estimate a confidence measure in the outputs of the two CNN base models. To this end, we propose the residual error between the predicted output and the true output as a measure of confidence. We estimate this residual error using another dedicated CNN model that is trained on the residual errors computed from the main CNN models. Then, the OWA technique uses these estimated residual errors as confidence measures and fuses the decisions of the two main CNN models. When tested on four image datasets of indoor environments from two separate locations, the proposed method improves the detection accuracy compared to both base CNN models. The results are also significantly better than state-of-the-art methods reported in the literature.
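
    As a rough illustration of the fusion step, the sketch below applies induced ordered weighted averaging to the per-label scores of two models, using estimated residual errors as the inducing (confidence) variable. The scores, residual errors, and OWA weights are made-up numbers, not values from the paper.

        import numpy as np

        def induced_owa_fusion(preds, residual_errors, owa_weights):
            """Fuse per-label scores from several models with induced OWA.

            preds: (n_models, n_labels) predicted label scores.
            residual_errors: (n_models,) estimated residual errors; lower means
                more confident (in the paper these come from a dedicated CNN).
            owa_weights: (n_models,) OWA weights, summing to 1, applied after
                re-ordering the models from most to least confident.
            """
            preds = np.asarray(preds, dtype=float)
            order = np.argsort(residual_errors)      # most confident model first
            ordered = preds[order]                   # reorder rows by confidence
            return np.asarray(owa_weights, dtype=float) @ ordered

        # Two models (e.g. VGG16- and SqueezeNet-based heads) scoring 4 labels.
        vgg_scores = [0.9, 0.2, 0.7, 0.1]
        squeeze_scores = [0.6, 0.4, 0.8, 0.3]
        errors = [0.15, 0.35]            # hypothetical estimated residual errors
        weights = [0.7, 0.3]             # favour the more confident model

        print(induced_owa_fusion([vgg_scores, squeeze_scores], errors, weights))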

  • Real-Time Indoor Scene Description for the Visually Impaired Using Autoencoder Fusion Strategies with Visible Cameras
    Sensors, 2017
    Co-Authors: Salim Malek, Farid Melgani, Mohamed Lamine Mekhalfi, Yakoub Bazi
    Abstract:

    This paper describes three coarse image Description strategies, which are meant to promote a rough perception of surrounding objects for visually impaired individuals, with application to indoor spaces. The described algorithms operate on images (grabbed by the user by means of a chest-mounted camera) and output a list of objects that are likely present in the user's surroundings across the indoor Scene. In this regard, first, different colour-, texture-, and shape-based features are extracted, followed by a feature learning step by means of AutoEncoder (AE) models. Second, the produced features are fused and fed into a multilabel classifier in order to list the potential objects. The conducted experiments show that fusing a set of AE-learned features yields higher classification rates than using the features individually. Furthermore, compared to reference works, our method (i) yields higher classification accuracies and (ii) runs at least four times faster, which enables a potential fully real-time application.
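
    The sketch below mirrors this pipeline at a toy scale: a tiny linear autoencoder learns a compact code per hand-crafted descriptor, the codes are fused by concatenation, and a final layer produces per-object scores. Feature dimensions, code sizes, and the random "classifier" weights are placeholders, not the paper's configuration.

        import numpy as np

        rng = np.random.default_rng(0)

        def train_autoencoder(X, code_dim, epochs=200, lr=0.01):
            """Tiny linear autoencoder trained with plain gradient descent;
            returns the encoder weights so X can be mapped to a compact code."""
            n, d = X.shape
            W_enc = rng.normal(scale=0.1, size=(d, code_dim))
            W_dec = rng.normal(scale=0.1, size=(code_dim, d))
            for _ in range(epochs):
                code = X @ W_enc
                err = code @ W_dec - X               # reconstruction error
                grad_dec = code.T @ err / n
                grad_enc = X.T @ (err @ W_dec.T) / n
                W_dec -= lr * grad_dec
                W_enc -= lr * grad_enc
            return W_enc

        # Hypothetical hand-crafted descriptors for 100 images.
        colour_feats = rng.normal(size=(100, 32))
        texture_feats = rng.normal(size=(100, 48))

        # Learn a compact code per feature type, then fuse by concatenation.
        enc_c = train_autoencoder(colour_feats, code_dim=8)
        enc_t = train_autoencoder(texture_feats, code_dim=8)
        fused = np.hstack([colour_feats @ enc_c, texture_feats @ enc_t])

        # The fused code would feed a multilabel classifier (one binary output
        # per object class); a random logistic layer illustrates the shape.
        W_cls = rng.normal(size=(fused.shape[1], 5))         # 5 object labels
        probs = 1.0 / (1.0 + np.exp(-(fused @ W_cls)))
        print(probs.shape)                                   # (100, 5)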

  • Fast Indoor Scene Description for Blind People with Multiresolution Random Projections
    Journal of Visual Communication and Image Representation, 2017
    Co-Authors: Mohamed Lamine Mekhalfi, Yakoub Bazi, Farid Melgani, Naif Alajlan
    Abstract:

    A multiresolution random projection for image representation is presented. An indoor Scene Description framework for visually impaired people is proposed. Experiments are conducted on four different indoor datasets. The results qualify the framework as a near-real-time blind assistance technology. Object recognition forms a substantial need for blind and visually impaired individuals. This paper proposes a new multiobject recognition framework. It consists of coarsely checking the presence of multiple objects in a portable camera-grabbed image at a considered indoor site. The outcome is a list of objects that likely appear in the indoor Scene. Such a Description is meant to raise the awareness of the blind person in order to better sense his/her surroundings. The method relies on a library containing (i) a set of images represented by means of the Random Projections (RP) technique and (ii) their respective lists of objects, both prepared offline. Thus, given an online shot image, its RP representation is generated and matched to the RP patterns of the library images, and it inherits the objects of the closest image from the library. Extensive experiments returned promising recognition accuracies and processing times that meet real-time standards.
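
    A minimal sketch of the matching idea follows: each image gets a multiresolution signature by average-pooling it at several block sizes and projecting each pooled view with a fixed random matrix, and a query image inherits the object list of the nearest library signature. The block sizes, dimensions, and random images are illustrative assumptions, not the paper's settings.

        import numpy as np

        rng = np.random.default_rng(0)

        def rp_signature(image, block_sizes=(8, 16, 32), out_dim=64):
            """Multiresolution random-projection signature of a grayscale image."""
            parts = []
            h, w = image.shape
            for s in block_sizes:
                # Average-pool with block size s to get a coarser view.
                pooled = image[:(h // s) * s, :(w // s) * s]
                pooled = pooled.reshape(h // s, s, w // s, s).mean(axis=(1, 3))
                flat = pooled.ravel()
                # Fixed random projection per resolution (seeded for reuse).
                proj = np.random.default_rng(s).normal(size=(flat.size, out_dim))
                parts.append(flat @ proj / np.sqrt(flat.size))
            return np.concatenate(parts)

        # Offline library: images (all the same size here) and their object lists.
        library_images = [rng.random((128, 128)) for _ in range(3)]
        library_objects = [["chair", "desk"], ["door"], ["screen", "board"]]
        library_sigs = np.stack([rp_signature(im) for im in library_images])

        # Online: the query inherits the object list of the closest library image.
        query = rng.random((128, 128))
        q_sig = rp_signature(query)
        closest = int(np.argmin(np.linalg.norm(library_sigs - q_sig, axis=1)))
        print(library_objects[closest])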

  • Multiple Object Scene Description for the Visually Impaired Using Pre-Trained Convolutional Neural Networks
    International Conference on Image Analysis and Recognition, 2016
    Co-Authors: Haikel Alhichri, Bilel Bin Jdira, Yakoub Bazi, Naif Alajlan
    Abstract:

    This paper introduces a new method for multiple object Scene Description as part of a system to guide the visually impaired in an indoor environment. Here we are interested in a coarse Scene Description, where only the presence of certain objects is indicated, regardless of their positions in the Scene. The proposed method is based on extracting powerful features using pre-trained convolutional neural networks (CNNs) and then training a neural network regressor to predict the content of any unknown Scene from its CNN features. We have found the CNN features to be highly descriptive, even though the network is trained on auxiliary data from a completely different domain.
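
    Below is a compact sketch of this two-stage idea, assuming the CNN features have already been extracted (random vectors stand in for them) and using scikit-learn's MLPRegressor for the regression stage; the layer size and the six object classes are illustrative, not the paper's setup.

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)

        # Stand-ins for pre-trained CNN features (e.g. penultimate-layer
        # activations) and multi-label targets: 1 if an object class is
        # present in the indoor scene, 0 otherwise.
        X = rng.normal(size=(200, 128))
        Y = (rng.random((200, 6)) < 0.3).astype(float)     # 6 hypothetical classes

        # Neural-network regression on the label vector.
        reg = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
        reg.fit(X, Y)

        # Thresholding the regressed scores yields the coarse object list.
        present = reg.predict(X[:1]) > 0.5
        print(present.astype(int))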

Roberto Manduchi - One of the best experts on this subject based on the ideXlab platform.

  • Semantic Interior Mapology: A Toolbox for Indoor Scene Description from Architectural Floor Plans
    arXiv: Human-Computer Interaction, 2019
    Co-Authors: Viet Trinh, Roberto Manduchi
    Abstract:

    We introduce the Semantic Interior Mapology (SIM) toolbox for the conversion of a floor plan and its room contents (such as furniture) to a vectorized form. The toolbox is composed of the Map Conversion toolkit and the Map Population toolkit. The Map Conversion toolkit allows one to quickly trace the layout of a floor plan and to generate a GeoJSON file that can be rendered in 3D using web applications such as Mapbox. The Map Population toolkit takes the 3D scan of a room in the building (acquired from an RGB-D camera) and, through a semi-automatic process, places individual objects of interest, with correct dimensions and positions, in the GeoJSON representation of the building. SIM is easy to use and produces accurate results even in the case of complex building layouts.
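
    For a sense of the output format, the snippet below builds a small GeoJSON FeatureCollection of the kind that can be rendered in tools such as Mapbox: one room polygon plus one piece of furniture. The coordinates, property names, and the "height" field are illustrative assumptions, not the SIM toolbox's actual schema.

        import json

        floor_plan = {
            "type": "FeatureCollection",
            "features": [
                {
                    "type": "Feature",
                    "properties": {"name": "Room 101", "category": "room", "height": 3.0},
                    "geometry": {
                        "type": "Polygon",
                        "coordinates": [[[0, 0], [6, 0], [6, 4], [0, 4], [0, 0]]],
                    },
                },
                {
                    "type": "Feature",
                    "properties": {"name": "desk", "category": "furniture", "height": 0.75},
                    "geometry": {
                        "type": "Polygon",
                        "coordinates": [[[1, 1], [2.2, 1], [2.2, 1.6], [1, 1.6], [1, 1]]],
                    },
                },
            ],
        }

        # Write the vectorized plan so a web viewer can load and extrude it in 3D.
        with open("floor_plan.geojson", "w") as f:
            json.dump(floor_plan, f, indent=2)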

  • Semantic Interior Mapology: A Toolbox for Indoor Scene Description from Architectural Floor Plans
    International Conference on 3D Web Technology, 2019
    Co-Authors: Viet Trinh, Roberto Manduchi
    Abstract:

    A computer-implemented method or system including a Map Conversion toolkit and a Map Population toolkit. The Map Conversion toolkit allows one to quickly trace the layout of a floor plan, generating a file (e.g., a GeoJSON file) that can be rendered in two dimensions (2D) or three dimensions (3D) using web tools such as Mapbox. The Map Population toolkit takes the scan (e.g., a 3D scan) of a room in the building (taken from an RGB-D camera) and, through a semi-automatic process, generates individual objects, which are correctly dimensioned and positioned in the (e.g., GeoJSON) representation of the building. In another example, a computer-implemented method for diagramming a space comprises obtaining a layout of the space and annotating or decorating the layout with meaningful labels that are translatable to glanceable visual signals or audio signals.

Stephan Baier - One of the best experts on this subject based on the ideXlab platform.

  • Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions
    arXiv: Computation and Language, 2018
    Co-Authors: Stephan Baier, Volker Tresp
    Abstract:

    Structured Scene Descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a semantic and a visual statistical model can improve on the task of mapping images to their associated Scene Description. In this paper we consider Scene Descriptions which are represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship between them (e.g. man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can also generalize to triples, which have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real world dataset, show that the integration of a semantic model using link prediction methods can significantly improve the results for visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.

  • Improving Visual Relationship Detection Using Semantic Modeling of Scene Descriptions
    International Semantic Web Conference, 2017
    Co-Authors: Stephan Baier, Volker Tresp
    Abstract:

    Structured Scene Descriptions of images are useful for the automatic processing and querying of large image databases. We show how the combination of a statistical semantic model and a visual model can improve on the task of mapping images to their associated Scene Description. In this paper we consider Scene Descriptions which are represented as a set of triples (subject, predicate, object), where each triple consists of a pair of visual objects, which appear in the image, and the relationship between them (e.g. man-riding-elephant, man-wearing-hat). We combine a standard visual model for object detection, based on convolutional neural networks, with a latent variable model for link prediction. We apply multiple state-of-the-art link prediction methods and compare their capability for visual relationship detection. One of the main advantages of link prediction methods is that they can also generalize to triples which have never been observed in the training data. Our experimental results on the recently published Stanford Visual Relationship dataset, a challenging real world dataset, show that the integration of a statistical semantic model using link prediction methods can significantly improve visual relationship detection. Our combined approach achieves superior performance compared to the state-of-the-art method from the Stanford computer vision group.

Jyri Huopaniemi - One of the best experts on this subject based on the ideXlab platform.

  • Advanced AudioBIFS: Virtual Acoustics Modeling in MPEG-4 Scene Description
    IEEE Transactions on Multimedia, 2004
    Co-Authors: Riitta Vaananen, Jyri Huopaniemi
    Abstract:

    We present the virtual acoustics modeling framework that is part of the MPEG-4 standard. A Scene Description language called the Binary Format for Scenes (BIFS) is defined within MPEG-4 for making multimedia presentations that include various types of audio and visual data. BIFS also provides the means for creating three-dimensional (3-D) virtual worlds or Scenes, where visual and sound objects can be positioned and given temporal behavior. Local interaction between the user and the Scene can be added to MPEG-4 applications. Typically, the user can navigate in a 3-D Scene so that it is viewed from different positions. If there are sound source objects in a Scene, the sounds may be spatialized so that they are heard coming from the positions defined for them. A subset of BIFS, called Advanced AudioBIFS, aims at enhanced modeling of 3-D sound environments. In this framework, sounds can be given positions, and the virtual environment where they appear can be associated with acoustic properties that allow modeling of phenomena such as air absorption, the Doppler effect, sound reflections, and reverberation. These features can be used for adding room acoustic effects to sound in the MPEG-4 terminal and for creating immersive 3-D audiovisual Scenes.
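
    As a small numerical illustration of two of the listed phenomena, the sketch below computes inverse-distance attenuation and the Doppler shift heard by a static listener; these are the textbook formulas, not the MPEG-4 normative processing.

        import math

        def distance_attenuation_db(distance_m, ref_distance_m=1.0):
            """Inverse-distance (1/r) attenuation relative to a reference distance."""
            return -20.0 * math.log10(max(distance_m, ref_distance_m) / ref_distance_m)

        def doppler_shifted_hz(f_source_hz, approach_speed_ms, speed_of_sound_ms=343.0):
            """Frequency heard by a static listener as a source moves toward it."""
            return f_source_hz * speed_of_sound_ms / (speed_of_sound_ms - approach_speed_ms)

        # A 440 Hz source 8 m away, approaching the listener at 10 m/s.
        print(round(distance_attenuation_db(8.0), 1), "dB")    # about -18.1 dB
        print(round(doppler_shifted_hz(440.0, 10.0), 1), "Hz") # about 453.2 Hz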