Yahoo (Spain)
Company, Madrid, Spain
Research output, citation impact, and the most-cited recent papers from Yahoo (Spain). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Yahoo (Spain)
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
We analyze the information credibility of news propagated through Twitter, a popular microblogging service. Previous research has shown that most of the messages posted on Twitter are truthful, but the service is also used to spread misinformation and false rumors, often unintentionally.
Mussels attach to solid surfaces in the sea. Their adhesion must be rapid, strong, and tough, or else they will be dislodged and dashed to pieces by the next incoming wave. Given the dearth of synthetic adhesives for wet polar surfaces, much effort has been directed to characterizing and mimicking essential features of the adhesive chemistry practiced by mussels. Studies of these organisms have uncovered important adaptive strategies that help to circumvent the high dielectric and solvation properties of water that typically frustrate adhesion. In a chemical vein, the adhesive proteins of mussels are heavily decorated with Dopa, a catecholic functionality. Various synthetic polymers have been functionalized with catechols to provide diverse adhesive, sealant, coating, and anchoring properties, particularly for critical biomedical applications.
In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per-client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high-quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, which can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Recently, there has been tremendous interest in the phenomenon of influence propagation in social networks. The studies in this area assume they have as input to their problems a social graph with edges labeled with probabilities of influence between users. However, the question of where these probabilities come from or how they can be computed from real social network data has been largely ignored until now. Thus it is interesting to ask whether from a social graph and a log of actions by its users, one can build models of influence. This is the main problem attacked in this paper. In addition to proposing models and algorithms for learning the model parameters and for testing the learned models to make predictions, we also develop techniques for predicting the time by which a user may be expected to perform an action. We validate our ideas and techniques using the Flickr data set consisting of a social graph with 1.3M nodes, 40M edges, and an action log consisting of 35M tuples referring to 300K distinct actions. Beyond showing that there is genuine influence happening in a real social network, we show that our techniques have excellent prediction performance.
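As a rough illustration of how influence probabilities might be derived from a social graph plus an action log, the sketch below uses a plain frequency estimate: the share of u's actions that a neighbour v performed later. This estimator, the function name `influence_probabilities`, and the toy data are assumptions made for illustration; the abstract does not specify the paper's models. The sketch also assumes each user performs a given action at most once.

```python
from collections import defaultdict

def influence_probabilities(edges, log):
    """Estimate p(a -> b) as (#actions b did after neighbour a) / (#actions a did).

    edges: iterable of undirected (u, v) pairs from the social graph.
    log:   iterable of (user, action, time) tuples, one per user/action pair.
    """
    when = defaultdict(dict)   # action -> {user: time performed}
    count = defaultdict(int)   # user -> number of actions performed
    for user, action, t in log:
        when[action][user] = t
        count[user] += 1

    follows = defaultdict(int)  # (a, b) -> actions that b performed after a
    for times in when.values():
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                if a in times and b in times and times[a] < times[b]:
                    follows[(a, b)] += 1

    return {pair: n / count[pair[0]] for pair, n in follows.items()}

# Hypothetical toy data, not taken from the Flickr data set.
edges = [("ann", "bob")]
log = [("ann", "a1", 1), ("bob", "a1", 2), ("ann", "a2", 3),
       ("bob", "a3", 4), ("ann", "a3", 5), ("ann", "a4", 6)]
probs = influence_probabilities(edges, log)
```

Here bob repeats one of ann's four actions after her, and ann repeats one of bob's two actions after him, so the estimate is asymmetric even though the friendship edge is not.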
In this article we explore the behavior of Twitter users under an emergency situation. In particular, we analyze the activity related to the 2010 earthquake in Chile and characterize Twitter in the hours and days following this disaster. Furthermore, we perform a preliminary study of certain social phenomena, such as the dissemination of false rumors and confirmed news. We analyze how this information propagated through the Twitter network, with the purpose of assessing the reliability of Twitter as an information source under extreme circumstances. Our analysis shows that the propagation of tweets that correspond to rumors differs from tweets that spread news because rumors tend to be questioned more than news by the Twitter community. This result shows that it is possible to detect rumors by using aggregate analysis on tweets.
Online photo services such as Flickr and Zooomr allow users to share their photos with family, friends, and the online community at large. An important facet of these services is that users manually annotate their photos using so-called tags, which describe the contents of the photo or provide additional contextual and semantical information. In this paper we investigate how we can assist users in the tagging phase. The contribution of our research is twofold. We analyse a representative snapshot of Flickr and present the results by means of a tag characterisation focussing on how users tag photos and what information is contained in the tagging. Based on this analysis, we present and evaluate tag recommendation strategies to support the user in the photo annotation task by recommending a set of tags that can be added to the photo. The results of the empirical evaluation show that we can effectively recommend relevant tags for a variety of photos with different levels of exhaustiveness of original tagging.
We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute can be viewed as a clustering of the input rows where rows are grouped together if they take the same value on that attribute. Clustering aggregation can also be used as a metaclustering method to improve the robustness of clustering by combining the output of multiple algorithms. Furthermore, the problem formulation does not require a priori information about the number of clusters; it is naturally determined by the optimization function. In this article, we give a formal statement of the clustering aggregation problem, and we propose a number of algorithms. Our algorithms make use of the connection between clustering aggregation and the problem of correlation clustering. Although the problems we consider are NP-hard, for several of our methods, we provide theoretical guarantees on the quality of the solutions. Our work provides the best deterministic approximation algorithm for the variation of the correlation clustering problem we consider. We also show how sampling can be used to scale the algorithms for large datasets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.
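To make "agrees as much as possible" concrete, the sketch below assumes agreement is measured by counting object pairs that one clustering puts together and the other separates. It then applies a simple baseline: return whichever input clustering has the lowest total disagreement with all inputs. The pair-counting measure, the function names, and the toy table are assumptions for illustration; the abstract does not spell out the paper's formal definition or its algorithms.

```python
from itertools import combinations

def disagreement(c1, c2):
    """Count object pairs grouped together by one clustering but split by the other."""
    return sum(
        (c1[i] == c1[j]) != (c2[i] == c2[j])
        for i, j in combinations(range(len(c1)), 2)
    )

def best_input_clustering(clusterings):
    """Baseline: pick the input clustering with the smallest total disagreement."""
    return min(
        clusterings,
        key=lambda c: sum(disagreement(c, other) for other in clusterings),
    )

# Hypothetical categorical table with six rows; each attribute is one clustering.
color = ["red", "red", "blue", "blue", "blue", "red"]
size = ["S", "S", "L", "L", "S", "S"]
shape = ["o", "o", "x", "x", "x", "x"]
clusterings = [color, size, shape]
consensus = best_input_clustering(clusterings)
```

On this toy table the `color` attribute is the closest to the other two, so it is returned as the consensus. A baseline like this can only return one of the inputs, whereas an aggregation algorithm may build a new clustering that matches none of them exactly.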
BACKGROUND: A primary concern after posterior lumbar spine arthrodesis is the potential for adjacent segment degeneration cephalad or caudad to the fusion segment. There is controversy regarding the subsequent degeneration of adjacent segments, and we are aware of no long-term studies that have analyzed both cephalad and caudad degeneration following posterior arthrodesis. A retrospective investigation was performed to determine the rates of degeneration and survival of the motion segments adjacent to the site of a posterior lumbar fusion. METHODS: Two hundred and fifteen patients who had undergone posterior lumbar arthrodesis were included in this study. The study group included 126 female patients and eighty-nine male patients. The average duration of follow-up was 6.7 years. Radiographs were analyzed with regard to arthritic degeneration at the adjacent levels both preoperatively and at the time of the last follow-up visit. Disc spaces were graded on a 4-point arthritic degeneration scale. Correlation analysis was used to determine the contribution of independent variables to the rate of degeneration. Survivorship analysis was performed to describe the degeneration of the adjacent motion segments. RESULTS: Fifty-nine (27.4%) of the 215 patients had evidence of degeneration at the adjacent levels and elected to have an additional decompression (fifteen patients) or arthrodesis (forty-four patients). Kaplan-Meier analysis predicted a disease-free survival rate of 83.5% (95% confidence interval, 77.5% to 89.5%) at five years and of 63.9% (95% confidence interval, 54.0% to 73.8%) at ten years after the index operation. Although there was a trend toward progression of the arthritic grade at the adjacent disc levels, there was no significant correlation, with the numbers available, between the preoperative arthritic grade and the need for additional surgery. 
CONCLUSIONS: The rate of symptomatic degeneration at an adjacent segment warranting either decompression or arthrodesis was predicted to be 16.5% at five years and 36.1% at ten years. There appeared to be no correlation with the length of fusion or the preoperative arthritic degeneration of the adjacent segment.
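The ten-year and five-year rates quoted in the conclusions are the complements of the Kaplan-Meier survival estimates in the results (100% - 83.5% = 16.5% and 100% - 63.9% = 36.1%). For readers unfamiliar with the estimator, the following is a minimal sketch of the standard product-limit calculation on made-up follow-up data. The function name and the sample values are hypothetical and are not the study's data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient (e.g. in years).
    events: True if the event was observed at that time, False if censored.
    Returns a list of (time, survival probability) at each observed event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        observed = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if observed:
            # Patients censored at t still count as at risk at t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical cohort of six patients; False marks a censored follow-up.
times = [1, 2, 2, 3, 4, 5]
events = [True, False, True, True, False, True]
curve = kaplan_meier(times, events)

# Complement of the five-year survival reported in the abstract.
five_year_rate = round(1 - 0.835, 3)
```

Censored patients reduce the number at risk without lowering the survival estimate, which is why the method suits follow-up studies where patients are observed for different lengths of time.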
Big Data is a new term used to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data. The Big Data challenge is becoming one of the most exciting opportunities for the years to come. In this issue, we present a broad overview of the topic, its current status, controversy, and a forecast to the future. We introduce four articles, written by influential scientists in the field, covering the most interesting and state-of-the-art topics on Big Data mining.
Today’s networks typically carry or deploy dozens of protocols and mechanisms simultaneously such as MPLS, NAT, ACLs and route redistribution. Even when individual protocols function correctly, failures can arise from the complex interactions of their aggregate, requiring network administrators to be masters of detail. Our goal is to automatically find an important class of failures, regardless of the protocols running, for both operational and experimental networks. To this end we developed a general and protocol-agnostic framework, called Header Space Analysis (HSA). Our formalism allows us to statically check network specifications and configurations to identify an important class of failures such as Reachability Failures, Forwarding Loops and Traffic Isolation and Leakage problems. In HSA, protocol header fields are not first class entities; instead we look at the entire packet header as a concatenation of bits without any associated meaning. Each packet is a point in the {0, 1}^L space where L is the maximum length of a packet header, and networking boxes transform packets from one point in the space to another point or set of points (multicast). We created a library of tools, called Hassel, to implement our framework, and used it to analyze a variety of networks and protocols. Hassel was used to analyze the Stanford University backbone network, and found all the forwarding loops in less than 10 minutes, and verified reachability constraints between two subnets in 13 seconds. It also found a large and complex loop in an experimental loose source routing protocol in 4 minutes.
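The geometric view described above, with headers as points in {0, 1}^L and boxes as transformations on them, can be sketched with wildcard strings. In this sketch, regions of header space are written over the alphabet '0', '1', 'x' (where 'x' matches either bit), and a box rule is a match pattern plus a rewrite mask. The string encoding and the names `intersect` and `apply_rule` are assumptions for illustration and are not the Hassel library's API.

```python
def intersect(a, b):
    """Intersect two equal-length wildcard headers; return None if they share no packet."""
    out = []
    for p, q in zip(a, b):
        if p == "x":
            out.append(q)
        elif q == "x" or p == q:
            out.append(p)
        else:
            return None  # one side requires 0, the other 1
    return "".join(out)

def apply_rule(header, match, rewrite):
    """Apply one hypothetical box rule to a region of header space.

    Keeps the part of `header` that satisfies `match`, then overwrites every
    bit position where `rewrite` is '0' or '1' ('x' leaves the bit unchanged).
    """
    hit = intersect(header, match)
    if hit is None:
        return None
    return "".join(h if r == "x" else r for h, r in zip(hit, rewrite))

# All 4-bit headers enter a box that matches 10xx and sets the last two bits to 11.
out = apply_rule("xxxx", "10xx", "xx11")
```

Chaining such rule applications along a path and intersecting the result with the headers a destination accepts gives the flavour of a reachability check, and a region that returns to a port it already visited hints at a forwarding loop.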
Pyogenic granuloma is one of the inflammatory hyperplasias seen in the oral cavity. This term is a misnomer because the lesion is unrelated to infection and in reality arises in response to various stimuli such as low-grade local irritation, traumatic injury or hormonal factors. It predominantly occurs in the second decade of life in young females, possibly because of the vascular effects of female hormones. Clinically, oral pyogenic granuloma is a smooth or lobulated exophytic lesion manifesting as small, red erythematous papules on a pedunculated or sometimes sessile base, which is usually hemorrhagic. The surface ranges from pink to red to purple, depending on the age of the lesion. Although excisional surgery is the treatment of choice for it, some other treatment protocols such as the use of Nd:YAG laser, flash lamp pulsed dye laser, cryosurgery, intralesional injection of ethanol or corticosteroid and sodium tetradecyl sulfate sclerotherapy have been proposed. Because of the high frequency of pyogenic granuloma in the oral cavity, especially during pregnancy, and necessity for proper diagnosis and treatment, a complete review of published information and investigations about this lesion, in addition to knowledge about new approaches for its treatment is presented.
The consumption of figs (the fruit of Ficus spp.: Moraceae) by vertebrates is reviewed using data from the literature, unpublished accounts and new field data from Borneo and Hong Kong. Records of frugivory from over 75 countries are presented for 260 Ficus species (approximately 30% of described species). Explanations are presented for geographical and taxonomic gaps in the otherwise extensive literature. In addition to a small number of reptiles and fishes, 1274 bird and mammal species in 523 genera and 92 families are known to eat figs. In terms of the number of species and genera of fig-eaters and the number of fig species eaten we identify the avian families interacting most with Ficus to be Columbidae, Psittacidae, Pycnonotidae, Bucerotidae, Sturnidae and Lybiidae. Among mammals, the major fig-eating families are Pteropodidae, Cercopithecidae, Sciuridae, Phyllostomidae and Cebidae. We assess the role these and other frugivores play in Ficus seed dispersal and identify fig-specialists. In most, but not all, cases fig specialists provide effective seed dispersal services to the Ficus species on which they feed. The diversity of fig-eaters is explained with respect to fig design and nutrient content, phenology of fig ripening and the diversity of fig presentation. Whilst at a gross level there exists considerable overlap between birds, arboreal mammals and fruit bats with regard to the fig species they consume, closer analysis, based on evidence from across the tropics, suggests that discrete guilds of Ficus species differentially attract subsets of sympatric frugivore communities. This dispersal guild structure is determined by interspecific differences in fig design and presentation. Throughout our examination of the fig-frugivore interaction we consider phylogenetic factors and make comparisons between large-scale biogeographical regions. Our dataset supports previous claims that Ficus is the most important plant genus for tropical frugivores. 
We explore the concept of figs as keystone resources and suggest criteria for future investigations of their dietary importance. Finally, fully referenced lists of frugivores recorded at each Ficus species and of Ficus species in the diet of each frugivore are presented as online appendices. In situations where ecological information is incomplete or its retrieval is impractical, this valuable resource will assist conservationists in evaluating the role of figs or their frugivores in tropical forest sites.
The Conference on Computational Natural Language Learning is accompanied every year by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2008 the shared task was dedicated to the joint parsing of syntactic and semantic dependencies. This shared task not only unifies the shared tasks of the previous four years under a unique dependency-based formalism, but also extends them significantly: this year's syntactic dependencies include more information such as named-entity boundaries; the semantic dependencies model roles of both verbal and nominal predicates. In this paper, we define the shared task and describe how the data sets were created. Furthermore, we report and analyze the results and describe the approaches of the participating systems.
A lot of research in graph mining has been devoted to the discovery of communities. Most of the work has focused on the scenario where communities need to be discovered with reference only to the input graph. However, for many interesting applications one is interested in finding the community formed by a given set of nodes. In this paper we study a query-dependent variant of the community-detection problem, which we call the community-search problem: given a graph G, and a set of query nodes in the graph, we seek to find a subgraph of G that contains the query nodes and is densely connected.
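One way to picture the community-search problem is a greedy peeling procedure, sketched below. It assumes "densely connected" means a high minimum degree, which the abstract as listed here does not state. The sketch repeatedly removes the lowest-degree node while the query nodes stay connected and keeps the best subgraph seen. The function names and the toy graph are hypothetical.

```python
def component(adj, start):
    """Return the set of nodes reachable from start in adjacency dict adj."""
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen

def community_search(graph, query):
    """Greedy peeling for a connected subgraph containing `query` with high minimum degree.

    graph: dict mapping each node to an iterable of its neighbours (undirected).
    Returns (node set, minimum degree inside that node set).
    """
    adj = {u: set(vs) for u, vs in graph.items()}
    query = set(query)
    best, best_score = None, -1
    while True:
        comp = component(adj, next(iter(query)))
        if not query <= comp:
            break  # the last removal disconnected the query nodes
        adj = {u: adj[u] & comp for u in comp}  # drop detached parts
        score = min(len(adj[u]) for u in comp)
        if score > best_score:
            best, best_score = set(comp), score
        victim = min(sorted(comp), key=lambda u: len(adj[u]))
        if victim in query:
            break  # cannot peel further without losing a query node
        for nb in adj.pop(victim):
            adj[nb].discard(victim)
    return best, best_score

# Hypothetical graph: a 4-clique {1, 2, 3, 4} with a tail 4-5-6 and a pendant node 7.
graph = {1: [2, 3, 4, 7], 2: [1, 3, 4], 3: [1, 2, 4],
         4: [1, 2, 3, 5], 5: [4, 6], 6: [5], 7: [1]}
community, min_degree = community_search(graph, {1, 2})
```

On the toy graph the loosely attached nodes 5, 6 and 7 are peeled away, leaving the clique around the two query nodes.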
Literature databases (i.e., PubMed, Scopus, and Web of Science) differ in terms of their coverage, focus, and the tools they provide. PubMed focuses mainly on life sciences and biomedical disciplines, whereas Scopus and Web of Science are multidisciplinary. The protocol described in the current study was used to search for publications from Jordanian authors in the years 2013-2017. In this protocol, how to use each database to conduct this type of search is explained in detail. A Scopus search resulted in the highest number of documents (11,444 documents), followed by a Web of Science search (10,943 documents). PubMed resulted in a smaller number of documents due to its narrower scope and coverage (4,363 documents). The results also show a yearly trend in: (1) the number of publications, (2) the disciplines that have the most publications, (3) the countries of collaboration, and (4) the number of open access publications. In contrast, PubMed has a sophisticated keyword optimization service (i.e., Medical Subject Heading, or MeSH), while both Scopus and Web of Science provide search analysis tools that can produce representative figures. Finally, the features of each database are explained in detail and several indices that can be extracted using the search results are provided. This study provides a base for using literature databases for bibliometric analysis.
We describe an approach for extracting semantics of tags, unstructured text-labels assigned to resources on the Web, based on each tag's usage patterns. In particular, we focus on the problem of extracting place and event semantics for tags that are assigned to photos on Flickr, a popular photo sharing website that supports time and location (latitude/longitude) metadata. We analyze two methods inspired by well-known burst-analysis techniques and one novel method: Scale-structure Identification. We evaluate the methods on a subset of Flickr data, and show that our Scale-structure Identification method outperforms the existing techniques. The approach and methods described in this work can be used in other domains such as geo-annotated web pages, where text terms can be extracted and associated with usage patterns.
Users may strive to formulate an adequate textual query for their information need. Search engines assist the users by presenting query suggestions. To preserve the original search intent, suggestions should be context-aware and account for the previous queries issued by the user. Achieving context awareness is challenging due to data sparsity. We present a novel hierarchical recurrent encoder-decoder architecture that makes it possible to account for sequences of previous queries of arbitrary lengths. As a result, our suggestions are sensitive to the order of queries in the context while avoiding data sparsity. Additionally, our model can suggest for rare, or long-tail, queries. The produced suggestions are synthetic and are sampled one word at a time, using computationally cheap decoding techniques. This is in contrast to current synthetic suggestion models relying upon machine learning pipelines and hand-engineered feature sets. Results show that our model outperforms existing context-aware approaches in a next query prediction setting. In addition to query suggestion, our architecture is general enough to be used in a variety of other applications.
The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method which models cross-modal correlations, semantic matching (SM), a supervised technique that relies on semantic representation, and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although evidence in favor of the abstraction hypothesis is stronger than that for correlation.