Yahoo (United Kingdom)

Company · London, United Kingdom

Research output, citation impact, and the most-cited recent papers from Yahoo (United Kingdom). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works

25.6K

Citations

578.0K

h-index

240

i10-index

10.3K

Also known as

Yahoo (United Kingdom)

Top-cited papers from Yahoo (United Kingdom)

The Hadoop Distributed File System

Konstantin V. Shvachko, Hairong Kuang, Sanjay Radia, Robert J. Chansler

2010 · 4.8K citations · doi:10.1109/msst.2010.5496972

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Entropy and diversity

Lou Jost

2006 · Oikos · 4.7K citations · doi:10.1111/j.2006.0030-1299.14714.x

Entropies such as the Shannon–Wiener and Gini–Simpson indices are not themselves diversities. Conversion of these to effective number of species is the key to a unified and intuitive interpretation of diversity. Effective numbers of species derived from standard diversity indices share a common set of intuitive mathematical properties and behave as one would expect of a diversity, while raw indices do not. Contrary to Keylock, the lack of concavity of effective numbers of species is irrelevant as long as they are used as transformations of concave alpha, beta, and gamma entropies. The practical importance of this transformation is demonstrated by applying it to a popular community similarity measure based on raw diversity indices or entropies. The standard similarity measure based on untransformed indices is shown to give misleading results, but transforming the indices or entropies to effective numbers of species produces a stable, easily interpreted, sensitive general similarity measure. General overlap measures derived from this transformed similarity measure yield the Jaccard index, Sørensen index, Horn index of overlap, and the Morisita–Horn index as special cases.
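The conversion this abstract describes can be illustrated with a minimal sketch. It assumes the commonly used conversions: the exponential of Shannon entropy (natural log) and 1 / (1 - Gini–Simpson index). The function names are hypothetical and do not come from the paper.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (natural log) of a list of species proportions."""
    return -sum(x * math.log(x) for x in p if x > 0)

def gini_simpson(p):
    """Gini-Simpson index: probability that two random draws differ in species."""
    return 1.0 - sum(x * x for x in p)

def effective_number_shannon(p):
    """Effective number of species implied by the Shannon entropy."""
    return math.exp(shannon_entropy(p))

def effective_number_gini_simpson(p):
    """Effective number of species implied by the Gini-Simpson index."""
    return 1.0 / (1.0 - gini_simpson(p))

# Two perfectly even communities with 8 and 16 species.
even_8 = [1 / 8] * 8
even_16 = [1 / 16] * 16

for name, community in (("8 species", even_8), ("16 species", even_16)):
    print(name,
          "raw Gini-Simpson:", round(gini_simpson(community), 4),
          "effective number:", round(effective_number_gini_simpson(community), 4))
```

In this toy case the raw Gini–Simpson index moves only from 0.875 to 0.9375 when the species count doubles, while the effective number doubles from 8 to 16, which is the intuitive behaviour the abstract attributes to effective numbers.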

Collaborative Filtering for Implicit Feedback Datasets

Yifan Hu, Yehuda Koren, Chris Volinsky

2008 · 3.2K citations · doi:10.1109/icdm.2008.22

A common task of recommender systems is to improve customer experience through personalized recommendations based on prior implicit feedback. These systems passively track different sorts of user behavior, such as purchase history, watching habits and browsing activity, in order to model user preferences. Unlike the much more extensively researched explicit feedback, we do not have any direct input from the users regarding their preferences. In particular, we lack substantial evidence on which products consumers dislike. In this work we identify unique properties of implicit feedback datasets. We propose treating the data as indication of positive and negative preference associated with vastly varying confidence levels. This leads to a factor model which is especially tailored for implicit feedback recommenders. We also suggest a scalable optimization procedure, which scales linearly with the data size. The algorithm is used successfully within a recommender system for television shows. It compares favorably with well tuned implementations of other known methods. In addition, we offer a novel way to give explanations to recommendations given by this factor model.
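The idea of preferences paired with varying confidence levels can be sketched as follows. The abstract does not give formulas, so this sketch assumes a binary preference (1 if any interaction was observed, else 0) and a confidence that grows linearly with the raw observation, 1 + alpha * r. The function names, the default alpha, and the toy data are illustrative assumptions, and regularization is omitted.

```python
def preference_and_confidence(r, alpha=40.0):
    """Map a raw implicit observation r (e.g. a watch count) to (preference, confidence).

    Assumed form: preference is binary, confidence is 1 + alpha * r.
    alpha is a tunable constant; the default here is only illustrative.
    """
    preference = 1.0 if r > 0 else 0.0
    confidence = 1.0 + alpha * r
    return preference, confidence

def weighted_squared_error(user_factors, item_factors, observations, alpha=40.0):
    """Confidence-weighted squared error of a factor model (regularization omitted).

    observations maps (user, item) -> raw value and should include zero entries,
    since unobserved pairs count as low-confidence negative preference.
    """
    total = 0.0
    for (user, item), r in observations.items():
        preference, confidence = preference_and_confidence(r, alpha)
        prediction = sum(a * b for a, b in zip(user_factors[user], item_factors[item]))
        total += confidence * (preference - prediction) ** 2
    return total

# Hypothetical toy data: one user, two shows, one watched twice and one never watched.
user_factors = {"u": [1.0, 0.0]}
item_factors = {"watched": [1.0, 0.0], "unwatched": [0.0, 1.0]}
observations = {("u", "watched"): 2, ("u", "unwatched"): 0}
print(weighted_squared_error(user_factors, item_factors, observations))
```

Under these assumptions a mistake on a frequently observed item costs far more than a mistake on an unobserved one, which is how the model can treat missing data as weak negative evidence rather than ignoring it.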

GST and its relatives do not measure differentiation

Lou Jost

2008 · Molecular Ecology · 2.5K citations · doi:10.1111/j.1365-294x.2008.03887.x

G(ST) and its relatives are often interpreted as measures of differentiation between subpopulations, with values near zero supposedly indicating low differentiation. However, G(ST) necessarily approaches zero when gene diversity is high, even if subpopulations are completely differentiated, and it is not monotonic with increasing differentiation. Likewise, when diversity is equated with heterozygosity, standard similarity measures formed by taking the ratio of mean within-subpopulation diversity to total diversity necessarily approach unity when diversity is high, even if the subpopulations are completely dissimilar (no shared alleles). None of these measures can be interpreted as measures of differentiation or similarity. The derivations of these measures contain two subtle misconceptions which cause their paradoxical behaviours. Conclusions about population differentiation, gene flow, relatedness, and conservation priority will often be wrong when based on these fixation indices or similarity measures. These are not statistical issues; the problems persist even when true population frequencies are used in the calculations. Recent advances in the mathematics of diversity identify the misconceptions, and yield mathematically consistent descriptive measures of population structure which eliminate the paradoxes produced by standard measures. These measures can be directly related to the migration and mutation rates of the finite-island model.
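The claim that G(ST) approaches zero at high diversity even under complete differentiation can be checked with a small numeric sketch. It assumes the standard definition G(ST) = (H_T - H_S) / H_T with heterozygosity 1 - sum(p^2), and equally sized subpopulations. The helper names and the toy populations are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity (gene diversity): 1 - sum of squared allele frequencies."""
    return 1.0 - sum(f * f for f in freqs)

def g_st(subpops):
    """G_ST = (H_T - H_S) / H_T for equally weighted subpopulations.

    subpops is a list of allele-frequency lists over a shared allele index.
    """
    n = len(subpops)
    h_s = sum(heterozygosity(p) for p in subpops) / n      # mean within-subpopulation diversity
    pooled = [sum(column) / n for column in zip(*subpops)]  # pooled allele frequencies
    h_t = heterozygosity(pooled)                            # total diversity
    return (h_t - h_s) / h_t

def disjoint_pair(k):
    """Two subpopulations sharing no alleles, each with k equally frequent alleles."""
    first = [1.0 / k] * k + [0.0] * k
    second = [0.0] * k + [1.0 / k] * k
    return [first, second]

for k in (1, 2, 10, 50):
    print(k, round(g_st(disjoint_pair(k)), 4))
```

For this construction G(ST) works out to 1 / (2k - 1), so it falls from 1 toward 0 as the number of alleles per subpopulation grows, although the two subpopulations never share a single allele.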

PARTITIONING DIVERSITY INTO INDEPENDENT ALPHA AND BETA COMPONENTS

Lou Jost

2007 · Ecology · 2.3K citations · doi:10.1890/06-1736.1

Existing general definitions of beta diversity often produce a beta with a hidden dependence on alpha. Such a beta cannot be used to compare regions that differ in alpha diversity. To avoid misinterpretation, existing definitions of alpha and beta must be replaced by a definition that partitions diversity into independent alpha and beta components. Such a unique definition is derived here. When these new alpha and beta components are transformed into their numbers equivalents (effective numbers of elements), Whittaker's multiplicative law (alpha x beta = gamma) is necessarily true for all indices. The new beta gives the effective number of distinct communities. The most popular similarity and overlap measures of ecology (Jaccard, Sorensen, Horn, and Morisita-Horn indices) are monotonic transformations of the new beta diversity. Shannon measures follow deductively from this formalism and do not need to be borrowed from information theory; they are shown to be the only standard diversity measures which can be decomposed into meaningful independent alpha and beta components when community weights are unequal.

Apache Hadoop YARN

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal +4 more

2013 · 1.8K citations · doi:10.1145/2523616.2523633

The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agorá, the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler.

Mussel-Inspired Adhesives and Coatings

Bruce P. Lee, Phillip B. Messersmith, Jacob N. Israelachvili, J. Herbert Waite

2011 · Annual Review of Materials Research · 1.6K citations · doi:10.1146/annurev-matsci-062910-100429

Mussels attach to solid surfaces in the sea. Their adhesion must be rapid, strong, and tough, or else they will be dislodged and dashed to pieces by the next incoming wave. Given the dearth of synthetic adhesives for wet polar surfaces, much effort has been directed to characterizing and mimicking essential features of the adhesive chemistry practiced by mussels. Studies of these organisms have uncovered important adaptive strategies that help to circumvent the high dielectric and solvation properties of water that typically frustrate adhesion. In a chemical vein, the adhesive proteins of mussels are heavily decorated with Dopa, a catecholic functionality. Various synthetic polymers have been functionalized with catechols to provide diverse adhesive, sealant, coating, and anchoring properties, particularly for critical biomedical applications.

Collection tree protocol

Omprakash Gnawali, Rodrigo Fonseca, Kyle Jamieson, David Moss +1 more

2009 · 1.4K citations · doi:10.1145/1644038.1644040

This paper presents and evaluates two principles for wireless routing protocols. The first is datapath validation: data traffic quickly discovers and fixes routing inconsistencies. The second is adaptive beaconing: extending the Trickle algorithm to routing control traffic reduces route repair latency and sends fewer beacons.

Crowdsourcing systems on the World-Wide Web

AnHai Doan, Raghu Ramakrishnan, Alon Halevy

2011 · Communications of the ACM · 1.4K citations · doi:10.1145/1924421.1924442

The practice of crowdsourcing is transforming the Web and giving rise to a new field.

ZooKeeper: wait-free coordination for internet-scale systems

P. G. Hunt, Mahadev Konar, Flavio Junqueira, Benjamin Reed

2010 · 1.3K citations

In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.

Vaccination greatly reduces disease, disability, death and inequity worldwide

André Fe, Robert Booy, H.L. Bock, John D. Clemens +4 more

2008 · Bulletin of the World Health Organization · 1.3K citations · doi:10.2471/blt.07.040089

In low-income countries, infectious diseases still account for a large proportion of deaths, highlighting health inequities largely caused by economic differences. Vaccination can cut health-care costs and reduce these inequities. Disease control, elimination or eradication can save billions of US dollars for communities and countries. Vaccines have lowered the incidence of hepatocellular carcinoma and will control cervical cancer. Travellers can be protected against "exotic" diseases by appropriate vaccination. Vaccines are considered indispensable against bioterrorism. They can combat resistance to antibiotics in some pathogens. Noncommunicable diseases, such as ischaemic heart disease, could also be reduced by influenza vaccination. Immunization programmes have improved the primary care infrastructure in developing countries, lowered mortality in childhood and empowered women to better plan their families, with consequent health, social and economic benefits. Vaccination helps economic growth everywhere, because of lower morbidity and mortality. The annual return on investment in vaccination has been calculated to be between 12% and 18%. Vaccination leads to increased life expectancy. Long healthy lives are now recognized as a prerequisite for wealth, and wealth promotes health. Vaccines are thus efficient tools to reduce disparities in wealth and inequities in health.

Finding high-quality content in social media

Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis +1 more

2008 · 1.2K citations · doi:10.1145/1341531.1341557

The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.

Collaborative filtering with temporal dynamics

Yehuda Koren

2009 · 1.2K citations · doi:10.1145/1557019.1557072

Customer preferences for products are drifting over time. Product perception and popularity are constantly changing as new selection emerges. Similarly, customer inclinations are evolving, leading them to ever redefine their taste. Thus, modeling temporal dynamics should be a key when designing recommender systems or general customer preference models. However, this raises unique challenges. Within the eco-system intersecting multiple products and customers, many different characteristics are shifting simultaneously, while many of them influence each other and often those shifts are delicate and associated with a few data instances. This distinguishes the problem from concept drift explorations, where mostly a single concept is tracked. Classical time-window or instance-decay approaches cannot work, as they lose too much signal when discarding data instances. A more sensitive approach is required, which can make better distinctions between transient effects and long term patterns. The paradigm we offer is creating a model tracking the time changing behavior throughout the life span of the data. This allows us to exploit the relevant components of all data instances, while discarding only what is modeled as being irrelevant. Accordingly, we revamp two leading collaborative filtering recommendation approaches. Evaluation is made on a large movie rating dataset by Netflix. Results are encouraging and better than those previously reported on this dataset.

[Transobturator urethral suspension: mini-invasive procedure in the treatment of stress urinary incontinence in women].

E. Delorme

2001 · PubMed · 1.1K citations

Transobturator tape is an artificial tape designed for urethral suspension to treat female stress urinary incontinence. This tape has two original features. Its non-woven polypropylene structure is coated with silicone on the urethral surface in order to limit retraction of polypropylene and to establish a barrier to extension of periurethral fibrosis. Transmuscular insertion, through the obturator and puborectalis muscles, reproduces the natural suspension fascia of the urethra while preserving the retropubic space. A preliminary study (40 implantations) confirmed the feasibility of this operation, the low morbidity (one complication: sepsis) and the encouraging results between 3 and 12 months; in the treatment of isolated incontinence (16 patients), no postoperative dysuria has been observed; 15 patients are totally continent and 1 patient is improved; in the treatment of prolapse associated with frank or potential incontinence (24 patients), transient postoperative dysuria was observed in 4 cases, with no postoperative incontinence.

Structure and evolution of online social networks

Ravi Kumar, Jasmine Novak, Andrew Tomkins

2006 · 1.1K citations · doi:10.1145/1150402.1150476

In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. We present a simple model of network growth which captures these aspects of component structure. The model follows our experimental results, characterizing users as either passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.

Anomaly Detection and Localization in Crowded Scenes

Weixin Li, Vijay Mahadevan, Nuno Vasconcelos

2013 · IEEE Transactions on Pattern Analysis and Machine Intelligence · 1.0K citations · doi:10.1109/tpami.2013.111

The detection and localization of anomalous behaviors in crowded scenes is considered, and a joint detector of temporal and spatial anomalies is proposed. The proposed detector is based on a video representation that accounts for both appearance and dynamics, using a set of mixture of dynamic textures models. These models are used to implement 1) a center-surround discriminant saliency detector that produces spatial saliency scores, and 2) a model of normal behavior that is learned from training data and produces temporal saliency scores. Spatial and temporal anomaly maps are then defined at multiple spatial scales, by considering the scores of these operators at progressively larger regions of support. The multiscale scores act as potentials of a conditional random field that guarantees global consistency of the anomaly judgments. A data set of densely crowded pedestrian walkways is introduced and used to evaluate the proposed anomaly detector. Experiments on this and other data sets show that the latter achieves state-of-the-art anomaly detection results.

Classification using intersection kernel support vector machines is efficient

Subhransu Maji, Alexander C. Berg, Jitendra Malik

2008 · 938 citations · doi:10.1109/cvpr.2008.4587630

Straightforward classification using kernelized SVMs requires evaluating the kernel for a test vector and each of the support vectors. For a class of kernels we show that one can do this much more efficiently. In particular we show that one can build histogram intersection kernel SVMs (IKSVMs) with runtime complexity of the classifier logarithmic in the number of support vectors as opposed to linear for the standard approach. We further show that by precomputing auxiliary tables we can construct an approximate classifier with constant runtime and space requirements, independent of the number of support vectors, with negligible loss in classification accuracy on various tasks. This approximation also applies to 1 - χ² and other kernels of similar form. We also introduce novel features based on multi-level histograms of oriented edge energy and present experiments on various detection datasets. On the INRIA pedestrian dataset an approximate IKSVM classifier based on these features has the current best performance, with a miss rate 13% lower at 10^-6 false positives per window than the linear SVM detector of Dalal & Triggs. On the Daimler Chrysler pedestrian dataset IKSVM gives comparable accuracy to the best results (based on quadratic SVM), while being 15 times faster. In these experiments our approximate IKSVM is up to 2000 times faster than a standard implementation and requires 200 times less memory. Finally we show that a 50 times speedup is possible using approximate IKSVM based on spatial pyramid features on the Caltech 101 dataset with negligible loss of accuracy.
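One way to obtain the logarithmic-time evaluation mentioned in this abstract is sketched below. It assumes the intersection kernel is the sum over dimensions of min(x_i, z_i), so the decision function can be split per dimension, with sorted support values and precomputed cumulative sums. The class and function names are hypothetical, the coefficients stand for the products of dual weights and labels, and the constant-time approximate tables are not shown.

```python
from bisect import bisect_left

def naive_decision(x, support_vectors, coeffs, bias=0.0):
    """Standard evaluation: one kernel computation per support vector."""
    return bias + sum(
        a * sum(min(xi, si) for xi, si in zip(x, sv))
        for a, sv in zip(coeffs, support_vectors)
    )

class FastIntersectionSVM:
    """Exact intersection-kernel decision function with one binary search per dimension."""

    def __init__(self, support_vectors, coeffs, bias=0.0):
        self.bias = bias
        self.tables = []
        for dim in range(len(support_vectors[0])):
            # Sort the support vectors by their value in this dimension.
            pairs = sorted((sv[dim], a) for sv, a in zip(support_vectors, coeffs))
            values = [v for v, _ in pairs]
            # prefix[r]: sum of coeff * value over the r smallest values.
            prefix = [0.0]
            for v, a in pairs:
                prefix.append(prefix[-1] + a * v)
            # suffix[r]: sum of coeff over entries r .. end.
            suffix = [0.0] * (len(pairs) + 1)
            for r in range(len(pairs) - 1, -1, -1):
                suffix[r] = suffix[r + 1] + pairs[r][1]
            self.tables.append((values, prefix, suffix))

    def decision(self, x):
        total = self.bias
        for s, (values, prefix, suffix) in zip(x, self.tables):
            r = bisect_left(values, s)  # count of support values strictly below s
            # Values below s contribute themselves; the rest contribute s.
            total += prefix[r] + s * suffix[r]
        return total

# Hypothetical toy model: three support vectors over three histogram bins.
support_vectors = [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.1, 0.1, 0.8]]
coeffs = [0.7, -1.2, 0.5]
query = [0.3, 0.3, 0.4]
fast = FastIntersectionSVM(support_vectors, coeffs, bias=0.1)
print(naive_decision(query, support_vectors, coeffs, bias=0.1), fast.decision(query))
```

Both evaluations give the same score up to floating-point rounding, but the second costs one binary search per dimension instead of one kernel evaluation per support vector.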

Robust Disambiguation of Named Entities in Text

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau +4 more

2011 · 862 citations

Disambiguating named entities in natural-language text maps mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base such as DBpedia or YAGO. This paper presents a robust method for collective disambiguation, by harnessing context from knowledge bases and using a new form of coherence graph. It unifies prior approaches into a comprehensive framework that combines three measures: the prior probability of an entity being mentioned, the similarity between the contexts of a mention and a candidate entity, as well as the coherence among candidate entities for all mentions together. The method builds a weighted graph of mentions and candidate entities, and computes a dense subgraph that approximates the best joint mention-entity mapping. Experiments show that the new method significantly outperforms prior methods in terms of accuracy, with robust behavior across a variety of inputs.

Time–motion analysis and physiological data of elite under-19-year-old basketball players during competition

Nidhal Ben Abdelkrim, Saloua El Fazâa, Jalila El Ati

2006 · British Journal of Sports Medicine · 820 citations · doi:10.1136/bjsm.2006.032318

The physical demands of modern basketball were assessed by investigating 38 elite under-19-year-old basketball players during competition. Computerised time-motion analyses were performed on 18 players of various positions. Heart rate was recorded continuously for all subjects. Blood was sampled before the start of each match, at half time and at full time to determine lactate concentration. Players spent 8.8% (1%), 5.3% (0.8%) and 2.1% (0.3%) of live time in high "specific movements", sprinting and jumping, respectively. Centres spent significantly lower live time competing in high-intensity activities than guards (14.7% (1%) v 17.1% (1.2%); p<0.01) and forwards (16.6% (0.8%); p<0.05). The mean (SD) heart rate during total time was 171 (4) beats/min, with a significant difference (p<0.01) between guards and centres. Mean (SD) plasma lactate concentration was 5.49 (1.24) mmol/l, with concentrations at half time (6.05 (1.27) mmol/l) being significantly (p<0.001) higher than those at full time (4.94 (1.46) mmol/l). The changes to the rules of basketball have slightly increased the cardiac efforts involved during competition. The game intensity may differ according to the playing position, being greatest in guards.

Orthogonal Laplacianfaces for Face Recognition

Deng Cai, Xiaofei He, Jiawei Han, H.-J. Zhang

2006 · IEEE Transactions on Image Processing · 796 citations · doi:10.1109/tip.2006.881945

Following the intuition that the naturally occurring face data may be generated by sampling a probability distribution that has support on or near a submanifold of ambient space, we propose an appearance-based face recognition method, called orthogonal Laplacianface. Our algorithm is based on the locality preserving projection (LPP) algorithm, which aims at finding a linear approximation to the eigenfunctions of the Laplace Beltrami operator on the face manifold. However, LPP is nonorthogonal, and this makes it difficult to reconstruct the data. The orthogonal locality preserving projection (OLPP) method produces orthogonal basis functions and can have more locality preserving power than LPP. Since the locality preserving power is potentially related to the discriminating power, the OLPP is expected to have more discriminating power than LPP. Experimental results on three face databases demonstrate the effectiveness of our proposed algorithm.
