Beijing Academy of Artificial Intelligence
other | Beijing, China
Research output, citation impact, and the most-cited recent papers from Beijing Academy of Artificial Intelligence (China). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Beijing Academy of Artificial Intelligence
Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Over the past two decades, we have seen a rapid technological evolution of object detection and its profound impact on the entire computer vision field. If we consider today’s object detection technique as a revolution driven by deep learning, then, back in the 1990s, we would see the ingenious thinking and long-term perspective design of early computer vision. This article extensively reviews this fast-moving research field in the light of technical evolution, spanning over a quarter-century’s time (from the 1990s to 2022). A number of topics have been covered in this article, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speedup techniques, and recent state-of-the-art detection methods.
Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/ERNIE.
ChatGPT, an artificial intelligence generated content (AIGC) model developed by OpenAI, has attracted world-wide attention for its capability of dealing with challenging language understanding and generation tasks in the form of conversations. This paper briefly provides an overview of the history, status quo and potential future development of ChatGPT, helping to provide an entry point to think about ChatGPT. Specifically, from the limited open-access resources, we summarize the core techniques of ChatGPT, mainly including large-scale language models, in-context learning, reinforcement learning from human feedback and the key technical steps for developing ChatGPT. We further analyze the pros and cons of ChatGPT and we rethink the duality of ChatGPT in various fields. Although it has been widely acknowledged that ChatGPT brings plenty of opportunities for various fields, mankind should still treat and use ChatGPT properly to avoid potential threats, e.g., to academic integrity and safety. Finally, we discuss several open problems as the potential development of ChatGPT.
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages of being able to utilize global statistics and strong local fitting capability. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement. Extensive experiments show the effectiveness of the proposed modules, and we further scale up the model to demonstrate that the performance of this task can be greatly improved. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
With the prevalence of pre-trained language models (PLMs) and the pre-training–fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down computation and storage costs. In general, it demonstrates that large-scale models could be effectively stimulated by the optimization of a few parameters. Despite the various designs, here we discuss and analyse the approaches under a more consistent and accessible term ‘delta-tuning’, where ‘delta’, a mathematical notation often used to denote changes, is borrowed to refer to the portion of parameters that are ‘changed’ during training. We formally describe the problem and propose a unified categorization criterion for existing delta-tuning methods to explore their correlations and differences. We also discuss the theoretical principles underlying the effectiveness of delta-tuning and interpret them from the perspectives of optimization and optimal control. Furthermore, we provide a holistic empirical study on over 100 natural language processing tasks and investigate various aspects of delta-tuning. With comprehensive study and analysis, our research demonstrates the theoretical and practical properties of delta-tuning in the adaptation of PLMs.
As an emerging biomedical image processing technology, medical image segmentation has made great contributions to sustainable medical care. Now it has become an important research direction in the field of computer vision. With the rapid development of deep learning, medical image processing based on deep convolutional neural networks has become a research hotspot. This paper focuses on the research of medical image segmentation based on deep learning. First, the basic ideas and characteristics of medical image segmentation based on deep learning are introduced. By explaining its research status and summarizing the three main methods of medical image segmentation and their own limitations, the future development direction is expanded. Based on the discussion of different pathological tissues and organs, the specificity between them and their classic segmentation algorithms are summarized. Despite the great achievements of medical image segmentation in recent years, medical image segmentation based on deep learning has still encountered difficulties in research. For example, the segmentation accuracy is not high, the number of medical images in the data set is small and the resolution is low. The inaccurate segmentation results are unable to meet the actual clinical requirements. Aiming at the above problems, a comprehensive review of current medical image segmentation methods based on deep learning is provided to help researchers solve existing problems.
Transformers have recently shown superior performances on various vision tasks. The large, sometimes even global, receptive field endows Transformer models with higher representation power over their CNN counterparts. Nevertheless, simply enlarging receptive field also gives rise to several concerns. On the one hand, using dense attention e.g., in ViT, leads to excessive memory and computational cost, and features can be influenced by irrelevant parts which are beyond the region of interests. On the other hand, the sparse attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long range relations. To mitigate these issues, we propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way. This flexible scheme enables the self-attention module to focus on relevant regions and capture more informative features. On this basis, we present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks. Extensive experiments show that our models achieve consistently improved results on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/DAT.
Importance: Cancers are a leading cause of mortality, accounting for nearly 10 million annual deaths worldwide, or 1 in 6 deaths. Cancers also negatively affect countries' economic growth. However, the global economic cost of cancers and its worldwide distribution have yet to be studied. Objective: To estimate and project the economic cost of 29 cancers in 204 countries and territories. Design, Setting, and Participants: A decision analytical model that incorporates economic feedback in assessing health outcomes associated with the labor force and investment. A macroeconomic model was used to account for (1) the association of cancer-related mortality and morbidity with labor supply; (2) age-sex-specific differences in education, experience, and labor market participation of those who are affected by cancers; and (3) the diversion of cancer treatment expenses from savings and investments. Data were collected on April 25, 2022. Main Outcomes and Measures: Economic cost of 29 cancers across countries and territories. Costs are presented in international dollars at constant 2017 prices. Results: The estimated global economic cost of cancers from 2020 to 2050 is $25.2 trillion in international dollars (at constant 2017 prices), equivalent to an annual tax of 0.55% on global gross domestic product. The 5 cancers with the highest economic costs are tracheal, bronchus, and lung cancer (15.4%); colon and rectum cancer (10.9%); breast cancer (7.7%); liver cancer (6.5%); and leukemia (6.3%). China and the US face the largest economic costs of cancers in absolute terms, accounting for 24.1% and 20.8% of the total global burden, respectively. Although 75.1% of cancer deaths occur in low- and middle-income countries, their share of the economic cost of cancers is lower at 49.5%. The relative contribution of treatment costs to the total economic cost of cancers is greater in high-income countries than in low-income countries. 
Conclusions and Relevance: In this decision analytical modeling study, the macroeconomic cost of cancers was found to be substantial and distributed heterogeneously across cancer types, countries, and world regions. The findings suggest that global efforts to curb the ongoing burden of cancers are warranted.
Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of finetuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to finetuning and a strong baseline for future research.
Dynamic neural network is an emerging research topic in deep learning. Compared to static models which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc. In this survey, we comprehensively review this rapidly developing area by dividing dynamic networks into three main categories: 1) sample-wise dynamic models that process each sample with data-dependent architectures or parameters; 2) spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and 3) temporal-wise dynamic models that perform adaptive inference along the temporal dimension for sequential data such as videos and texts. The important research problems of dynamic networks, e.g., architecture design, decision making scheme, optimization technique and applications, are reviewed systematically. Finally, we discuss the open problems in this field together with interesting future research directions.
Retinal screening contributes to early detection of diabetic retinopathy and timely treatment. To facilitate the screening process, we develop a deep learning system, named DeepDR, that can detect early-to-late stages of diabetic retinopathy. DeepDR is trained for real-time image quality assessment, lesion detection and grading using 466,247 fundus images from 121,342 patients with diabetes. Evaluation is performed on a local dataset with 200,136 fundus images from 52,004 patients and three external datasets with a total of 209,322 images. The areas under the receiver operating characteristic curve for detecting microaneurysms, cotton-wool spots, hard exudates and hemorrhages are 0.901, 0.941, 0.954 and 0.967, respectively. The grading of diabetic retinopathy as mild, moderate, severe and proliferative achieves areas under the curve of 0.943, 0.955, 0.960 and 0.972, respectively. In external validations, the areas under the curve for grading range from 0.916 to 0.970, which further supports that the system is efficient for diabetic retinopathy grading.
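The area-under-the-ROC-curve figures quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way (pairwise comparison, ties counted as one half). It is a generic illustration, not the DeepDR evaluation code; the function name `roc_auc` and the toy labels and scores are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (hypothetical toy data).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based implementations are the usual choice for datasets of the size described above.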
Alzheimer's disease (AD) is a progressive and irreversible brain degenerative disorder. Mild cognitive impairment (MCI) is a clinical precursor of AD. Although some treatments can delay its progression, no effective cures are available for AD. Accurate early-stage diagnosis of AD is vital for the prevention and intervention of the disease progression. Hippocampus is one of the first affected brain regions in AD. To help AD diagnosis, the shape and volume of the hippocampus are often measured using structural magnetic resonance imaging (MRI). However, these features encode limited information and may suffer from segmentation errors. Additionally, the extraction of these features is independent of the classification model, which could result in sub-optimal performance. In this study, we propose a multi-model deep learning framework based on convolutional neural network (CNN) for joint automatic hippocampal segmentation and AD classification using structural MRI data. Firstly, a multi-task deep CNN model is constructed for jointly learning hippocampal segmentation and disease classification. Then, we construct a 3D Densely Connected Convolutional Networks (3D DenseNet) to learn features of the 3D patches extracted based on the hippocampal segmentation results for the classification task. Finally, the learned features from the multi-task CNN and DenseNet models are combined to classify disease status. Our method is evaluated on the baseline T1-weighted structural MRI data collected from 97 AD, 233 MCI, 119 Normal Control (NC) subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The proposed method achieves a dice similarity coefficient of 87.0% for hippocampal segmentation. In addition, the proposed method achieves an accuracy of 88.9% and an AUC (area under the ROC curve) of 92.5% for classifying AD vs. NC subjects, and an accuracy of 76.2% and an AUC of 77.5% for classifying MCI vs. NC subjects. 
Our empirical study also demonstrates that the proposed multi-model method outperforms the single-model methods and several other competing methods.
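The Dice similarity coefficient reported for hippocampal segmentation measures the overlap between a predicted mask and a reference mask as 2|A ∩ B| / (|A| + |B|). A minimal sketch on flattened binary masks follows. The function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]   # hypothetical predicted voxels
    reference = [1, 0, 0, 0]   # hypothetical ground-truth voxels
    print(dice_coefficient(predicted, reference))
```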
It is a challenging task to learn discriminative representation from images and videos, due to large local redundancy and complex global dependency in these visual data. Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, the limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, while blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which can seamlessly integrate the merits of convolution and self-attention in a concise transformer format. Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing tackling both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our blocks into a new powerful backbone, and adopt it for various vision tasks from image to video domain, from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification task. With only ImageNet-1K pre-training, it can simply achieve state-of-the-art performance in a broad range of downstream tasks. It obtains 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification tasks, 53.8 box AP and 46.4 mask AP on COCO object detection task, 50.8 mIoU on ADE20K semantic segmentation task, and 77.4 AP on COCO pose estimation task.
Moreover, we build an efficient UniFormer with a concise hourglass design of token shrinking and recovering, which achieves 2-4× higher throughput than the recent lightweight models.
Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered as two peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of computations of these two paradigms are in fact done with the same operation. Specifically, we first show that a traditional convolution with kernel size k × k can be decomposed into k² individual 1 × 1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in self-attention module as multiple 1 × 1 convolutions, followed by the computation of attention weights and aggregation of the values. Therefore, the first stage of both modules comprises a similar operation. More importantly, the first stage contributes a dominant computation complexity (square of the channel size) compared to the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefit of both self-Attention and Convolution (ACmix), while having minimum computational overhead compared to the pure convolution or self-attention counterpart. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/LeapLabTHU/ACmix and https://gitee.com/mindspore/models.
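The decomposition described in the abstract above can be checked numerically. The sketch below compares a direct k × k convolution with the sum of k² shifted 1 × 1 convolutions on a small tensor stored as nested lists. It assumes stride 1, odd k, zero padding and the cross-correlation convention; all function names are illustrative and none of this is taken from the ACmix code base.

```python
def zeros(channels, height, width):
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def conv_direct(x, weight, k):
    """Reference k x k convolution. x: [C_in][H][W], weight: [C_out][C_in][k][k]."""
    h, w, r = len(x[0]), len(x[0][0]), k // 2
    out = zeros(len(weight), h, w)
    for o in range(len(weight)):
        for i in range(h):
            for j in range(w):
                for c in range(len(x)):
                    for p in range(k):
                        for q in range(k):
                            ii, jj = i + p - r, j + q - r
                            if 0 <= ii < h and 0 <= jj < w:  # zero padding
                                out[o][i][j] += weight[o][c][p][q] * x[c][ii][jj]
    return out


def conv1x1(x, mat):
    """1 x 1 convolution: a per-pixel channel mixing with mat: [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    out = zeros(len(mat), h, w)
    for o, row in enumerate(mat):
        for c, coeff in enumerate(row):
            for i in range(h):
                for j in range(w):
                    out[o][i][j] += coeff * x[c][i][j]
    return out


def conv_shift_sum(x, weight, k):
    """Same result built from k*k separate 1 x 1 convolutions, each shifted and summed."""
    h, w, r = len(x[0]), len(x[0][0]), k // 2
    c_out, c_in = len(weight), len(x)
    out = zeros(c_out, h, w)
    for p in range(k):
        for q in range(k):
            # Weights at kernel position (p, q) form one 1 x 1 convolution.
            mat = [[weight[o][c][p][q] for c in range(c_in)] for o in range(c_out)]
            y = conv1x1(x, mat)
            # Shift the 1 x 1 output by (p - r, q - r) and accumulate.
            for o in range(c_out):
                for i in range(h):
                    for j in range(w):
                        ii, jj = i + p - r, j + q - r
                        if 0 <= ii < h and 0 <= jj < w:
                            out[o][i][j] += y[o][ii][jj]
    return out


def max_abs_diff(a, b):
    return max(abs(u - v)
               for pa, pb in zip(a, b)
               for ra, rb in zip(pa, pb)
               for u, v in zip(ra, rb))
```

Calling `max_abs_diff(conv_direct(x, w, 3), conv_shift_sum(x, w, 3))` on any input of matching shape should give a value at or near zero, up to floating-point rounding.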
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Recently deep learning (DL), as a new data‐driven technique compared to conventional approaches, has attracted increasing attention in the geophysical community, resulting in many opportunities and challenges. DL was proven to have the potential to predict complex system states accurately and relieve the “curse of dimensionality” in large temporal and spatial geophysical applications. We address the basic concepts, state‐of‐the‐art literature, and future trends by reviewing DL approaches in various geosciences scenarios. Exploration geophysics, earthquakes, and remote sensing are the main focuses. More applications, including Earth structure, water resources, atmospheric science, and space science, are also reviewed. Additionally, the difficulties of applying DL in the geophysical community are discussed. The trends of DL in geophysics in recent years are analyzed. Several promising directions are provided for future research involving DL in geophysics, such as unsupervised learning, transfer learning, multimodal DL, federated learning, uncertainty estimation, and active learning. A coding tutorial and a summary of tips for rapidly exploring DL are presented for beginners and interested readers of geophysics.
Graph convolutional networks (GCNs), which generalize CNNs to more generic non-Euclidean structures, have achieved remarkable performance for skeleton-based action recognition. However, there still exist several issues in the previous GCN-based models. First, the topology of the graph is set heuristically and fixed over all the model layers and input data. This may not be suitable for the hierarchy of the GCN model and the diversity of the data in action recognition tasks. Second, the second-order information of the skeleton data, i.e., the length and orientation of the bones, is rarely investigated, which is naturally more informative and discriminative for the human action recognition. In this work, we propose a novel multi-stream attention-enhanced adaptive graph convolutional neural network (MS-AAGCN) for skeleton-based action recognition. The graph topology in our model can be either uniformly or individually learned based on the input data in an end-to-end manner. This data-driven approach increases the flexibility of the model for graph construction and brings more generality to adapt to various data samples. Besides, the proposed adaptive graph convolutional layer is further enhanced by a spatial-temporal-channel attention module, which helps the model pay more attention to important joints, frames and features. Moreover, the information of both the joints and bones, together with their motion information, are simultaneously modeled in a multi-stream framework, which shows notable improvement for the recognition accuracy. Extensive experiments on the two large-scale datasets, NTU-RGBD and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the state-of-the-art with a significant margin.
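The "second-order information" mentioned above, bone length and orientation, can be derived from joint coordinates by differencing each joint with its parent in the skeleton tree. The sketch below illustrates that idea only. The parent table, the direction of the difference (child minus parent) and the function name `bone_features` are assumptions of this example, not the MS-AAGCN preprocessing code.

```python
import math


def bone_features(joints, parents):
    """Return (joint_index, bone_vector, length, unit_direction) per non-root joint.

    joints:  list of (x, y, z) coordinates for one frame.
    parents: parents[j] is the index of joint j's parent, or -1 for the root
             (a hypothetical skeleton layout supplied by the caller).
    """
    features = []
    for j, parent in enumerate(parents):
        if parent < 0:
            continue  # the root joint has no incoming bone
        vec = tuple(a - b for a, b in zip(joints[j], joints[parent]))
        length = math.sqrt(sum(v * v for v in vec))
        if length > 0:
            direction = tuple(v / length for v in vec)
        else:
            direction = tuple(0.0 for _ in vec)  # degenerate bone, no orientation
        features.append((j, vec, length, direction))
    return features


if __name__ == "__main__":
    toy_joints = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0), (0.0, 3.0, 5.0)]
    toy_parents = [-1, 0, 1]
    for item in bone_features(toy_joints, toy_parents):
        print(item)
```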
OBJECTIVE: We aimed to evaluate the performance of the newly developed deep learning Radiomics of elastography (DLRE) for assessing liver fibrosis stages. DLRE adopts the radiomic strategy for quantitative analysis of the heterogeneity in two-dimensional shear wave elastography (2D-SWE) images. DESIGN: A prospective multicentre study was conducted to assess its accuracy in patients with chronic hepatitis B, in comparison with 2D-SWE, aspartate transaminase-to-platelet ratio index and fibrosis index based on four factors, by using liver biopsy as the reference standard. Its accuracy and robustness were also investigated by applying different numbers of acquisitions and different training cohorts, respectively. Data of 654 potentially eligible patients were prospectively enrolled from 12 hospitals, and finally 398 patients with 1990 images were included. Analysis of receiver operating characteristic (ROC) curves was performed to calculate the optimal area under the ROC curve (AUC) for cirrhosis (F4), advanced fibrosis (≥F3) and significant fibrosis (≥F2). RESULTS: AUCs of DLRE were 0.97 for F4 (95% CI 0.94 to 0.99), 0.98 for ≥F3 (95% CI 0.96 to 1.00) and 0.85 (95% CI 0.81 to 0.89) for ≥F2, which were significantly better than other methods except 2D-SWE in ≥F2. Its diagnostic accuracy improved as more images (especially ≥3 images) were acquired from each individual. No significant variation of the performance was found if different training cohorts were applied. CONCLUSION: DLRE shows the best overall performance in predicting liver fibrosis stages compared with 2D-SWE and biomarkers. It is valuable and practical for the non-invasive accurate diagnosis of liver fibrosis stages in HBV-infected patients. TRIAL REGISTRATION NUMBER: NCT02313649; Post-results.
Automatically diagnosing lung cancer from computed tomography scans involves two steps: detect all suspicious lesions (pulmonary nodules) and evaluate the whole-lung/pulmonary malignancy. Currently, there are many studies about the first step, but few about the second step. Since the existence of a nodule does not definitely indicate cancer, and the morphology of a nodule has a complicated relationship with cancer, the diagnosis of lung cancer demands careful investigation of every suspicious nodule and integration of information from all nodules. We propose a 3-D deep neural network to solve this problem. The model consists of two modules. The first one is a 3-D region proposal network for nodule detection, which outputs all suspicious nodules for a subject. The second one selects the top five nodules based on the detection confidence, evaluates their cancer probabilities, and combines them with a leaky noisy-OR gate to obtain the probability of lung cancer for the subject. The two modules share the same backbone network, a modified U-net. The overfitting caused by the shortage of the training data is alleviated by training the two modules alternately. The proposed model won the first place in the Data Science Bowl 2017 competition.
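The combination step described above can be sketched with a common formulation of the leaky noisy-OR gate: the subject is cancer-free only if every selected nodule is benign and a hypothetical "leak" cause is also absent, so P = 1 − (1 − p_leak) · ∏(1 − p_i). In the sketch below the leak probability is a fixed illustrative constant, and the function names and toy inputs are assumptions of this example rather than the competition code.

```python
def leaky_noisy_or(nodule_probs, leak_prob):
    """Combine per-nodule cancer probabilities with a leaky noisy-OR gate."""
    no_cancer = 1.0 - leak_prob          # probability the leak cause is absent
    for p in nodule_probs:
        no_cancer *= 1.0 - p             # and every nodule is benign
    return 1.0 - no_cancer


def subject_cancer_probability(nodules, leak_prob=0.01, top_k=5):
    """nodules: list of (detection_confidence, cancer_probability) pairs.

    Keeps the top_k nodules by detection confidence, as in the abstract,
    then applies the gate. leak_prob=0.01 is an illustrative value only.
    """
    top = sorted(nodules, key=lambda n: n[0], reverse=True)[:top_k]
    return leaky_noisy_or([prob for _, prob in top], leak_prob)


if __name__ == "__main__":
    toy_nodules = [(0.9, 0.30), (0.8, 0.10), (0.4, 0.05)]  # hypothetical values
    print(subject_cancer_probability(toy_nodules))
```

With no nodules the output reduces to the leak probability, which lets the gate assign a non-zero risk even when detection finds nothing.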