Harbin Institute of Technology

Guan Yi (关毅)

Published: 2024-05-10

Basic Information

http://wi.hit.edu.cn

IEEE Senior Member; member of AAAS and ACM; CCF Senior Member; CAA member. Long engaged in research on natural language processing. Has led or participated in several key and general projects of the National Natural Science Foundation of China (NSFC) and the national 863 Program, as well as international and domestic cooperative projects. Has published over one hundred papers in international journals such as IEEE Transactions on Systems, Man, and Cybernetics, leading domestic journals such as Science in China, and international and domestic conferences such as ACL, and has proposed several original contributions including the theory of systematic similarity and its measurement. Holds seven Chinese patents and two U.S. patents. Developed the WI input method, a proprietary sentence input method for mobile platforms that later became the dedicated input method for Chinese-built aircraft and won a third prize of the 2023 Science and Technology Award of the Aviation Industry Corporation of China. Subsequently moved into artificial-intelligence research for healthcare: proposed an annotation specification and annotated corpus for Chinese electronic medical records, developed systems for extracting medical entities and entity relations, and studied medical ontology construction and probabilistic graphical models for automated diagnosis; the Heilongjiang provincial project on fundamental theory and methods of electronic health in a big-data environment, in which he participated, won a second prize for scientific and technological progress among Heilongjiang universities. As a principal developer, took part in the development of the Microsoft Pinyin input method, the BOPOMOFO intelligent Chinese character input system, and the Weniwen intelligent search engine. Author of one monograph. Main research directions: health informatics, natural language processing, and knowledge engineering. Teaches the graduate core course Natural Language Processing at HIT.

Work Experience

2000: Associate Researcher, Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology
2001: Research Scientist, Weniwen Technologies Ltd., Hong Kong
May 2001: Joined the faculty of Harbin Institute of Technology (HIT)
October 2001: Appointed Associate Professor of HIT by special evaluation
2006: Professor, HIT
2007: Doctoral supervisor, HIT

Education

1988 to 1992: B.Eng. in Software, Department of Computer Science and Engineering, Tianjin University
1992 to 1995: Qualified for the combined master's and doctoral program in Computer Applications, HIT
1995 to 1999: Ph.D. in Computer Applications, HIT

Research Projects

- Reinforcement learning mechanisms based on autonomous word-agent learning for inter-sentence semantic similarity computation. NSFC; 2009-01 to 2012-12; funding 320k RMB; role: lead; horizontal project; completed.
- Methods for online public-opinion analysis and early-warning mechanisms for unconventional emergencies. NSFC key program; 2009-01 to 2012-12; funding 35; role: participant; horizontal project; completed.
- Next-generation information retrieval systems. NSFC key program; 2008-01 to 2011-12; funding 190; role: participant; vertical project; completed.
- Text sentiment orientation based on a new systematic similarity measure. Open fund of the Microsoft and Ministry of Education Key Laboratory of Language and Speech; 2010-01 to 2012-06; funding 40k RMB; role: lead; horizontal project; ongoing.
- WI input method: a sentence input system for the iOS platform. Self-initiated; 2010-11 to 2019-12; role: lead; horizontal project; ongoing.
- Information mining for product nodes on the Taobao shopping website. Taobao; 2010-09 to 2011-03; funding 15; role: lead; vertical project; ongoing.
- Shallow syntactic parsing for Alibaba. Alibaba; 2009-01 to 2009-11; funding 15; role: lead; vertical project; ongoing.
- Sentiment analysis of blogs and BBS posts for Fujitsu. Fujitsu R&D Center; 2008-10 to 2009-06; funding 15; role: lead; vertical project; ongoing.
- Implicit user-interest mining. MySpace; 2007-12 to 2008-12; funding 10; role: lead; vertical project; ongoing.
- Theory and methods of question-answering information retrieval. NSFC key program; 2006-01 to 2009-12; funding 190; role: participant; vertical project; completed.
- Danger-model artificial immune network theory and methods for intelligent information retrieval. NSFC young scholars program; 2006-01 to 2009-12; funding 24; role: lead; vertical project; completed.
- Singapore lexical-analysis international cooperation project. Institute for Infocomm Research, Singapore; 2008-01 to 2008-12; funding 25; role: lead; vertical project; ongoing.
- Website topic analysis, indexing, and retrieval. Microsoft fund; 2006-06 to 2007-06; funding 5; role: lead; vertical project; ongoing.
- Domain-specific lexicon acquisition and statistical language model construction. Microsoft fund; 2004-06 to 2006-06; funding 6.5; role: lead; vertical project; ongoing.
- Automatic recognition and annotation of open semantic-class named entities in web information. HIT university fund; 2003-06 to 2005-06; funding 2; role: lead; vertical project; completed.
- Rough-set-based models for linguistic knowledge discovery from large-scale corpora. NSFC; 2002-01 to 2004-12; funding 19; role: participant; vertical project; completed.
- Corpus processing, summarization, and retrieval for Olympic intelligent information services. 863 Program key project; 2003-12 to 2005-12; funding 30; role: participant; vertical project; completed.
- China Unicom customer-service question answering system. Bada Group (八达集团); 2002-06 to 2003-06; funding 12; role: lead; vertical project; ongoing.
- Intelligent input for mobile-phone operating systems. Fujitsu; 2002-03 to 2003-06; funding 18 million JPY; role: participant; vertical project; ongoing.
- Content-based web information compression and automatic summary generation. Network security project; 2001-10 to 2002-10; funding 60; role: participant; vertical project; completed.
- Intelligent Chinese information processing platform. 863 Program; 2001-10 to 2002-10; funding 60; role: participant; vertical project; completed.

Awards

WI Input Method (2011). Contributors: 关毅, 阎于闻, 周春波, 贾祯, 田作辉, et al. Award: nomination for Best Technical Innovation, 2010 China Internet Innovative Products. WI is an intelligent pinyin sentence input method for the iPhone, iPad, and iPod touch, developed by the Web Intelligence group of the Language Technology Research Center, School of Computer Science, HIT. It supports sentence-level input, intelligent keystroke correction for full pinyin, fuzzy-tone input, abbreviated pinyin, and several double-pinyin schemes.

Research Areas

Health informatics; intelligent information retrieval; web mining; natural language processing; cognitive linguistics.

Courses

Natural Language Processing (graduate core course).

Publications

基于电子商务用户行为的同义词识别 (Synonym identification based on e-commerce user behavior). 张书娟, 董喜双, 关毅. Journal of Chinese Information Processing (中文信息学报), Vol. 26, No. 3.
This paper studies automatic synonym identification in the e-commerce domain. To handle the domain's abundance of new words, misspellings, and near-synonyms, it proposes a user-behavior-based identification method. Candidate pairs are first collected in two ways: by splitting product titles at coordination markers and by clustering queries in the spirit of SimRank. Literal features of the two words and user-behavior features from titles, queries, and clicks are then extracted, and a Gradient Boosting Decision Tree (GBDT) model decides whether a pair is synonymous. Experiments show a synonym-identification precision of 54.46%, nearly 4 percentage points higher than an SVM baseline.

基于最大熵模型和最小割模型的中文词与句褒贬极性分析 (Polarity analysis of Chinese words and sentences with maximum entropy and minimum cut models). 董喜双, 邹启波, 关毅, 高翔, 闫铭. The 3rd Chinese Opinion Analysis Evaluation (COAE2011).
This paper predicts the polarity of Chinese words and sentences with maximum entropy and minimum cut models. Word-level analysis first builds a domain sentiment lexicon, extracts candidate words with it, predicts candidate polarity with a maximum entropy model, and refines the results with a minimum cut model. Sentence-level analysis identifies opinion sentences with the domain lexicon, splits them into clauses, extracts rule-based features, predicts clause polarity with the maximum entropy model, and derives the polarity of the full sentence from its clauses.

基于购物网站用户搜索日志的商品词发现 (Commodity-word discovery from the search logs of shopping websites). 杨锦锋, 吕新波, 关毅, 周春波. Computer Applications and Software (计算机应用与软件), 2011, 28(11).
Commodity words are new words that describe products in e-commerce. This paper presents a method for discovering commodity words from the search logs of a shopping website. User queries are extracted from the logs and segmented; an incremental n-gram stepwise algorithm with string-frequency statistics computes the conditional probabilities of candidate strings, from which candidate commodity words are selected. To reduce the cost of manual review, only the precision of the produced commodity words is evaluated. Experiments on search logs for mobile phones, face cream, and perfume reach a precision of up to 92.58%.

Automatically Generating Questions from Queries for Community-based Question Answering. Zhao, Shiqi; Wang, Haifeng; Li, Chao; Liu, Ting; Guan, Yi. Proceedings of the 5th International Joint Conference on Natural Language Processing.
This paper proposes a method that automatically generates questions from queries for community-based question answering (cQA) services. Our query-to-question generation model is built upon templates induced from search engine query logs. In detail, we first extract pairs of queries and user-clicked questions from query logs, with which we induce question generation templates. Then, when a new query is submitted, we select proper templates for the query and generate questions through template instantiation. We evaluated the method with a set of short queries randomly selected from query logs, and the generated questions were judged by human annotators. Experimental results show that the precision of 1-best and 5-best generated questions is 67% and 61%, respectively, which outperforms a baseline method that directly retrieves questions for queries in a cQA site search engine. In addition, the results suggest that the proposed method can improve the search of cQA archives.

电子商务中针对产品的摘要挖掘技术研究 (Summary mining for products in e-commerce). 季知祥, 董喜双, 关毅. 2011 International Conference on Information Technology and Management Science.
In this paper, we present a novel approach for mining summaries of e-commerce products from their description text. A product summary is composed of phrases with independent meaning drawn from different aspects, unlike traditional multi-document summarization built by selecting sentences. First, after extracting the body text, splitting it into sentences, and removing repeated sentences, the sentences are clustered into sub-topics that describe the product from various aspects. The sentences are then divided at segmentation words to obtain candidate phrases. Finally, the phrases are classified with a maximum entropy model, and the highest-scoring phrase in each category is extracted to form the product summary. Experiments indicate a precision as high as 90%.

基于X²统计和词情感分类相结合的中文情感词挖掘 (Mining Chinese sentiment words by combining chi-square statistics and word sentiment classification). 张书娟, 朱力, 关毅, 董喜双. 2011 International Conference on Information Technology and Management Science.
A sentiment lexicon is constructed by sentiment-score counting of Chinese characters, semantic-similarity calculation, and pointwise mutual information. To enrich the lexicon, we combine chi-square statistics and word sentiment classification to mine sentiment words that are not yet contained in the lexicon. The average precision of polarity judgment of sentiment words is improved by 3%.

基于最大熵马尔科夫模型和条件随机域模型的汉语组块分析技术研究 (Chinese chunking based on maximum entropy Markov models and conditional random fields). 李超, 关毅, 李生. 2011 International Conference on Information Technology and Management Science.
In this paper, we present a Chinese chunking method in which the chunking problem is transformed into a sequential labeling process using Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF). The MEMM achieved an F-measure of 93.2% with the help of candidate-tag selection, which significantly reduces errors caused by label bias and saves testing time. When CRFs are used, MEMMs select the effective feature templates in their place, saving more than 80% of the time; the CRF achieved an F-measure of 93.4%.
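The chi-square association step used in the Chinese sentiment-word mining entry above can be sketched with a toy calculation. This is a minimal illustration, not the paper's implementation; the 2x2 counts, the function name, and the example word statistics are all invented for the sketch:

```python
# Minimal sketch: chi-square association between a candidate word and the
# positive class, computed from a 2x2 contingency table.  Counts are invented.

def chi_square(a, b, c, d):
    """a: positive docs containing the word, b: negative docs containing it,
    c: positive docs without it, d: negative docs without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# A word appearing mostly in positive documents gets a high score and thus
# becomes a candidate positive sentiment word.
score = chi_square(a=80, b=5, c=20, d=95)
print(round(score, 2))
```

A high score only signals association with one polarity class; per the abstract, the statistic is combined with word sentiment classification before a candidate is admitted into the lexicon.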
中文情感词倾向消歧 (Polarity disambiguation of Chinese sentiment words). 孙慧, 关毅, 董喜双. Proceedings of the 6th National Conference on Information Retrieval (CCIR 2010).
Word-level sentiment analysis is the foundation of text-level sentiment analysis. To address the absolute polarity labels assigned by lexicon-based methods, this paper proposes a way to obtain context-dependent word polarity. Because annotated resources with context-dependent sentiment words are scarce, one is constructed by combining maximum entropy cross-validation with manual correction, and on this basis a set of context-dependent features is built to predict a sentiment word's polarity in context. Experiments show an F-measure improvement of 4.9% over a lexicon-based method.

Selecting Optimal Feature Template Subset for CRFs. Xingjun Xu, Guanglu Sun, Yi Guan, Xishuang Dong, Sheng Li. Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010).
Conditional Random Fields (CRFs) are the state-of-the-art models for sequential labeling problems. A critical step is to select an optimal feature template subset before employing CRFs, which is a tedious task. To improve the efficiency of this step, we propose a new method that adopts the maximum entropy (ME) model and maximum entropy Markov models (MEMMs) instead of CRFs, considering the homology between ME, MEMMs, and CRFs. Moreover, empirical studies on the efficiency and effectiveness of the method are conducted in the field of Chinese text chunking, whose performance ranked first in task two of CIPS-ParsEval-2009.

HIT_LTRC at TREC 2010 Blog Track: Faceted Blog Distillation. Jinfeng Yang, Xishuang Dong, Yi Guan, Chengzhen Huang, Sheng Wang. Proceedings of TREC 2010.
This paper describes our participation in the faceted blog distillation task at Blog Track 2010. In our approach, the Indri toolkit is applied for basic topic-relevance retrieval. A Maximum Entropy (ME) model is then adopted to judge the relevance of each blog to a specified facet. Feed faceted relevance is calculated by integrating the average relevance of all blogs within a feed and the average relevance of the most relevant N blogs. Two implementations are applied to calculate feed faceted relevance. Experimental results on the Blogs08 dataset show the effectiveness of our approach.
Complete Syntactic Analysis Based on Multi-level Chunking. ZhiPeng Jiang, Yu Zhao, Yi Guan, Chao Li, Sheng Li. Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010).
This paper describes a complete syntactic analysis system based on multi-level chunking. On the basis of the correct sequences of Chinese words provided by CLP2010, the system first performs part-of-speech (POS) tagging with Conditional Random Fields (CRFs), then does base chunking and complex chunking with Maximum Entropy (ME), and finally generates a complete syntactic analysis tree. The system took part in the Complete Sentence Parsing track of Task 2 (Chinese Parsing) in CLP2010, achieving an F-1 measure of 63.25% on the overall analysis (ranked sixth) and a POS accuracy of 89.62% (ranked third).

Learning of humanoid robot walk parameters based on FSR. Yuan Quan-De, Hong Bing-Rong, Guan Yi, Ke Wen-De. Journal of Harbin Institute of Technology (New Series), 2010, 17.

网页结构树相似度计算 (Similarity computation for web page structure trees). 祁钰, 关毅, 吕新波, 岳淑珍. Journal of Natural Science of Heilongjiang University (黑龙江大学自然科学学报), 2009, No. 5.
This paper proposes a similarity computation method for web page structure trees. The tag structure of a page is first represented as a tree, and a dynamic programming algorithm then computes the distance between two trees, which measures the similarity of the two pages. Experiments show that the method correctly distinguishes pages of the same class from pages of different classes.

基于最大熵模型的汉语基本块分析技术研究 (Chinese base chunking based on maximum entropy models). 李超, 孙健, 关毅, 徐兴军, 侯磊, 李生. CIPS-ParsEval-2009.
This paper describes a Chinese base chunking method that sequentially labels chunk boundaries and constituent information with a maximum entropy Markov model and recognizes chunk relation information with a maximum entropy classifier. To reduce recognition errors, improvements such as candidate-tag filtering and recognition of difficult relations are explored. The integrated system achieves an F-measure of 93.196% for boundary and constituent labels and 92.103% for relation labels, ranking first in Task 2 (Chinese base chunking) of CIPS-ParsEval-2009.

基于最大熵模型的中文词与句情感分析研究 (Emotion analysis of Chinese words and sentences based on maximum entropy models). 董喜双, 关毅, 李本阳, 陈志杰. The 2nd Chinese Opinion Analysis Evaluation (COAE2009).
This paper focuses on four emotion classes (joy, anger, sorrow, and fear), addressing emotion analysis of Chinese words and sentences. Word-level analysis is treated as classification of candidate words: candidates are obtained by part-of-speech filtering, their emotion features are extracted with feature templates, a maximum entropy model assigns emotion classes, and classification errors are then filtered out with a neutral-word lexicon, a polarity lexicon, a compound-sentence word list, and a negation list. Sentence-level analysis extracts word features from the lexicons and rule-based word-sequence features, then classifies sentences with the maximum entropy model. The method achieved good results for both words and sentences in COAE2009.

An Overview of Learning to Rank for Information Retrieval. Dong, X.; Chen, X.; Guan, Y.; Xu, Z.; Li, S. Proc. WRI World Congress on Computer Science and Information Engineering, 2009.
This paper presents an overview of learning to rank. It includes three parts: related concepts, including the definitions of ranking and learning to rank; a summary of pointwise, pairwise, and listwise models; and evaluation measures such as Normalized Discounted Cumulative Gain and Mean Average Precision. Considering that current learning-to-rank models lack continual learning ability, we present a new continual learning idea that combines a multi-agent autonomous learning mechanism with a molecular immune mechanism for ranking.

基于Swarm的人工免疫网络算法研究 (An artificial immune network algorithm based on Swarm). 杜新凯, 关毅, 岳淑珍, 徐兴军. Microcomputer Information (微计算机信息), 2008, No. 18.
Intelligent information retrieval is one of the most important applications of the network era. Existing machine learning theories and methods struggle with the dynamics of networked data and the diversity of user interests, which remains a weak point in intelligent information retrieval research. Drawing on the characteristics and principles of the natural immune system, this paper designs and implements an artificial immune network algorithm on the Swarm software platform. The algorithm is built on the current understanding of the natural immune system, exhibits its main characteristics, and is successfully applied to a simple pattern recognition problem. The prospects of applying artificial immune systems, as a new machine learning mechanism, to intelligent information retrieval are discussed.

基于词聚类特征的统计中文组块分析模型 (A statistical Chinese chunking model with word-cluster features). 孙广路, 王晓龙, 关毅. Acta Electronica Sinica (电子学报), 2008, 36(12).
This paper proposes a hierarchical word clustering algorithm based on information entropy and applies the resulting word clusters as features in a Chinese chunking model. The clustering algorithm uses the words and chunk tags in a Chinese chunking corpus as basic information and forms clusters with a binary hierarchical method so that the clusters carry syntactic function; an optimization is designed to reduce clustering time. The cluster features replace traditional part-of-speech features in the chunking model, named entity and quasi-word recognition modules are added, and a Chinese chunking system based on maximum entropy Markov models is built on this basis. Experiments show that the algorithm improves clustering efficiency and that the cluster features effectively improve the performance of the chunking system.

A New Measurement of Systematic Similarity. Yi Guan, Xiaolong Wang, Qiang Wang. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 38.
The relationship of similarity may be the most universal relationship that exists between every two objects in either the material world or the mental world. Although similarity modeling has been the focus of cognitive science for decades, many theoretical and practical issues are still under controversy. In this paper, a new theoretical framework is presented that conforms to the nature of similarity and incorporates current similarity models into a universal model. The new model, the systematic similarity model, is inspired by the contrast model of similarity and structure mapping theory in cognitive psychology, and is a universal similarity measurement with many potential applications in text, image, or video retrieval. The text relevance ranking experiments undertaken in this research tentatively show the validity of the new model.

Recent advances on NLP research in Harbin Institute of Technology. Tiejun Zhao, Yi Guan, Ting Liu, Qiang Wang. Frontiers of Computer Science in China, 1(4): 413-.
In the sixties of the last century, researchers at Harbin Institute of Technology (HIT) began relevant research in natural language processing. After more than 40 years of effort, HIT has established three research laboratories for Chinese information processing: the Machine Intelligence and Translation Laboratory (MI&T Lab), the Intelligent Technology and Natural Language Processing Laboratory (ITNLP), and the Information Retrieval Laboratory (IR-Lab). It now has a well-balanced research team of over 200 people, including doctoral supervisors, professors, associate professors, lecturers, and Ph.D. and master's candidates, with research interests extending to language processing, machine translation, text retrieval, and other fields. In the course of this research, HIT has accumulated a set of key techniques and data resources, won many prizes in technical evaluations at home and abroad, and has become one of the most important bases for natural language processing teaching and research in China.

基于多知识源的中文词法分析系统 (A Chinese lexical analysis system based on multiple knowledge sources). 姜维, 王晓龙, 关毅, 赵健. Chinese Journal of Computers (计算机学报), 2007, No. 1.
Chinese lexical analysis is the first task of Chinese natural language processing. This paper examines the problems faced by word segmentation, part-of-speech tagging, and named entity recognition, as well as the cooperation among them, and describes a practical Chinese lexical analysis system built on mixed language models, with different models targeted at the different problems. The segmentation system participated in the Second International Chinese Word Segmentation Bakeoff in 2005, obtaining F-measures of 97.2% and 96.7% in the Microsoft Research Asia and Peking University open tests respectively. In the open evaluation on the PKU-annotated People's Daily corpus, part-of-speech tagging reached a precision of 96.1% and named entity recognition an F-measure of 88.6%.
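The dynamic-programming tree distance described in the web-page structure tree entry above can be sketched as follows. This is a simplified top-down edit distance over ordered label trees, an assumption standing in for the paper's exact algorithm; the tiny tag trees are invented examples:

```python
# Sketch: distance between two web-page tag trees.  Each tree is a tuple
# (label, [children]).  A dynamic program aligns the two child sequences;
# deleting or inserting a child costs the size of its whole subtree.

def size(t):
    label, children = t
    return 1 + sum(size(c) for c in children)

def dist(t1, t2):
    """Edit distance between two ordered label trees."""
    relabel = 0 if t1[0] == t2[0] else 1
    c1, c2 = t1[1], t2[1]
    m, n = len(c1), len(c2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(c1[i - 1])       # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(c2[j - 1])       # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + size(c1[i - 1]),
                          d[i][j - 1] + size(c2[j - 1]),
                          d[i - 1][j - 1] + dist(c1[i - 1], c2[j - 1]))
    return relabel + d[m][n]

page_a = ("html", [("body", [("div", []), ("p", [])])])
page_b = ("html", [("body", [("div", []), ("p", []), ("p", [])])])
print(dist(page_a, page_b))  # 1: one extra <p> subtree
```

A distance normalized by the total tree sizes can then serve as a page-similarity score for separating same-class from different-class pages.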
基于特征类别属性分析的文本分类器分类噪声裁剪方法 (Eliminating class noise for text classifiers by analyzing the category attributes of features). 王强, 关毅, 王晓龙. Acta Automatica Sinica (自动化学报), 2007, No. 8.
This paper proposes an algorithm for eliminating class noise (ECN) during text classification using the category attributes of text features. By analyzing the category-indicative information carried by key features, the algorithm predicts in advance the candidate category set a document may belong to, reducing the number of classifiers involved in the decision, lowering classification latency, and improving precision. Experiments on Chinese and English test corpora show F-measures of 0.76 and 0.93 respectively, with clearly improved classifier efficiency and good overall performance. Further experiments show that the algorithm extends well: combined with a feedback learning strategy, the F-measures rise to 0.806 and 0.943.

A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation. Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou, Minghui Li, Yi Guan. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
Inspired by previous preprocessing approaches to SMT, this paper proposes a novel probabilistic approach to reordering which combines the merits of syntax-based and phrase-based SMT. Given a source sentence and its parse tree, our method generates, by tree operations, an n-best list of reordered inputs, which are then fed to a standard phrase-based decoder to produce the optimal translation. Experiments show that, for the NIST MT-05 task of Chinese-to-English translation, the proposal leads to a BLEU improvement of 1.56%.

基于标题类别语义识别的文本分类算法研究 (A text classification algorithm based on recognizing category semantics in titles). 王强, 关毅, 王晓龙. Journal of Electronics & Information Technology (电子与信息学报), Vol. 29, No. 12.
This paper proposes a text classification algorithm based on recognizing the category semantics of titles. The algorithm constructs the classification feature space with a category-information-based feature selection strategy, predicts a document's candidate categories by recognizing the category semantics of the feature words in its title, and finally runs the classifier within the candidate category space. Experiments show that the algorithm significantly improves classification precision while effectively reducing the number of candidate categories, and validation with a representation-efficiency measure for the category space further shows that the algorithm improves the performance of the text representation space.

Using Maximum Entropy Model to Extract Protein-Protein Interaction Information from Biomedical Literature. Chengjie Sun, Lei Lin, Xiaolong Wang, Yi Guan. Lecture Notes in Computer Science (Advanced Intelligent Computing Theories and Applications), Volume 46.
Protein-protein interaction (PPI) information plays a vital role in biological research. This work proposes a two-step machine-learning-based method to extract PPI information from biomedical literature. Both steps use a Maximum Entropy (ME) model. The first step is designed to estimate whether a sentence in the literature contains PPI information; the second judges whether each protein pair in a sentence interacts. The two steps are combined by adding the outputs of the first step to the model of the second step as features. Experiments show the method achieves a total accuracy of 81.9% on the BC-PPI corpus, and the outputs of the first step effectively boost the performance of PPI information extraction.

Rich features based Conditional Random Fields for biological named entities recognition. Chengjie Sun, Yi Guan, Xiaolong Wang, Lei Lin. Computers in Biology and Medicine, Volume 37.
Biological named entity recognition is a critical task for automatically mining knowledge from biological literature. In this paper, the task is cast as a sequential labeling problem and a Conditional Random Fields model is introduced to solve it. Within this framework, rich features including literal, context, and semantic features are involved; shallow syntactic features are introduced for the first time, effectively improving the model's performance. Experiments show that our method can achieve an F-measure of 71.2% on open evaluation data, which is better than most state-of-the-art systems.

Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins. Qiwen Dong, Xiaolong Wang, Lei Lin, Yi Guan. BMC Bioinformatics, 2007; 8.
Background: Recognition of binding sites in proteins is a direct computational approach to characterizing proteins in terms of biological and biochemical function. Residue preferences have been widely used in many studies, but the results are often not satisfactory. Although different amino acid compositions among the interaction sites of different complexes have been observed, such differences have not been integrated into the prediction process, and evolutionary information has not been exploited to obtain a more powerful propensity. Results: In this study, the residue interface propensities of four kinds of complexes (homo-permanent, homo-transient, hetero-permanent, and hetero-transient) are investigated. These propensities, combined with sequence profiles and accessible surface areas, are input to a support vector machine for the prediction of protein binding sites. The propensities are further improved by taking evolutionary information into consideration, yielding a class of novel profile-level propensities, the binary profile interface propensities. Experiments are performed on 1139 non-redundant protein chains. Although different residue interface propensities are observed among different complexes, the improvement from residue-level propensities is negligible compared with a classifier without propensities, whereas the binary profile interface propensities significantly improve binding-site prediction by about ten percent in terms of both precision and recall. Conclusion: Residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins, while propensities at the profile level are more accurate than those at the residue level and significantly improve binding-site prediction.
基于支持向量机的音字转换模型 (A support vector machine model for pinyin-to-character conversion). 姜维, 关毅, 王晓龙, 刘秉权. Journal of Chinese Information Processing (中文信息学报).
Because n-gram models cannot easily incorporate additional features in pinyin-to-character conversion, this paper proposes a conversion model based on support vector machines (SVM), providing a framework that can fuse multiple knowledge sources. The superior generalization ability of the SVM alleviates the overfitting of traditional models, and soft-margin classification partly overcomes noise in small samples. In addition, rough set theory is used to extract complex and long-distance features, which are fused into the SVM model, overcoming the difficulty traditional models have in realizing long-distance constraints. Experiments show the SVM conversion model improves precision by 1.2% over a trigram model with absolute-discount smoothing; with long-distance features added, …

一个基于免疫机制的在线机器学习算法 (An online machine learning algorithm based on immune mechanisms). 何晏成, 关毅, 岳淑珍. The 3rd National Conference on Information Retrieval and Content Security.
Based on principles of the human immune system such as the immune response mechanism and immune network theory, this paper proposes a new online machine learning algorithm and applies it to tuning the knowledge-base parameters of an intelligent information retrieval system. Experiments show that the algorithm adapts well and can learn continually in a dynamic environment.

A Maximum Entropy Chunking Model with N-fold Template Correction. Sun Guanglu, Guan Yi, Wang Xiaolong. Journal of Electronics (China), 2007, 24(5).
This letter presents a new chunking method based on a Maximum Entropy (ME) model with an N-fold template correction model. First, two types of machine learning models are described. Based on an analysis of the two models, a chunking model is proposed that combines the benefits of the conditional probability model and the rule-based model. The selection of features and rule templates in the chunking model is discussed. Experimental results on the CoNLL-2000 corpus show that this approach achieves an impressive F-score of 92.93%; compared with the ME model and the ME Markov model, the new chunking model achieves better performance.

An Improved Feature Representation Method for Maximum Entropy Model Data Mining. Guan Yi, Zhao Jian. Data Mining Workshops, ICDM Workshops 2006, Sixth IEEE International Conference.
In the maximum entropy model (MEM), features are typically represented by either a 0-1 binary-valued function or a real-valued function. However, both representations examine only the impact of specific values of attributes, not their types. This negligence not only decreases classification precision but also slows the convergence of the generalized iterative scaling (GIS) algorithm, which becomes more apparent on incomplete data. In this paper, an improved feature representation method is presented in which a feature is composed of two parts: one for the specific value of an attribute and one for the type of the corresponding attribute. Experimental results on the Mushroom dataset of the UCI repository show that the average classification precision on incomplete and complete data improved by 1.5% and 3.0% respectively, and the average convergence speed improved by 42.9% and 90.7% respectively.

Biomedical Named Entities Recognition Using Conditional Random Fields Model. Chengjie Sun, Yi Guan, Xiaolong Wang, Lei Lin. Lecture Notes in Computer Science (Fuzzy Systems and Knowledge Discovery), Volume 422.
Biomedical named entity recognition is a critical task for automatically mining knowledge from biomedical literature. In this paper, we introduce a Conditional Random Fields model to recognize biomedical named entities in the literature, involving rich features including literal, context, and semantic features. Shallow syntactic features are introduced to the model for the first time, performing boundary detection and semantic labeling at the same time, which effectively improves the model's performance. Experiments show that our method achieves an F-measure of 71.2% on the JNLPBA test data, better than most state-of-the-art systems.

Exploring Efficient Feature Inference and Compensation In Text Classification. Qiang Wang, Yi Guan, Xiaolong Wang. Journal of Chinese Language and Computing, 2006, 16.
This paper explores the feasibility of constructing an integrated framework for feature inference and compensation (FIC) in text classification. In this framework, feature inference devises intelligent pre-fetching mechanisms that allow prejudging the candidate class labels of unseen documents using the category information linked to features, while feature compensation revises the currently accepted feature set by learning new feature values or removing incorrect ones through the classifier results. The feasibility of the novel approach has been examined with SVM classifiers on the Chinese Library Classification (CLC) and Reuters-21578 datasets. The experimental results evaluate the effectiveness and efficiency of the proposed FIC approach.

A Novel Feature Selection Method Based on Category Information Analysis for Class Prejudging in Text Classification. Qiang Wang, Yi Guan, XiaoLong Wang, Zhiming Xu. International Journal of Computer Science and Network Security, 2006, 6.
This paper presents a new feature selection algorithm based on category information analysis in text classification. The algorithm obscures or reduces the noise of text features by computing feature contributions from word and document frequency and by introducing a variance mechanism to mine latent category information. It is distinguished from others by providing a pre-fetching technique for the classifier while remaining compatible with efficient feature selection: the classifier can actively prejudge the candidate class labels of unseen documents using the category information linked to features and classify them within the candidate class space to cut time expenses. Experimental results on Chinese and English corpora show high performance, with F-measures of 0.73 and 0.93 respectively and greatly improved classifier run efficiency.
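The document-frequency-plus-variance idea in the feature-selection abstract above can be sketched as follows. The tiny corpus and the exact scoring formula are illustrative assumptions, not the paper's method:

```python
# Sketch: a term whose document frequency is spread unevenly across
# categories is a stronger category indicator.  Score = variance of the
# per-category relative document frequencies.  Corpus is invented.

def category_scores(docs):
    """docs: list of (category, set_of_terms). Returns {term: variance score}."""
    cats = sorted({c for c, _ in docs})
    ndocs = {c: 0 for c in cats}
    df = {}                                   # term -> per-category doc freq
    for c, terms in docs:
        ndocs[c] += 1
        for t in terms:
            df.setdefault(t, {k: 0 for k in cats})[c] += 1
    scores = {}
    for t, by_cat in df.items():
        rates = [by_cat[c] / ndocs[c] for c in cats]
        mean = sum(rates) / len(rates)
        scores[t] = sum((r - mean) ** 2 for r in rates) / len(rates)
    return scores

docs = [("sports", {"ball", "team"}), ("sports", {"ball", "win"}),
        ("finance", {"stock", "win"}), ("finance", {"stock", "bank"})]
s = category_scores(docs)
# "ball" and "stock" are category-specific; "win" occurs once in each category.
print(s["ball"] > s["win"])  # True
```

Terms concentrated in one category get high variance scores and survive selection, while evenly spread terms score near zero, which is the category-indicative signal the classifier's prejudging step relies on.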
SVM-Based Spam Filter with Active and Online Learning. Qiang Wang, Yi Guan, Xiaolong Wang. Proceedings of the Text REtrieval Conference, Spam Filtering Task (TREC 2006).
A realistic classification model for spam filtering should take into account not only the fact that spam evolves over time, but also that labeling a large number of examples for initial training can be expensive in terms of both time and money. This paper addresses the problem of separating legitimate emails from unsolicited ones with an active and online learning algorithm, using a Support Vector Machine (SVM) as the base classifier. We evaluate its effectiveness using a set of goodness criteria on the TREC 2006 spam filtering benchmark datasets, and promising results are reported.

Answer Extraction Based on System Similarity Model and Stratified Sampling Logistic Regression in Rare Data. Peng Li, Yi Guan, Xiaolong Wang, Yongdong Xu. IJCSNS International Journal of Computer Science and Network Security, Vol. 6, No. 3.
This paper provides a novel and efficient method for extracting exact textual answers from the documents returned by a traditional IR system over a large-scale text collection. The main intended contribution is to propose the System Similarity Model (SSM), which can be considered an extension of the vector space model (VSM) for ranking passages, and to apply a binary logistic regression model (LRM), seldom used in information extraction, to extract specific information from candidate data sets. Because the parameters are estimated from data with serious sparseness, we adopt stratified sampling and improve the traditional parameter-estimation methods of logistic regression. A series of experimental results shows that the overall performance of our system is good and our approach is effective. Our system, Insun05QA1, participated in the QA track of TREC 2005 and obtained excellent results.
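The online-learning loop behind a filter like the TREC 2006 spam entry above can be sketched with a toy stand-in. A perceptron update replaces the SVM base classifier here, and the deterministic feature hash and four-message stream are invented for illustration:

```python
# Toy online spam filter: the filter scores each incoming message, then
# updates its weights from the true label before the next message arrives.
# A perceptron stands in for the SVM base classifier of the paper.

def features(text, dim=64):
    v = [0.0] * dim
    for tok in text.lower().split():
        v[sum(ord(c) for c in tok) % dim] += 1.0   # tiny deterministic hash
    return v

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.0] * 64
stream = [("win money now", 1), ("meeting at noon", -1),
          ("cheap money win", 1), ("lunch at noon", -1)]

for text, label in stream:           # messages arrive one at a time
    x = features(text)
    if label * predict(w, x) <= 0:   # mistake (or zero margin): update
        w = [wi + label * xi for wi, xi in zip(w, x)]

print(predict(w, features("win cheap money")) > 0)  # True: flagged as spam
```

Active learning, as in the paper, would additionally ask for labels only on the messages the current model is least certain about, rather than on every message.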
应用粗糙集理论提取特征的词性标注模型 (A part-of-speech tagging model with rough-set feature extraction). 姜维, 王晓龙, 关毅, 徐志明. High Technology Letters (高技术通讯), 2006, No. 10.
To address complex feature extraction in part-of-speech tagging, rough set theory is applied to effectively mine complex features, including long-distance features, while handling corpus noise. The features are fused into a maximum entropy model, with weights assigned during training according to the overall performance of the model. Open tests show a tagging precision of 96.29% after adding the rough-set rules, 0.83% higher than the original model.

A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models. Jiang Wei, Guan Yi, Wang Xiao-Long. International Journal of Computational Linguistics and Chinese Language Processing, Volume 11.
A pragmatic Chinese word segmentation approach based on mixed language models is presented in this paper. Chinese word segmentation is composed of several hard sub-tasks, which usually encounter different difficulties; the authors apply the corresponding language model to each special sub-task so as to take advantage of each model. First, a class-based trigram is adopted in basic word segmentation, applying the absolute discount smoothing algorithm to overcome data sparseness; the Maximum Entropy model (ME) is also used to identify named entities. Second, the authors apply rough sets and average mutual information, among other techniques, to extract special features. Finally, some features are extended through the combination of word clusters and a thesaurus. The system participated in the Second International Chinese Word Segmentation Bakeoff and achieved F-measures of 96.7 and 97.2 in the PKU and MSRA open tests, respectively.

Conditional Random Fields Based Label Sequence and Information Feedback. Wei Jiang, Yi Guan, Xiao-Long Wang. Lecture Notes in Computer Science (Natural Language Processing and Expert Systems), Volume 41.
Part-of-speech (POS) tagging and shallow parsing are sequence modeling problems, and HMMs and other generative models are not the most appropriate for labeling sequential data. Compared with HMMs, Maximum Entropy Markov models (MEMM) and other discriminative finite-state models can easily fuse more features, but they suffer from the label bias problem. This paper presents a method of Chinese POS tagging and shallow parsing based on conditional random fields (CRF), new discriminative sequential models that can incorporate many rich features and avoid the label bias problem. Moreover, we propose information feedback from syntactic analysis to lexical analysis, since natural language understanding is in nature an interaction of multiple knowledge sources. Experiments show that the CRF approach achieves a 0.70% F-score improvement in POS tagging and a 0.67% improvement in shallow parsing, and the effectiveness of information feedback is confirmed for some complicated multi-class words.

Applying Rough Sets in Word Segmentation Disambiguation Based on Maximum Entropy Model. Jiang, W., X.-L. Wang, Y. Guan, and G.-H. Liang. Journal of Harbin Institute of Technology (New Series), 13(1).
To solve the complicated feature extraction and long-distance dependency problems in word segmentation disambiguation (WSD), this paper proposes applying rough sets to WSD based on the maximum entropy model. First, rough set theory is applied to extract complicated features and long-distance features, even from noisy or inconsistent corpora. Second, these features are added to the maximum entropy model, so that feature weights can be assigned according to the performance of the whole disambiguation model. Finally, a semantic lexicon is adopted to build class-based rough set features to overcome data sparseness. Experiments indicate that the method performs better than previous models, which ranked top in WSD in the 863 Evaluation in 2003; this system ranked first and second respectively in the MSR and PKU open tests of the Second International Chinese Word Segmentation Bakeoff held in 2005.
论文标题 Improving Feature extraction in Named Entity Recognition based on Maximum Entropy Model 作者 Jiang, W., Y. Guan, and X.-L. Wang 发表时间 期刊名称 2006 International Conference on Machine Learning and Cybernetics (ICMLC2006) 期卷 简单介绍 A new method of improving feature extraction for Named Entity Recognition is proposed in this paper. First of all, the context features and the entity features are extracted by the corresponding algorithms. Triggers extracted by Mutual Information, Information Gain, Average Mutual Information, etc., are adopted to enhance the context features, and rough set theory is used to extract the entity features. Secondly, a word cluster method is presented to improve the approach of expanding features, which makes feature selection easier and overcomes the sparse data problem effectively. Finally, all the features are added into the maximum entropy model. The experiments have confirmed that our method is effective. The above method has been used in our word segmenter, which participated in the International SIGHAN-2005 Evaluation and ranked first in the open test on the MSR corpus. 论文标题 Improving Sequence Tagging using Machine-Learning Techniques 作者 Wei Jiang, Xiao-Long Wang, Yi Guan 发表时间 期刊名称 2006 International Conference on Machine Learning and Cybernetics (ICMLC2006) 期卷 简单介绍 This paper presents an excellent sequence tagging approach based on combined machine learning methods. Firstly, conditional random fields (CRF) are presented as a new kind of discriminative sequential model that can incorporate many rich features and well avoid the label bias problem, which is the limitation of maximum entropy Markov models (MEMM) and other discriminative finite-state models. Secondly, the support vector machine is improved to adapt to the sequential tagging task. Finally, these improved models and other existing models are combined together, achieving state-of-the-art performance.
Experimental results show that the CRF approach achieves a 0.70% improvement in POS tagging and a 0.67% improvement in shallow parsing. Moreover, our combination method achieves F-measures of 93.73% and 93.69% in the above two tasks respectively, which is better than any sub-model. 论文标题 An Improved Unknown Word Recognition Model based on Multi-Knowledge Source Method 作者 Jiang, W., Y. Guan, and X.-L. Wang 发表时间 期刊名称 6th International Conference on Intelligent Systems Design and Applications (ISDA'06) 期卷 vol 2, 20 简单介绍 Unknown word recognition (UWR) is a difficult and foundational task in lexical processing and content-based understanding, and it can improve many text-based processing applications, such as Information Extraction, Question Answering systems, and Electronic Meeting Systems. However, a unified approach finds it difficult to exploit more domain knowledge features, so the performance cannot be further improved easily, since UWR has been proved to be an NP-hard problem. This paper presents a novel method for the UWR task, which divides UWR into several hard sub-tasks that usually encounter different difficulties; accordingly, several language models are adopted to solve the special sub-tasks, so as to exert the ability of each model in addressing special problems. Firstly, a class-based trigram is used in basic word segmentation, aided with the absolute smoothing algorithm to overcome data sparseness, and a Maximum Entropy Model (ME) is used to recognize Named Entities. New word detection adopts variance and the Conditional Random Fields algorithm. Secondly, Multi-Knowledge features are effectively extracted and utilized in the whole processing. Our system participated in the Second International Chinese Word Segmentation Bakeoff (SIGHAN2005), and got an overall performance of 97.2% F-measure in the MSRA open test. 论文标题 A Pragmatic Chinese Word Segmentation System 作者 Jiang, W., Y. Guan, and X.-L.
Wang 发表时间 期刊名称 Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing 期卷 简单介绍 This paper presents our work for participation in the Third International Chinese Word Segmentation Bakeoff. We apply several processing approaches according to the corresponding sub-tasks, which are exhibited in real natural language. In our system, a Trigram model with a smoothing algorithm is the core module in word segmentation, and a Maximum Entropy model is the basic model in the Named Entity Recognition task. The experiment indicates that this system achieves an F-measure of 96.8% in the MSRA open test in the Third SIGHAN-2006 Bakeoff. 论文标题 基于条件随机域的词性标注模型 作者 姜维 关毅 王晓龙 发表时间 期刊名称 计算机工程与应用 期卷 21期 简单介绍 词性标注主要面临兼类词消歧以及未知词标注的难题,传统隐马尔科夫方法不易融合新特征,而最大熵马尔科夫模型存在标注偏置等问题。本文引入条件随机域建立词性标注模型,易于融合新的特征,并能解决标注偏置的问题。此外,又引入长距离特征有效地标注复杂兼类词,以及应用后缀词与命名实体识别等方法提高未知词的标注精度。在条件随机域模型框架下,本文进一步探讨了融合模型的方法及性能。词性标注开放实验表明,条件随机域模型获得了96.10%的标注精度。 论文标题 A Novel Dynamic Adaptive Method Based on Artificial Immune System in Chinese Named Entity Recognition 作者 Wei Jiang, Yi Guan, XiaoLong Wang 发表时间 期刊名称 International Journal of Computer Science and Network Security 期卷 Vol. 6 No. 简单介绍 Named Entity Recognition (NER), as a task of providing important semantic information, is a critical first step in information extraction and question answering systems. NER has been proved to be an NP-hard problem, and the existing methods usually adopt supervised or unsupervised learning models; as a result, there is still a distance away from the required performance in real applications, and the system can hardly be improved once the model is applied. This paper proposes a novel method based on an artificial immune system (AIS) for NER. We apply the clonal selection principle and affinity maturation of the vertebrate immune response, where the secondary immune response has higher performance than the primary immune response, and similar antigens may have a good immunity.
We also introduce the reinforcement learning method into our system to tune the immune response, and the context features are exploited by the maximum entropy principle. The experimental results indicate that our method exhibits a good performance and implements the dynamic learning behavior. 论文标题 InsunQA06 on QA track of TREC2006 作者 Zhao, Y., Xu, Z., Li, P., & Guan, Y 发表时间 期刊名称 Fifteenth Text REtrieval Conference (TREC 2006). 期卷 简单介绍 This is the second time that our group has taken part in the QA track of TREC. We developed a question-answering system, named InsunQA06, based on our Insun05QA system, and with InsunQA06 we participated in the Main Task, which submitted answers to three types of questions: factoid questions, list questions and others questions. The structure of InsunQA06 is similar to that of Insun05QA. Compared with Insun05QA, the main difference of InsunQA06 is that new methods are developed and used in the answer extraction module for factoid and “others” questions, and external knowledge such as knowledge from the Internet plays a more important role in answer extraction. Besides that, we built the document retrieval module of InsunQA06 on Indri instead of SMART. 论文标题 Classifying Incomplete Data based on Maximum Entropy Model with New Feature Compensating 作者 Zhao Jian, Xiao-long Wang, Guan Yi, Lin Lei 发表时间 期刊名称 Journal of Electronics 期卷 2006, Vol. 简单介绍 For incomplete data classifying, the MEM (Maximum Entropy Model) trained by the GIS (Generalized Iterative Scaling) algorithm utilizes a globally unique compensating feature to offset the effect of missing attributes of some samples in order to satisfy the constraint of GIS. However, this kind of compensating strategy neglects the fact that different features have different effects on the classification result. Hence, in this paper, an improved compensating strategy, taking the effects of both different feature types and label types into account, is first proposed to overcome the shortage of the traditional method.
Experimental results on the Mushroom data set from the UCI data repository show that the new method is feasible and effective. The average error rate is reduced by about 68.3% and 33.5% respectively on two kinds of experimental datasets. 论文标题 Research on Chinese Named Entity Recognition Base on Conditional Random Fields 作者 Zhao Jian, Xiao-long Wang, Guan Yi, Xu Zhiming 发表时间 期刊名称 Journal of Electronics 期卷 2006, Vol. 简单介绍 Chinese named entity recognition (CNER) is an important and difficult task in the Chinese information processing domain. In this paper, a new probabilistic model, conditional random fields (CRF), which is very fit for labeling sequence data, is first introduced to the task of CNER. Unlike the generative model, CRF makes no effort on observation modeling and can utilize rich overlapped features; moreover, it can avoid the label bias problem of the discriminative model. In order to perform CNER, special features including morphology features, n-gram features, lexicon features and combined features are selected to capture informative traits of the Chinese language. Experiments of 6-fold cross validation on half a year of People’s Daily show that, among four typical kinds of probabilistic models, CRF outperforms the other three models. This approach can achieve an overall F-measure around 85%. 论文标题 一种改进的Wu-Manber 多模式匹配算法及应用 作者 孙晓山 王强 关毅 王晓龙 发表时间 期刊名称 中文信息学报 期卷 2006年02期 简单介绍 本文针对Wu-Manber多模式匹配算法在处理后缀模式情况下的不足,给出了一种改进的后缀模式处理算法,减少了匹配过程中字符比较的次数,提高了算法的运行效率。本文在随机选择的TREC2000的52,067篇文档上进行了全文检索实验, 对比了Wu-Manber算法、使用后缀模式的改进算法、不使用后缀模式的简单改进等三种算法的匹配过程中字符比较的次数。实验结果说明,本文的改进能够比较稳定的减少匹配过程中字符比较的次数,提高匹配的速度和效率。 论文标题 中文名实体识别:基于词触发对的条件随机域方法 作者 赵健 王晓龙 关毅 徐志明 发表时间 期刊名称 高技术通讯 期卷 2006年08期 简单介绍 首次把条件随机域(CRF)模型应用到了中文名实体识别中,且根据中文的特点,定义了多种特征模板。同时,为了解决长距离约束问题,将词语触发对融合到了CRF模型中。提出了基于词语方差(word variance)的选词方法,在词语相关性计算上,采用了平均互信息(AMI)方法和 统计量方法。通过在半年人民日报上的测试,结果表明在采用相同特征集合的条件下,条件随机域模型较其他概率模型有更好的性能表现;融合长距离触发对的条件随机域模型可以使系统的F量度提高约1.38%。 论文标题 Chinese Word Segmentation based on Mixing Model 作者 Jiang, W., J. Zhao, Y.
Guan, and Z.-M. Xu 发表时间 期刊名称 The 4th SIGHAN Workshop 期卷 简单介绍 This paper presents our recent work for participation in the Second International Chinese Word Segmentation Bakeoff. According to difficulties, we divide word segmentation into several sub-tasks, which are solved by mixed language models, so as to take advantage of each approach in addressing special problems. The experiment indicated that this system achieved 96.7% and 97.2% in F-measure in PKU and MSR open test respectively. 论文标题 基于数据挖掘思想的网页正文抽取方法的研究 作者 蒲宇达,关毅,王强 发表时间 期刊名称 第三届学生计算语言学研讨会论文集 期卷 简单介绍 为了把自然语言处理技术有效的运用到网页文档中,本文提出了一种依靠数据挖掘思想,从中文新闻类网页中抽取正文内容的方法. 论文标题 融合聚类触发对特征的最大熵词性标注模型 作者 赵岩,王晓龙,刘秉权,关毅 发表时间 期刊名称 计算机研究与发展 期卷 2006,43(00 简单介绍 为解决传统HMM词性标注模型不能包含远距离词特征的问题,提出了形如"WA→WB/TB"的触发对来承载远距离词特征信息,并采用平均互信息量度对触发对特征进行选择.在最大熵框架下,将选择后的触发对特征加入到词性标注系统中. 论文标题 文档聚类综述 作者 刘远超,王晓龙,徐志明,关毅 发表时间 期刊名称 中文信息学报 期卷 2006,20(00 简单介绍 聚类作为一种自动化程度较高的无监督机器学习方法 ,近年来在信息检索、多文档自动文摘等领域获得了广泛的应用。本文首先讨论了文档聚类的应用背景和体系结构 ,然后对文档聚类算法、聚类空间的构造和降维方法、文档聚类中的语义问题进行了综述。最后还介绍了聚类质量评测问题 论文标题 K-NN 与 SVM 相融合的文本分类技术研究 作者 王强,王晓龙,关毅,徐志明 发表时间 期刊名称 高技术通讯 期卷 2005,15(00 简单介绍 本文提出了一种改进的 K-NN (K Nearest Neighbor)与 SVM (Support Vector Machine)相融合的文本分类算法。该算法利用文本聚类描述 K-NN 算法中文本类别的内部结构,用 sigmoid 函数对 SVM输出结果进行概率转换,同时引入 CLA(Classifier’s Local Accuracy)技术进行分类可信度分析以实现两种算法的融合。实验表明该算法综合了 K-NN 与 SVM 在分类问题中的优势,既有效地降低了分类候选的数目,又相应地提高了文本分类的精度,具有较好的性能。 论文标题 基于矢量空间模型和最大熵模型的词义问题解决策略 作者 赵岩,王晓龙,刘秉权,关毅 发表时间 期刊名称 高技术通讯 期卷 2005,15(00 简单介绍 词义问题是自然语言处理中的核心问题之一,尤其在汉语这种轻语法、重意义的语言中更是如此。本文针对单义词的词义问题构建了融合触发对(trigger pair)的矢量空间模型用来进行词义相似度的计算,并以此为基础进行了词语的聚类;针对多义词的词义问题应用融合远距离上下文信息的最大熵模型进行了有导词义消歧的研究。为克服以往词义消歧评测中通过人工构造带有词义标记的测试例句而带来的覆盖程度小、主观影响大等问题,本文将模型的评测直接放到了词语聚类和分词歧义这两个实际的应用中。分词歧义的消解正确率达到了 92%,词语聚类的结果满足进一步应用的需要。 论文标题 论系统相似的度量 作者 关毅,王晓龙,王强 发表时间 期刊名称 全国第八届计算语言学联合学术会议 (JSCL-2005) 论文集 期卷 简单介绍 本文阐明了系统相似度计算的基本原理,提出了一种新的系统相似度计算函数,论证了该函数的代数特点。作为系统相似度计算的应用之一,本文进而提出了一种新的信息检索模型-系统相似模型,论证了向量空间模型为该模型的特例,且该模型能有效地弥补向量空间模型的缺陷。 论文标题 基于 Cover 级别的中文信息检索技术的研究 作者 包刚,关毅,王强,赵健 发表时间 期刊名称 计算机工程与应用 期卷 2005,41(02 简单介绍 
信息检索系统如果能较精确地定位于文章中用户关心的部分必将提高用户的检索效率。基于 Cover级别的检索策略就是针对上述问题提出的。基于 Cover 级别的检索策略以用户查询的关键词集合作为输入,在被检索文档中找到包含关键词集合的最短文本片断集作为输出。本文采用了一种经过改进的基于 Cover 级别的检索策略,对系统返回的文本片断作了限制,并在检索过程中使用了贪心算法(Greedy Algorithm)的思想,最后将其应用到中文信息检索系统中。实验证明,采用改进的策略比原有的基于 Cover 级别的检索策略在返回有效结果个数和平均排序倒数(MRR)等指标上都有了提高。 论文标题 多文档文摘中基于语义相似度的最大边缘相关技术研究 作者 刘寒磊,关毅,徐永东 发表时间 期刊名称 全国第八届计算语言学联合学术会议 (JSCL-2005) 期卷 简单介绍 多文档自动文摘致力于从多篇文档中将全面、简洁的摘要性文档呈现给用户,提高用户获取信息的效率。本文提出了基于语句级语义相似度的最大边缘相关方法来选取文摘句,为生成高质量的文摘提供文摘单元支持。实验结果表明,与基于相关度大小排序选择文摘句的方法相比,系统的精确率和召回率明显提高;直观的评测可以看出该方法使生成文摘内容间的冗余度大大降低,信息覆盖面更广,概括性和可读性较强,能够达到较好的质量。 论文标题 Automatic Text Summarization Based on Lexical Chains 作者 Yanmin Chen, Xiaolong Wang, Guan Yi 发表时间 期刊名称 ICNC (1) 期卷 简单介绍 The method of lexical chains is introduced for the first time to generate summaries from Chinese texts. The algorithm that computes lexical chains based on the HowNet knowledge database is modified to improve the performance and suit Chinese summarization. Moreover, the construction rules of lexical chains are extended, and relationships among more lexical items are used. The algorithm constructs lexical chains first, and then strong chains are identified and significant sentences are extracted from the text to generate the summary. Evaluation results show that the performance of the system has a notable improvement in both precision and recall compared to the original system.
论文标题 基于上下文平均互信息的问句查询扩展模型 作者 邵兵,关毅,王强,王晓龙,任瑞春 发表时间 期刊名称 第二届全国学生计算语言学研讨会 期卷 简单介绍 信息检索中存在用词歧义的问题,在中文自然语言查询处理中,表达差异问题更加突出。提出了一种基于上下文互信息的问句查询扩展模型,模型首先对训练集文档中的词或词组进行相关分析,计算每对词或词组间的互信息,然后再利用中文语义网与同义词资源进行中文信息检索的查询扩展。实验结果表明,该方法适宜改进Web 上的信息检索,相对一般的查询扩展算法可以大幅度提高各项指标。 论文标题 A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization 作者 Qiang Wang, Xiaolong Wang, Guan Yi 发表时间 期刊名称 IJCNLP 期卷 简单介绍 This paper proposes the use of Latent Semantic Indexing (LSI) techniques, decomposed with the semi-discrete matrix decomposition (SDD) method, for text categorization. The SDD algorithm is a recent solution to LSI, which can achieve similar performance at a much lower storage cost. In this paper, LSI is used for text categorization by constructing new features of a category as combinations or transformations of the original features. In experiments on a data set of the Chinese Library Classification, we compare accuracy to a classifier based on k-Nearest Neighbor (k-NN), and the result shows that k-NN based on LSI is sometimes significantly better. Much future work remains, but the results indicate that LSI is a promising technique for text categorization. 论文标题 A Maximum Entropy Markov Model for Chunking 作者 Guang-Lu Sun, Yi Guan, Xiao-Long Wang, Jian Zha 发表时间 期刊名称 Proceedings of the Fourth International Conference on Machine Learning and Cybernetics 期卷 简单介绍 This paper presents a new chunking method based on maximum entropy Markov models (MEMM). MEMM is described in detail, combining transition probabilities and conditional probabilities of states effectively. The conditional probabilities of states are estimated by maximum entropy (ME) theory. The transition probabilities of the states are estimated by an N-gram model in which an interpolation smoothing algorithm is utilized on the basis of analyzing the chunking specification.
Experiment results show that this approach achieves an impressive performance: 92.53% in F-score on the open data sets of the CoNLL-2000 shared task. The performance of the algorithm is close to the state of the art. 论文标题 Automatic and efficient recognition of proper nouns based on maximum entropy model 作者 Peng Li, Yi Guan, Xiao Long Wang, Jun Sun 发表时间 期刊名称 ICMLC2005 期卷 简单介绍 This paper presents a high-performance method to identify English proper nouns (PNs) based on the maximum entropy model (MaxEnt). Most traditional PN recognition systems use lexical resources such as name lists; as new names are constantly coming into existence, these are necessarily incomplete. Therefore, machine learning methods are used to identify PNs automatically. In the framework of the MaxEnt model, semantic and lexical information of the surrounding words and the word itself, acting as atomic features, composes feature templates and forms features without requiring extra expert knowledge. The test on the WSJ portion of Penn Treebank II shows that this method guarantees high precision and recall, and at the same time it can reduce the quantity of features dramatically, downsize system space consumption, and decrease the time of training and testing, so as to improve the efficiency considerably. The method in this paper can be adapted to identify other specific nouns easily because the principle of the method is universal. 论文标题 Extracting answers to natural language questions from large-scale corpus 作者 Peng Li, Xiao Long Wang, Yi Guan, Yu Ming Zhao 发表时间 期刊名称 Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering 期卷 简单介绍 This paper provides a novel and tractable method for extracting exact textual answers from the returned documents that are retrieved by a traditional IR system in a large-scale collection of texts.
In our approach, WordNet and Web information are employed as external auxiliary resources to improve the performance, and then some NLP technologies are used to constitute the empirical answer ranking formula, such as POS tagging, Named Entity Recognition, and parsing. The method involves automatically ranking passages with the System Similarity Model, automatically downloading related Web pages by means of a Web crawler, and automatically mining answers with the empirical formula from candidate answer sets. The series of experimental results shows that the overall performance of our system is good and the structure of the system is reasonable. 论文标题 Analyzing the Incomplete Data based on the Improved Maximum Entropy Model 作者 Jian Zhao, XiaoLong Wang, Yi Guan, Lei Lin 发表时间 期刊名称 International Journal of Information Technology 期卷 vol.11, no 简单介绍 When the MEM (Maximum Entropy Model) trained by the GIS (Generalized Iterative Scaling) algorithm was used to analyze incomplete data, in order to satisfy the constraint of GIS, a globally unique compensating feature was introduced to offset the effect of missing attributes of some samples on the classification result. However, this kind of compensating strategy neglected a basic fact: different features had different effects on the classification result. In this paper, an improved compensating strategy was proposed to overcome the shortage of the traditional method, taking the effects of both different feature types and label types into account. Experiment results on the Mushroom data set from the UCI data repository showed that the new method was feasible and effective. The average error rate was reduced by about 33.5%. 论文标题 Using category-based semantic field for text categorization 作者 Qiang Wang, XiaoLong Wang, Yi Guan, ZhiMing Xu 发表时间 期刊名称 The 4th International Conference on Machine Learning and Cybernetics(ICMLC) 期卷 简单介绍 This paper proposes a new document representation method for text categorization.
It applies Category-based Semantic Field (CBSF) theory to text categorization to gain a more efficient representation of documents. The lexical chain is introduced to compute the CBSF, and HowNet is used as a lexical database. In particular, the title of each document functions as a clue to forecast the potential CBSF of the test document. Combined with a classifier, this approach is examined in text categorization, and the result indicates that it performs better than conventional methods with features chosen on the basis of the bag-of-words (BOW) system, on the same task. 论文标题 Domain-Specific Term Extraction and Its Application in Text Classification 作者 Tao Liu, Xiao-long Wang, Yi Guan, Zhi-ming Xu, Qiang Wang 发表时间 期刊名称 Proceedings of 8th Joint Conference on Information Sciences (JCIS2005) 期卷 简单介绍 A statistical method is proposed for domain-specific term extraction from domain comparative corpora. It takes the distribution of a candidate word among domains and within a domain into account. Entropy impurity is used to measure the distribution of a word among domains and within a domain. A normalization step is added into the extraction process to cope with unbalanced corpora. So it characterizes the attributes of domain-specific terms more precisely and more effectively than previous term extraction approaches. Domain-specific terms are applied in text classification as the feature space. Experiments show that it achieves better performance than traditional methods for feature selection. 论文标题 蛋白质二级结构预测: 基于词条的最大熵马尔科夫方法 作者 董启文,王晓龙,林磊,关毅,赵健 发表时间 期刊名称 中国科学 C 辑 生命科学 2005 期卷 35 (1): 8 简单介绍 提出了一种新的蛋白质二级结构预测方法. 该方法从氨基酸序列中提取出和自然语言中的“词”类似的与物种相关的蛋白质二级结构词条, 这些词条形成了蛋白质二级结构词典, 该词典描述了氨基酸序列和蛋白质二级结构之间的关系. 预测蛋白质二级结构的过程和自然语言中的分词和词性标注一体化的过程类似. 该方法把词条序列看成是马尔科夫链, 通过Viterbi算法搜索每个词条被标注为某种二级结构类型的最大概率, 其中使用词网格描述分词的结果, 使用最大熵马尔科夫模型计算词条的二级结构概率. 蛋白质二级结构预测的结果是最优的分词所对应的二级结构类型. 在 4 个物种的蛋白质序列上对这种方法进行测试, 并和 PHD 方法进行比较. 试验结果显示, 这种方法的 Q3 准确率比 PHD 方法高 3.9%, SOV 准确率比 PHD 方法高 4.6%. 结合BLAST 搜索的局部相似的序列可以进一步提高预测的准确率.
在 50 个 CASP5 目标蛋白质序列上进行测试的结果是: Q3 准确率为 78.9%, SOV 准确率为 77.1%. 论文标题 Insun05QA on QA Track of TREC 2005 作者 Yuming Zhao, Yi Guan, ZhiMing Xu, Peng Li 发表时间 期刊名称 Proceedings of TREC 2005 期卷 简单介绍 This is the first time that our group has taken part in the QA track. At TREC 2005, the system we developed, Insun05QA, participated in the Main Task, which submitted answers to three types of questions: factoid questions, list questions and others questions. We also submitted the document ranking from which our answers are generated. A new sentence similarity calculation method is used in our Insun05QA system; it can be considered an extension of the vector space model. Our QA system incorporates several useful tools, including WordNet, developed by Princeton University, Minipar, by Dekang Lin, and GATE, developed by the University of Sheffield. Moreover, external knowledge such as knowledge from the Internet is also widely used in our system. Since this is the first time that we have taken part in the QA track and the preparation time was limited, we concentrated on the processing of factoid questions, and the methods we developed to process list and others questions are generated from the method used to process factoid questions.
论文标题 一种基于粗糙集增量式规则学习的问题分类方法研究 作者 李鹏;王晓龙;关毅 发表时间 期刊名称 电子与信息学报 期卷 2008年05期 简单介绍 该文提出一种基于粗糙集增量式规则自动学习来实现问题分类的方法,通过深入提取问句特征并采用决策表形式构建训练语料,利用机器学习的方法自动获取分类规则。与其他方法相比优势在于,用于分类的规则自动生成,并采用粗糙集理论的简约方法获得优化的最小规则集;首次在问题分类中引入增量式学习理念,不但提高了分类精度,而且避免了繁琐的重新训练过程,大大提高了学习速度,并且提高了分类的可扩展性和适应性。对比实验表明,该方法分类精度高,适应性好。在国际TREC2005Q/A实际评测中表现良好。 论文标题 基于统计的网页正文信息抽取方法的研究 作者 孙承杰, 关毅 发表时间 期刊名称 中文信息学报 期卷 2004年第18卷0 简单介绍 为了把自然语言处理技术有效的运用到网页文档中,本文提出了一种依靠统计信息,从中文新闻类网页中抽取正文内容的方法。该方法先根据网页中的HTML 标记把网页表示成一棵树,然后利用树中每个结点包含的中文字符数从中选择包含正文信息的结点。该方法克服了传统的网页内容抽取方法需要针对不同的数据源构造不同的包装器的缺点,具有简单、准确的特点,试验表明该方法的抽取准确率可以达到95 %以上。采用该方法实现的网页文本抽取工具目前为一个面向旅游领域的问答系统提供语料支持,很好的满足了问答系统的需求。 论文标题 面向专业网站的中文问答系统研究 作者 关毅,王晓龙,赵岩,赵健 发表时间 期刊名称 Proceedings of the 20th International Conference on Computer Processing of Oriental Languages 期卷 简单介绍 问答系统是一种大量运用自然语言处理技术的新型信息检索系统,正在成为自然语言处理领域和信息检索领域中的一个引人注目的研究热点。本文在论述了面向专业网站的中文问答系统的几个基本问题:定义、概况、国内外研究现状之后,介绍了哈工大计算机应用教研室开发的问答系统实验平台,提出了以系统相似为基础的问答系统的基本原理,从而把应用于这一特定信息检索技术的各项自然语言处理技术理顺到系统化、理论化的轨道。 论文标题 基于统计的汉语词汇间语义相似度计算 作者 关毅,王晓龙 发表时间 期刊名称 语言计算与基于内容的文本处理——全国第七届计算语言学联合学术会议论文集, 期卷 简单介绍 语义相似是词汇间的纂本关系之一,汉语词汇间语义相似的定量化研究对于信息检索、统计语言模型等自然语言处理的应用技术具有重要的指导意义。本文定义了语义相似度的数学模型,进而描述了基于相关嫡的汉语词汇间语义相似度计算方法。初步实验表明,该方法是一种理论基础严整,实践上行之有效的方法 论文标题 基于短语的汉语N-gram语言模型研究 作者 刘秉权,王晓龙,王轩,关毅 发表时间 期刊名称 863计划智能计算机主题学术会议 期卷 简单介绍 N-gram统计语言模型因其鲁棒性强、简洁、有效等特点成为当前的主流语言建模技术,但其本身存在难以克服的缺点:不能有效处理长距离语言约束;统计信息有时也不能反映真实的语言规律. 论文标题 汉语大词表 N—gram 统计语言模型构造算法 作者 徐志明,王晓龙,关毅 发表时间 期刊名称 计算机应用研究 期卷 1999,16(00 简单介绍 本文提出了汉语大词表的N-gram统计语言模型构造技术,根据信息论的观点,给出了自然语言处理中各种应用中的统计语言建模的统一框架描述,提出了一种汉语大词表的Trigram语言模型构造算法。把构造的Trigram语言模型应用于大词表非特定人孤立词语音识别系统中,系统识别率达到82%。 论文标题 基于转移的音字转换纠错规则获取技术 作者 关毅,王晓龙,张凯 发表时间 期刊名称 计算机研究与发展 期卷 1999,36(00 简单介绍 文中描述了一种在音字转换系统中从规模不限的在线文本中自动获取纠错规则的机器学习技术.该技术从音字转换结果中自动获取误转换结果及其相应的上下文信息,从而生成转移规则集.该转移规则集应用于音字转换的后处理模块,使音字转换系统的转换正确率进一步提高,并使系统具备了很强的灵活性和可扩展性. 
论文标题 基于统计的计算语言模型 作者 关毅,张凯,付国宏 发表时间 期刊名称 计算机应用研究 期卷 1999,16(00 简单介绍 论文标题 语音识别语言理解模型 作者 徐志明,王晓龙,张凯,关毅,孙玉琦 发表时间 期刊名称 第五届全国人机语音通讯学术会议论文集 期卷 简单介绍 本文提出了一种规则与统计相结合的计算语言模型应用于语言识别后端处理的技术,把基于统计的大词表Markov统计模型与语言规则量化模型集成在一个语言理解系统, 讨论了两种计算语言模型的互补性与结合机制 论文标题 基于统计与规则相结合的汉语计算语言模型及其在语音识别中的应用 作者 关毅,王晓龙. 发表时间 期刊名称 高技术通讯 期卷 1998,8(004 简单介绍 把基于统计的语料概率统计方法与基于规则的自然语言理解方法结合起来, 提出了一种新的汉语计算语言模型, 并把该模型应用于语音识别后处理模块中, 取得了较理想的结果 论文标题 现代汉语计算语言模型中语言单位的频度—频级关系 作者 关毅,王晓龙,张凯 发表时间 期刊名称 中文信息学报 期卷 1999年02期 简单介绍 Zipf 定律是一个反映英文单词词频分布情况的普适性统计规律。我们通过实验发现 ,在现代汉语的字、词、二元对等等语言单位上 ,其频度与频级的关系也近似地遵循 Zipf定律 ,说明了 Zipf 定律对于汉语的不同层次的语言单位也是普遍适用的。本文通过实验证实了 Zipf 定律所反映的汉语语言单位频度 —频级关系 ,并进而深入讨论了它对于汉语自然语言处理的各项技术 ,尤其是建立现代汉语基于统计的计算语言模型所具有的重要指导意义。 论文标题 Clinical-decision support based on medical literature: A complex network approach 作者 Jingchi Jiang, Jichuan Zheng, Chao Zhao, Jia Su, Yi Guan, Qiubin Yu 发表时间 期刊名称 Physica A: Statistical Mechanics and its Applications 期卷 Volume 459, 1 October 2016, Pages 42–54 简单介绍 In making clinical decisions, clinicians often review medical literature to ensure the reliability of diagnosis, test, and treatment because the medical literature can answer clinical questions and assist clinicians making clinical decisions. Therefore, finding the appropriate literature is a critical problem for clinical-decision support (CDS). First, the present study employs search engines to retrieve relevant literature about patient records. However, the result of the traditional method is usually unsatisfactory. To improve the relevance of the retrieval result, a medical literature network (MLN) based on these retrieved papers is constructed. Then, we show that this MLN has small-world and scale-free properties of a complex network. According to the structural characteristics of the MLN, we adopt two methods to further identify the potential relevant literature in addition to the retrieved literature. 
By integrating these potential papers into the MLN, a more comprehensive MLN is built to answer the question of actual patient records. Furthermore, we propose a re-ranking model to sort all papers by relevance. We experimentally find that the re-ranking model can improve the normalized discounted cumulative gain of the results. As participants of the Text Retrieval Conference 2015, our clinical-decision method based on the MLN also yields higher scores than the medians in most topics and achieves the best scores for topics: #11 and #12. These research results indicate that our study can be used to effectively assist clinicians in making clinical decisions, and the MLN can facilitate the investigation of CDS. 论文标题 中文电子病历命名实体和实体关系标注体系及语料库构建 作者 杨锦锋,关毅,何彬,曲春燕,于秋滨,刘雅欣,赵永杰 发表时间 期刊名称 软件学报 期卷 简单介绍 电子病历是由医务人员撰写的面向患者个体描述医疗活动的记录, 蕴含了大量的医疗知识和患者的健康信息. 电子病历命名实体识别和实体关系抽取等信息抽取研究对于临床决策支持、循证医学实践和个性化医疗服务等具有重要意义, 而电子病历命名实体和实体关系标注语料库的构建是首当其冲的. 本文在调研了国内外电子病历命名实体和实体关系标注语料库构建的基础上, 结合中文电子病历特点, 提出适合中文电子病历的命名实体和实体关系的标注体系, 在医生的指导和参与下, 制定了命名实体和实体关系的详细标注规范, 构建了标注体系完整、 规模较大且一致性较高的标注语料库. 语料库包含病历文本992 份, 命名实体标注一致性达到0.922, 实体关系一致性达到0.895. 我们的工作为中文电子病历信息抽取后续研究打下了坚实的基础. 论文标题 中文电子病历命名实体语料库构建 作者 曲春燕,关毅,杨锦锋,赵永杰,刘雅欣 发表时间 期刊名称 高技术通讯 期卷 2015, 25(2) 简单介绍 针对中文电子病历命名实体语料标注空白的现状,研究了中文电子病历命名实体标注语料库的构建.参考2010年美国国家集成生物与临床信息学研究中心(I2B2)给出的电子病历命名实体类型及修饰类型的定义,在专业医生的指导下制定了详尽的中文电子病历标注规范;通过对大量中文电子病历的分析,提出了一套完整的中文电子病历命名实体标注方案,而且采用预标注和正式标注的方法,建立了一定规模的中文电子病历命名实体标注语料库,其标注语料的一致性达到了92%以上.该工作对中文电子病历的命名实体识别及信息抽取研究提供了可靠的数据支持,对医疗知识挖掘也有重要意义. 论文标题 CRFs based de-identification of medical records 作者 He B, Guan Y, Cheng J, et al 发表时间 期刊名称 Journal of biomedical informatics 期卷 简单介绍 De-identification is a shared task of the 2014 i2b2/UTHealth challenge. The purpose of this task is to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). 
A preprocessing module, which tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, is introduced, and three groups of features are extracted to train the de-identifier model. The experiment shows that our system is effective in the de-identification of medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level. 出版物 名称 王晓龙 关毅 《计算机自然语言处理》清华大学出版社 2005年 论著成果 名称 1995年,微软拼音输入法(与微软公司合作)主要参加人 1996年,Macintosh用BOPOMOFO智能语句输入法(与日本佳能泰克(佳能公司子公司)公司合作)主要参加人 2000年,Weniwen智能中文搜索引擎 主要参加人 2002年,智能化中文信息处理平台 主要参加人 2003年,Insun_TC文本分类系统 主要负责人 2004年,面向体育、旅游领域的智能中文问答系统InsunTourQA 主要负责人 2005年,ICSU词法分析系统 主要负责人 2005年,InsunQA英文问答系统 主要负责人 2008年,面向博客bbs的中文情感极性分析系统(与富士通中国研发中心合作)第一负责人 2008年,myspace隐式用户兴趣挖掘系统(与myspace公司聚友网合作)第一负责人 2009年,中文浅层句法分析系统(与阿里巴巴公司合作)第一负责人 2010年,面向IOS的中文智能语句输入法WI输入法 第一负责人 2010年,电子病历管理系统(与哈尔滨医科大学第二附属医院合作)第一负责人
