I am Ruixuan Tu (zh_CN: 涂 睿轩, ja: トゥ・ルイシュェン), a senior at the University of Wisconsin–Madison majoring in Computer Sciences (Honors), Mathematics (Honors), Data Science, Statistics, and Japanese. I expect to graduate in May 2025 and am actively seeking MS/PhD (5-year) program opportunities in natural language processing (NLP) and large language models (LLMs).

I conduct research under the guidance of Prof. Forrest Sheng Bao (Department of Computer Science, Iowa State University, and Head of Machine Learning at Vectara), Prof. Ramya Korlakai Vinayak (Departments of Electrical and Computer Engineering, Computer Sciences, and Statistics, UW–Madison), and Prof. Junjie Hu (Departments of Biostatistics and Medical Informatics and Computer Sciences, UW–Madison). Previously, I also conducted research with Prof. Jerry Zhu (Department of Computer Sciences, UW–Madison).

Research Interests

Human-like LLMs: LLMs have made great strides since 2018 and gained popularity among the public, but given problems such as hallucination, bias, and factual inaccuracy, they do not always perform the way humans do. To align LLMs with human expectations and behaviors, I work on projects such as:

Multilingual NLP and Computational Linguistics (Japanese NLP): With Japanese as one of my majors, I connect my NLP knowledge with coursework in Japanese linguistics and Classical Japanese. In WakaGPT, I applied cross-lingual transfer learning from Modern Japanese to Classical Japanese, and for an analysis of morpheme origins in Japanese literature, I applied tools from computational linguistics. From a computational sociolinguistics perspective, I also analyzed role language (yakuwarigo) in Japanese media (games and anime) using clustering methods.

Publications

Peer-Reviewed

  1. Is Semantic Chunking Worth the Computational Cost?
    Renyi Qu, Ruixuan Tu, Forrest Sheng Bao
    Findings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
    [arXiv] [PDF]

  Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the actual benefit over simpler fixed-size chunking, where documents are split into consecutive, fixed-size segments, remains unclear. This study systematically evaluates the effectiveness of semantic chunking using three common retrieval-related tasks: document retrieval, evidence retrieval, and retrieval-based answer generation. The results show that the computational costs associated with semantic chunking are not justified by consistent performance gains. These findings challenge previous assumptions about semantic chunking and highlight the need for more efficient chunking strategies in RAG systems.
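For context, the fixed-size baseline the paper compares against can be sketched as follows (a minimal illustration; the 128-token chunk size and string tokens are arbitrary assumptions, not the paper's experimental settings):

```python
def fixed_size_chunks(tokens, size=128, overlap=0):
    """Split a token sequence into consecutive fixed-size segments."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"tok{i}" for i in range(300)]
chunks = fixed_size_chunks(tokens)
# 300 tokens split into segments of 128, 128, and 44 tokens
```

Semantic chunking would instead place boundaries where the similarity between adjacent sentences drops; the paper's finding is that this extra computation does not consistently pay off.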

  1. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs
    Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad
    Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies
    [arXiv] [PDF] [GitHub Repo]

  Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries and of hallucination detection models both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground-truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, even the best hallucination detection models achieve accuracies of only around 50% on FaithBench, indicating substantial room for future improvement.

  1. DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics
    Forrest Sheng Bao*, Ruixuan Tu*, Ge Luo, Yinfei Yang, Hebi Li, Minghui Qiu, Youbiao He, and Cen Chen
    Findings of the Association for Computational Linguistics: EMNLP 2023
    (Presented the paper orally and the poster in person at the 4th NewSumm Workshop as co-first author)
    [ACL Anthology] [PDF] [Poster] [GitHub Repo]

  Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. Once repurposed as a reference-free metric, zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
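The core idea can be sketched with a toy token-overlap F1 standing in for BERTScore (the real metric uses contextual embeddings; this sketch only illustrates the reference-based versus repurposed reference-free call pattern, and the texts are made up):

```python
def overlap_f1(candidate, anchor):
    """Toy stand-in for a pairwise summary metric: token-overlap F1."""
    cand, anch = set(candidate.split()), set(anchor.split())
    common = len(cand & anch)
    if common == 0:
        return 0.0
    p, r = common / len(cand), common / len(anch)
    return 2 * p * r / (p + r)

source = "the cat sat on the mat near the door"
reference = "a cat sat on a mat"
summary = "the cat sat on the mat"

ref_based = overlap_f1(summary, reference)  # original: score against a human reference
ref_free = overlap_f1(summary, source)      # repurposed: score against the source document
```

The repurposing changes only which text the summary is compared against, not the comparison methodology itself.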

  1. Funix - The laziest way to build GUI apps in Python
    Forrest Sheng Bao, Mike Qi, Ruixuan Tu, Erana Wan
    Proceedings of the Python in Science Conference 2024
    [SciPy Proceedings] [PDF] [GitHub Repo]

  The rise of machine learning (ML) and artificial intelligence (AI), especially generative AI (GenAI), has increased the need for wrapping models or algorithms into GUI apps. For example, a large language model (LLM) can be accessed through a string-to-string GUI app with a textbox as the primary input. Most existing solutions require developers to manually create widgets and link them to the arguments/returns of a function individually. This low-level process is laborious and usually intrusive. Funix automatically selects widgets based on the types of the arguments and returns of a function, according to the type-to-widget mapping defined in a theme, e.g., bool to a checkbox. Consequently, an existing Python function can be turned into a GUI app without any code changes. As a transcompiler, Funix allows type-to-widget mappings to be defined between any Python type and any React component and its props, opening the frontend world to Python developers without their needing to know JavaScript/TypeScript. Funix further leverages features in Python and its ecosystem to build apps in a more Pythonic, intuitive, and effortless manner. With Funix, a developer can make it (a functional app) before they (competitors) fake it (in Figma or on a napkin).

Keywords: type hints, docstrings, transcompiler, frontend development
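The type-to-widget dispatch described above can be illustrated in plain Python (a conceptual sketch using a hypothetical theme dictionary and helper, not Funix's actual API):

```python
import inspect

# Hypothetical theme: maps Python types to widget names.
THEME = {bool: "checkbox", int: "slider", str: "textbox", float: "slider"}

def widgets_for(func):
    """Pick a widget for each argument based on its type hint."""
    sig = inspect.signature(func)
    return {name: THEME.get(p.annotation, "textbox")
            for name, p in sig.parameters.items()}

def greet(name: str, shout: bool, times: int) -> str:
    return ("HELLO " if shout else "hello ") + name * times

widgets = widgets_for(greet)
# {'name': 'textbox', 'shout': 'checkbox', 'times': 'slider'}
```

Because the mapping is driven entirely by type hints, the decorated function itself needs no GUI-specific changes, which is the property the abstract highlights.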

  1. A Review in the Core Technologies of 5G: Device-to-Device Communication, Multi-Access Edge Computing and Network Function Virtualization
    Ruixuan Tu*, Ruxun Xiang*, Yang Xu, Yihan Mei
    International Journal of Communications, Network and System Sciences, 2019
    [SCIRP] [PDF]

  5G is a new generation of mobile networking that aims to achieve unparalleled speed and performance. To accomplish this, three technologies have become a significant part of 5G, and this paper mainly discusses them: Device-to-Device communication (D2D), multi-access edge computing (MEC), and network function virtualization (NFV) with ClickOS. D2D enables direct communication between devices without relaying through a base station. In 5G, a two-tier cellular network composed of a traditional cellular network system and D2D is an efficient way to realize high-speed communication. MEC offloads work from end devices and cloud platforms to widespread nodes, and connects the nodes with outside devices and third-party providers, in order to diminish the overload on any single device caused by numerous applications and to improve users’ quality of experience (QoE). NFV is also employed to fulfill 5G requirements: an optimized virtual machine for middleboxes named ClickOS is introduced and evaluated in several aspects, and several middleboxes implemented on ClickOS are shown to have outstanding performance.

Preprints

None at the moment

Course Projects

  1. WakaGPT: Classical Japanese Poem Generator
    Ruixuan Tu
    Full-mark final paper for STAT 453 (Deep Learning) @ UW–Madison, Spring 2024
    [PDF] [Slide]

  Waka is a traditional Japanese poem usually composed in a fixed mora-sequence format. However, generating waka is challenging for general-purpose LLMs like GPT-4 due to the scarcity of data in Classical Japanese and in this form of poetry, as well as the strict format restrictions. In this paper, we present WakaGPT, a waka composer built on Japanese GPT-2 and the base models it is fine-tuned from. Through self-supervised and semi-supervised training, we are able to generate waka poems with correct grammar and format.
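The format constraint a waka generator must satisfy can be sketched with a minimal mora counter (an illustration, not WakaGPT's implementation; the small-kana handling is simplified and ignores rarer edge cases):

```python
# Small glide kana merge with the preceding kana and do not add a mora.
SMALL_KANA = set("ゃゅょャュョぁぃぅぇぉァィゥェォ")

def mora_count(line):
    """Count morae in a kana string."""
    return sum(1 for ch in line if ch not in SMALL_KANA)

def is_waka(lines):
    """Check the 5-7-5-7-7 mora pattern of a tanka-style waka."""
    return [mora_count(l) for l in lines] == [5, 7, 5, 7, 7]

# A classic poem by Empress Jitō (Hyakunin Isshu no. 2), in kana.
poem = ["はるすぎて", "なつきにけらし", "しろたへの",
        "ころもほすてふ", "あまのかぐやま"]
```

A generator can use such a check either to filter sampled candidates or as a supervision signal during training.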

  1. Analysis of Post-Meiji Word Origins in Japanese Literature: An approach in computational linguistics
    Ruixuan Tu
    A-mark final paper for ASIAN 434 (Japanese Linguistics) @ UW–Madison, Fall 2023
    [PDF] [Slide]

  We analyzed the distribution of morpheme origins on the Aozora Bunko dataset across all morphemes, parts of speech, and origins. For the analysis, we used the morphological analyzers MeCab and Juman++ (Kyoto University), and, based on UniDic data, we fine-tuned DeBERTa-v2-base-Japanese to classify morpheme origins into three categories: native, Sino-Japanese (SJ), and mixed. The hypothesis was that the Japanese government advocated the use of SJ and native words before and during WWII and that Western culture became more popular after WWII; however, the analysis even shows some preference toward native words, contradicting the hypothesis.

  1. Cluster Analysis of Role Languages in Visual Novel Game AIR
    Ruixuan Tu
    A-mark final paper for ASIAN 358 (Japanese Sociolinguistics) @ UW–Madison, Fall 2024
    [PDF] [Slide]

  Through our analysis of the visual novel game AIR, most keywords (特徴語) extracted by our method can be recognized as yakuwarigo (role language) representing characteristics of specific individuals or groups, though the reverse does not hold: not all yakuwarigo appear among the extracted keywords. Our method revealed clusters of non-female language, casual female language, formal and polite female language, and dialectal language. We also found that different groups of script authors may affect the extracted keywords.

Method: We apply agglomerative hierarchical clustering (Ward's method with Euclidean distance) to word-frequency vectors for every speaker, and then extract significant keywords with CoS (coefficient of specialization) > 2.
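Assuming CoS is the ratio of a word's relative frequency in one speaker's lines to its relative frequency in the whole script (a common definition of specialization, stated here as an assumption), the keyword-extraction step can be sketched as follows, with made-up toy data:

```python
from collections import Counter

def keywords_by_cos(speaker_tokens, all_tokens, threshold=2.0):
    """Keep words whose speaker-relative frequency exceeds
    threshold times their corpus-relative frequency."""
    sp, corpus = Counter(speaker_tokens), Counter(all_tokens)
    n_sp, n_all = len(speaker_tokens), len(all_tokens)
    return {w for w in sp
            if (sp[w] / n_sp) / (corpus[w] / n_all) > threshold}

speaker = ["わい", "や", "ねん", "や"]           # one speaker's tokens
others = ["です", "ます", "です", "や",
          "です", "ます", "です", "ます"]        # everyone else's tokens
keywords = keywords_by_cos(speaker, speaker + others)
# dialect-marking words specific to this speaker survive the threshold
```

Clustering the per-speaker frequency vectors then groups speakers whose surviving keywords pattern together, e.g., into the dialectal and female-language clusters described above.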

  1. Optimizing Bike-Sharing Systems: A Machine Learning Approach to Predict Station Imbalances
    Ruixuan Tu, Larissa Xia, Steven Haworth, Jackson Wegner
    1st Most Creative or Interesting Project and 2nd Best Visualizations for STAT 451 (Machine Learning) @ UW–Madison, Summer 2024
    [PDF] [Slide]

  This study analyzes Divvy bike station data, trip data, and American Community Survey data to predict bike-station flow imbalances (overflow/underflow). The key questions are: How can demographic data and machine learning predict bike availability? Is the status of existing stations a reliable indicator for nearby stations? Using logistic regression, decision trees, and SVMs for demographic data and kNN for geographic data, with recursive feature elimination and grid search with cross-validation, the SVM was the most effective. The status of existing Divvy stations reliably predicts the status of nearby stations.
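The geographic kNN idea, predicting a station's status from its nearest neighbors, can be sketched in a few lines (the coordinates and labels below are invented for illustration; the project itself used real Divvy data and tuned hyperparameters):

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Majority vote over the k stations nearest to the query location."""
    nearest = sorted(train, key=lambda s: dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (latitude, longitude) -> status training points.
stations = [((41.880, -87.630), "overflow"), ((41.890, -87.620), "overflow"),
            ((41.885, -87.625), "overflow"), ((41.790, -87.600), "underflow"),
            ((41.780, -87.610), "underflow")]
pred = knn_predict(stations, (41.882, -87.628))
```

The finding that nearby stations are reliable predictors is exactly what makes this purely geographic vote competitive with the demographic-feature models.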

Work Experience

Textea Inc
Software Development Engineer Intern (May 2022 – September 2022)

University of Wisconsin–Madison

Projects

KDE Connect (an Apple Continuity-like experience) (November 2018 – present)

Professional Memberships

Honors and Awards
