
Arisyn – A Novel NL2SQL Technology Distinct from Large Language Models

Updated: Dec 15

I. Wide Application Scenarios of Arisyn

Background Recap

As mentioned in previous articles, "Arisyn aims to achieve automatic data association in the field of data integration." It addresses the challenge of automatic association for relational data across multiple tables.

Is this problem widely applicable, or is it a pseudo-problem with no practical demand? Let's explore:

01 Relational Data Remains One of the Most Critical Data Assets

While emerging technologies like large language models (LLMs) and big data platforms can process diverse data types (e.g., documents, images, audio, and video) – as seen in multimodal generative AI (e.g., text-to-video, voice interaction) – their outputs are often open-ended, subjective, and sometimes suffer from "hallucinations." Thus, they are suitable for reference or auxiliary work but inadequate for scenarios requiring strict precision.

In industries such as banking, securities, finance, transportation, manufacturing, and energy, core business data must be managed as structured relational data to ensure accuracy, consistency, and compliance.

02 Data Construction Is Inherently Decentralized

(1) Database Normalization Requirements: Relational database design paradigms mandate reasonable data splitting to avoid excessive redundancy. Redundant data not only duplicates collection efforts but also undermines data consistency. Additionally, data items from different business processes, collected by different users at different times, cannot be effectively maintained in a single table. Thus, data is inherently organized by objects or business activities and stored across multiple tables.

(2) Multi-System Data Sources: Informatization is an iterative process, with systems implemented sequentially. Even within a single system, modules may be deployed in phases. Furthermore, different application scenarios require distinct technical selections (e.g., business data, real-time data, and logs are managed via different technologies), leading to inherently decentralized data sources.

03 Integration Is the Most Effective Means to Unlock Data Value

Data integration is essential for deriving business insights. Examples include:

  • Integrating production and planning data to track plan completion;

  • Integrating production and sales data to identify overstock or unmet order demands;

  • Integrating production and financial data to analyze costs and profitability.

In summary, relational data integration will remain a critical application scenario for the foreseeable future. As long as this need exists, Arisyn will maintain broad adaptability.

II. Comparison Between Arisyn and LLM-Based Data Integration Methods

T2SQL (Text to SQL) and NL2SQL (Natural Language to SQL) refer to technologies that automatically generate data queries from text or natural language inputs. In essence, they are the same concept – converting semantic understanding into data operations via AI – and represent an active research direction in AI-driven data applications. Recent advancements in LLM technology have revitalized this field. Through research on technical reports from Alibaba and Tencent, and hands-on testing of open-source projects such as DB-GPT, we observe that these LLM-based solutions share similar underlying logical frameworks. Arisyn, however, adopts a fundamentally different approach.

Below is a comparative analysis of their implementation methods (focusing on practical application rather than low-level technical details):

1. LLM-Based Automatic Data Query: Requires Data Training

Consider a set of tables (T1, T2, ..., Tn) with varying numbers of columns (C1, C2, ..., Cn). Without context, raw data is meaningless. For example:

| C1     | C2 | C3 | C4 | C5 | C6 |
|--------|----|----|----|----|----|
| Orange | 5  | 3  | 3  | 2  | 1  |

This data becomes usable only when columns are defined. Two possible interpretations are:

Interpretation 1: Warehouse Inventory Data

| Fruit Type | Warehouse No. | Shelf No. | Stock Quantity | Shelf Life | Keeper ID |
|------------|---------------|-----------|----------------|------------|-----------|
| Orange     | 5             | 3         | 3              | 2          | 1         |

Interpretation 2: Hotel Information Data

| Hotel Name | Popularity Rank | Star Rating | Operating Years | Remaining Rooms | Discount Available |
|------------|-----------------|-------------|-----------------|-----------------|--------------------|
| Orange     | 5               | 3           | 3               | 2               | 1 (Yes)            |

This example illustrates that semantic understanding of tables and columns is a prerequisite for data application. LLM-based NL2SQL relies on training data to achieve this understanding.

Take the Spider dataset – a benchmark for multi-database, multi-table, single-turn T2SQL tasks proposed by Yale University in 2018. It includes 10,181 natural language questions, 5,693 SQL statements, and covers 200+ databases across 138 domains (7,000 training questions, 1,034 development questions, 2,147 test questions). The training logic is simplified as follows:

  • Training Sample:

    Question 1: How many red lipsticks are in stock?

    Answer 1: SELECT amount FROM warehouse WHERE good_name='lipstick' AND color='red';

  • Test Query:

    Question: How many blue lipsticks are in stock?

    Output: SELECT amount FROM warehouse WHERE good_name='lipstick' AND color='blue';
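The train-then-generalize pattern above can be caricatured in a few lines of code. This is a deliberately naive, hypothetical sketch – a single hand-written template standing in for what an LLM learns from thousands of question/SQL pairs – not a description of any real system.

```python
import re

# Toy stand-in for a "trained" NL2SQL mapping: one question pattern
# paired with one SQL template. The slot captured from the question
# (the color) is reused in the generated SQL, mimicking how a model
# generalizes from "red" in training to "blue" at test time.
TEMPLATE = (
    r"How many (\w+) lipsticks are in stock\?",
    "SELECT amount FROM warehouse WHERE good_name='lipstick' AND color='{0}';",
)

def nl2sql(question: str):
    """Return generated SQL, or None if the question falls outside the
    'training distribution' covered by the template."""
    pattern, sql_template = TEMPLATE
    match = re.fullmatch(pattern, question)
    return sql_template.format(match.group(1)) if match else None

nl2sql("How many blue lipsticks are in stock?")
# → "SELECT amount FROM warehouse WHERE good_name='lipstick' AND color='blue';"
nl2sql("Which keeper manages shelf 3?")  # → None: unseen question shape
```

The `None` branch mirrors the adaptability limitation discussed below: anything not covered by the training data cannot be answered without preparing and training on new samples.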

LLM-based NL2SQL emphasizes deriving SQL from semantic and contextual understanding after training. However, this approach faces significant practical limitations:

  1. High Preparatory Cost: Sufficient training data is required to convert natural language to data operations.

  2. Poor Adaptability to New Data: The trained model cannot understand or use newly added data resources (i.e., new tables/columns, not new records).

  3. Insufficient Accuracy for Critical Applications: Even with training and optimization on known datasets, accuracy typically ranges from 80% to 90%, limiting its use to auxiliary roles.

In conclusion, LLM-based NL2SQL is suitable for scenarios with fixed data content and application methods. Its strengths lie in natural language understanding and generative content, not in data integration itself.

2. Arisyn’s Data Integration Method

Arisyn requires no training data. Instead, it generates inter-table relationships using an inter-table relationship analysis model, which infers associations based on data feature values rather than semantic understanding of tables or columns.

Let’s illustrate with two example tables:

Tab_1: Student Information

| Name      | Student_ID | CLASS   | Age | Sex    |
|-----------|------------|---------|-----|--------|
| Zhang San | 2021_0001  | 2021_01 | 19  | Male   |
| Li Si     | 2021_0002  | 2021_01 | 18  | Female |
| Wang Wu   | 2021_0003  | 2021_01 | 19  | Male   |

Tab_2: Exam Scores

| XH (Student ID) | KC (Course) | CJ (Score) | PM (Rank) |
|-----------------|-------------|------------|-----------|
| 2021_0001       | Math        | 135        | 18        |
| 2021_0001       | Chinese     | 110        | 23        |
| 2021_0002       | Math        | 120        | 25        |
| 2021_0002       | Chinese     | 125        | 10        |

Arisyn identifies that Tab_1.Student_ID and Tab_2.XH share matching data feature values and establishes the association condition Tab_1.Student_ID = Tab_2.XH.
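A minimal sketch of this idea – proposing a join condition purely from overlapping column values, with no semantic understanding of table or column names – might look like the following. The Jaccard-overlap scoring, the 0.5 threshold, and the function names are illustrative assumptions, not Arisyn's actual analysis model.

```python
def value_overlap(a, b):
    """Jaccard similarity between two columns' distinct value sets."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def infer_joins(t1, t2, threshold=0.5):
    """Return candidate join conditions (col1, col2, score) between two
    tables given as {column_name: [values]} dicts, best match first."""
    candidates = []
    for c1, v1 in t1.items():
        for c2, v2 in t2.items():
            score = value_overlap(v1, v2)
            if score >= threshold:
                candidates.append((c1, c2, score))
    return sorted(candidates, key=lambda c: -c[2])

# The two example tables from above, keyed by column name.
tab_1 = {
    "Name": ["Zhang San", "Li Si", "Wang Wu"],
    "Student_ID": ["2021_0001", "2021_0002", "2021_0003"],
    "CLASS": ["2021_01", "2021_01", "2021_01"],
}
tab_2 = {
    "XH": ["2021_0001", "2021_0001", "2021_0002", "2021_0002"],
    "KC": ["Math", "Chinese", "Math", "Chinese"],
    "CJ": [135, 110, 120, 125],
}

joins = infer_joins(tab_1, tab_2)
# Only Student_ID and XH overlap strongly, yielding the join condition
# Tab_1.Student_ID = Tab_2.XH.
```

Note that neither column name carries over to the other table ("Student_ID" vs. "XH"); the association is recovered from the values alone, which is what makes the approach independent of training data.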

In practice, this analysis involves complex considerations. Arisyn uses an optimized analytical framework with an in-memory database of replicated data feature values to generate accurate inter-table relationships. A detailed technical breakdown will be provided in a subsequent article.

Key Differences Between Arisyn and LLM-Based NL2SQL

| Aspect                    | Arisyn                                                 | LLM-Based NL2SQL                                           |
|---------------------------|--------------------------------------------------------|------------------------------------------------------------|
| Training Data Requirement | None – infers relationships via data feature analysis  | Requires large-scale labeled training data (questions + SQL) |
| Core Focus                | Data integration (generating inter-table join conditions) | Natural language understanding + generative SQL          |
| Adaptability to New Data  | Seamlessly supports new tables/columns                 | Cannot use new data without retraining                     |
| Accuracy                  | Theoretically 100% (assuming high data quality)        | 80%–90% even after optimization                            |
| Application Scope         | Broad (scales with more integrable data)               | Narrow (limited to fixed data/application scenarios)       |

III. Potential for Integration of Arisyn and LLM Technology

LLMs excel at semantic understanding and generative content, while Arisyn offers advantages in low upfront effort and high accuracy for data association analysis. Currently, Arisyn requires users to explicitly specify target tables/columns and implement data usage logic (e.g., summation, counting).

An ideal integration would leverage:

  1. LLMs to interpret user natural language requests and convert them into target tables/columns;

  2. Arisyn to generate the required integrated dataset using the identified tables/columns;

  3. LLMs to present results in user-friendly formats (e.g., reports, charts, documents).
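The three-stage pipeline above can be sketched as follows. Every function here is a hypothetical stub – real code would call an LLM API in stages 1 and 3 and the Arisyn engine in stage 2 – so this only shows how the pieces would compose, not any real interface.

```python
# Hypothetical stubs for the three stages; all names and data are
# illustrative placeholders, not real Arisyn or LLM APIs.

def llm_extract_targets(question: str) -> dict:
    # Stage 1 (LLM): interpret the request into target tables/columns.
    return {"tables": ["production", "sales"], "columns": ["amount"]}

def arisyn_integrate(targets: dict) -> list:
    # Stage 2 (Arisyn): generate inter-table join conditions and return
    # the integrated dataset for the identified tables/columns.
    return [{"amount": 10}, {"amount": 7}]

def llm_render(dataset: list) -> str:
    # Stage 3 (LLM): present the result in a user-friendly form.
    total = sum(row["amount"] for row in dataset)
    return f"Total amount: {total}"

def answer_question(question: str) -> str:
    """Compose the three stages into a single query pipeline."""
    return llm_render(arisyn_integrate(llm_extract_targets(question)))

answer_question("How much did we produce and sell?")
```

The design point is the clean hand-off between stages: the LLM never has to generate SQL, and Arisyn never has to understand natural language.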

This synergy combines the strengths of both technologies, delivering a seamless, accurate, and scalable data integration and query solution.

 
 
 
