If you've ever run a customer survey, you know the feeling. The multiple-choice dashboard is crisp and ready to share. The open-text column, meanwhile, is sitting right there — hundreds or thousands of responses deep, completely unread. "We should really do something with the free-text answers" has been a standing agenda item since surveys were invented. And three weeks later, you're skim-reading them over coffee, hoping for a pattern to jump out. It usually doesn't.
Generative AI is the first credible shot at actually breaking this bottleneck. But — and this is the honest part — it's not the silver bullet the marketing implies. A 2024 peer-reviewed paper reports Claude hitting 93.9% accuracy, nearly matching human coders. Another 2024 paper finds general-purpose LLMs inadequate without fine-tuning. Both are correct; they tested different things. This piece walks through what text mining and LLM coding each actually buy you, where each one falls apart, and how to pick the combination that fits what you're trying to do.
1. Two Approaches to Open-Ended Analysis
The analysis of open-ended responses splits into two traditions.
Approach 1: Text Mining (word and co-occurrence based)
The classical pipeline: tokenization (morphological analysis) → word frequency → co-occurrence network → sentiment analysis. Strong at quantitative, word-level trend analysis ("what terms appear most?"), weaker at contextual understanding.
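As a concrete (if toy) illustration of that pipeline, here is a minimal Python sketch of the frequency and co-occurrence steps. Whitespace splitting stands in for a real tokenizer or morphological analyzer, and the stopword list is purely illustrative:

```python
from collections import Counter
from itertools import combinations

def mine_responses(responses, stopwords=frozenset({"the", "a", "and", "was", "to"})):
    """Toy text-mining pass: word frequencies plus pairwise co-occurrence counts."""
    word_freq = Counter()
    cooccur = Counter()
    for text in responses:
        # Whitespace tokenization stands in for real morphological analysis
        tokens = {t for t in text.lower().split() if t not in stopwords}
        word_freq.update(tokens)
        # Count each unordered word pair appearing in the same response
        cooccur.update(frozenset(pair) for pair in combinations(sorted(tokens), 2))
    return word_freq, cooccur

responses = [
    "the delivery was slow",
    "slow delivery and poor packaging",
    "packaging was fine",
]
freq, pairs = mine_responses(responses)
print(freq["delivery"])                        # 2
print(pairs[frozenset({"slow", "delivery"})])  # 2
```

A real workflow would plot the top of `freq` and draw `pairs` as a network, but the counting core is genuinely this simple, which is why text mining scales so cheaply.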
Approach 2: LLM Coding (context and meaning based)
Feed each open response to a GPT / Claude / Gemini class model and have it classify against a predefined codebook. Since 2023, academic and industry research has begun characterizing how well this actually works.
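In practice, the central artifact here is the codebook and the prompt built from it. A minimal sketch of how a per-response classification prompt might be assembled; the codebook below is entirely hypothetical and the actual model call is omitted:

```python
# Hypothetical codebook: code -> definition shown to the model
CODEBOOK = {
    "price": "Cost, fees, value for money",
    "usability": "Ease of use, UI, navigation, crashes",
    "support": "Customer service, response time",
    "other_uncertain": "Ambiguous or off-topic responses",
}

def build_coding_prompt(response_text, codebook=CODEBOOK):
    """Assemble one classification prompt per response; send to any LLM API."""
    categories = "\n".join(f"- {code}: {desc}" for code, desc in codebook.items())
    return (
        "Classify the survey response into exactly one category code.\n"
        f"Categories:\n{categories}\n"
        f'Response: "{response_text}"\n'
        "Answer with the category code only."
    )

prompt = build_coding_prompt("The app keeps crashing when I try to pay")
```

Constraining the model to answer with a code from a fixed list (rather than free-form labels) is what makes the output aggregatable afterwards.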
2. What 2024 Research Says About LLM Coding — Accuracy and Limits
A cluster of 2024 peer-reviewed research has evaluated LLM coding performance with concrete, measurable results.
LLMs can approach human accuracy under the right conditions
Mellon et al. (2024), published in Research & Politics, evaluated LLM coding of the "most important issue" open question on a UK social survey. Claude-1.3 reached 93.9% accuracy, nearly matching human coders' 94.7%. With sufficient sample size and a clear coding scheme, LLMs can plausibly reach human-comparable performance.
But results vary considerably by case
Conversely, a 2024 arXiv study analyzing German open-ended responses on survey motivation found that general-purpose LLMs produced inadequate accuracy, and only a fine-tuned model reached satisfactory levels. Language, topic complexity, and abstraction level of the categories all meaningfully shift achievable accuracy.
LLMs have structural weaknesses
A 2024 PMC paper maps out the structural limitations of LLM-based open-response analysis:
- LLMs process each response in isolation — they don't have access to the respondent's other answers, tone, sarcasm, or follow-up context that human coders lean on
- Poor handling of ambiguous responses — responses that human coders would resolve via context get classified semi-randomly by LLMs
- High prompt sensitivity — the same data and the same model can produce materially different results under different prompts
These are repeatedly demonstrated structural limits of LLM coding.
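Prompt sensitivity, in particular, is cheap to measure before trusting any results: code the same responses under two prompt wordings and compare the labels. A minimal sketch, with illustrative labels standing in for real model output:

```python
def prompt_agreement(labels_a, labels_b):
    """Fraction of responses coded identically under two prompt variants."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Codes for the same 6 responses under two prompt wordings (illustrative)
run1 = ["price", "support", "usability", "price", "other", "support"]
run2 = ["price", "support", "price",     "price", "other", "usability"]
print(round(prompt_agreement(run1, run2), 3))  # 0.667
```

If two reasonable prompt wordings agree on only two-thirds of responses, that instability belongs in the write-up, not hidden behind a single run.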
A real-world failure case
A Langer Research white paper reports that a pilot using a leading AI tool on the 2024 Texas Education Poll open-text data produced significant misalignment with human coders, widespread misclassification, and failure to capture tone or directionality. It has become a widely cited cautionary case: commercial AI tools don't all deliver at the level their marketing suggests.
3. Two Tool Archetypes — Text Mining vs LLM-Integrated QDA
Tool choices cluster into two archetypes. Note that vendor materials describe positioning and capabilities, not independently validated benchmarks — they're useful for industry orientation rather than as guarantees of performance.
Archetype 1: Dedicated text-mining tools
Focused on tokenization + co-occurrence + frequency. Popular in parts of the survey research industry for quick trend snapshots. Comparison sites such as Thematic describe a broad universe of tools in this space, though most reporting notes their weakness on long-form and context-dependent interpretation.
Archetype 2: Traditional QDA tools integrating generative AI
Established QDA platforms are adding AI features:
- NVivo (Lumivero) markets its AI Assistant with text summarization, coding suggestions, and sentiment analysis (per their own product materials)
- MAXQDA similarly describes expanding AI support per comparison reviews
- Delve and similar newer entrants lean more heavily on AI-first workflows
These descriptions come from vendor and comparison sites rather than independent benchmarks, but the direction of travel — combining classical text mining with LLM-based capabilities — is a widely shared industry trajectory for 2025.
4. Choosing an Approach in Practice
Taking the academic evidence and the industry positioning together, three axes tend to drive real-world approach selection.
Axis 1: Data volume
- Under 500 responses: LLM coding one-by-one is economically reasonable; take advantage of contextual understanding
- 500 to several thousand: Hybrid — text mining for overall shape, LLM coding for targeted deep dives on interesting clusters
- Tens of thousands+: Text mining for dimensionality reduction, LLM coding on a sampled subset
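The hybrid and sampling rows above amount to a cluster-then-sample step: text mining (or any clustering) assigns each response a cluster, and only a capped sample per cluster goes to the more expensive LLM pass. A minimal Python sketch with illustrative data:

```python
import random

def sample_for_llm_coding(cluster_assignments, per_cluster=50, seed=0):
    """Pick at most `per_cluster` response indices from each text-mining
    cluster for the more expensive LLM coding pass."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_cluster = {}
    for idx, cluster in enumerate(cluster_assignments):
        by_cluster.setdefault(cluster, []).append(idx)
    return {
        cluster: rng.sample(indices, min(per_cluster, len(indices)))
        for cluster, indices in by_cluster.items()
    }

# 450 responses already bucketed by a text-mining pass (illustrative)
clusters = ["pricing"] * 300 + ["shipping"] * 120 + ["misc"] * 30
picked = sample_for_llm_coding(clusters, per_cluster=50)
print({c: len(ix) for c, ix in picked.items()})  # {'pricing': 50, 'shipping': 50, 'misc': 30}
```

With 450 responses this caps the LLM pass at 130 calls while still touching every cluster, which is exactly the economics the tens-of-thousands row is after.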
Axis 2: Purpose
- Market trend monitoring: Text mining often sufficient
- Segment-level issue surfacing (CX use): LLM coding's contextual strength matters
- Quantify and trend over time: Define categories, then code (LLM + human) consistently across waves
- Find a small number of important signals: Human review augmented by LLM
Axis 3: Accuracy requirements
- Directly drives major decisions (exec reporting, product calls): Two-stage LLM + human review
- Directional insight is enough: Text mining alone can work
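One common way to implement the two-stage LLM + human review is to route each model-coded response by a confidence score. The sketch below assumes the pipeline attaches some per-code confidence value; note that an LLM's self-reported confidence is known to be unreliable, so a score calibrated on a hand-coded holdout is preferable, and the 0.8 threshold is purely illustrative:

```python
def route_codes(coded, confidence_floor=0.8):
    """Split model-coded responses into auto-accepted vs queued for human review."""
    accepted, review_queue = [], []
    for item in coded:
        target = accepted if item["confidence"] >= confidence_floor else review_queue
        target.append(item)
    return accepted, review_queue

coded = [
    {"id": 1, "code": "price", "confidence": 0.95},
    {"id": 2, "code": "support", "confidence": 0.55},
    {"id": 3, "code": "usability", "confidence": 0.88},
]
auto, queue = route_codes(coded)
print(len(auto), len(queue))  # 2 1
```

The review queue is where human effort goes for exec-grade outputs; for directional work you might simply report the queue's size as an uncertainty figure.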
Editorial take — what we'd actually ship
After two years of tracking this space through public cases and industry commentary, a handful of patterns have started to feel like the "obviously correct" defaults. The teams that get burned by AI-assisted open-text analysis almost always share one mistake: they tried to automate everything, and only discovered the limits of the approach after the bill arrived. The gap between vendor pitch and field reality is still real in 2026, so it's worth being blunt.
1. Don't skip the two-stage approach on large datasets. Text mining first for shape, then LLM for the clusters that actually matter. Jumping straight to full-LLM coding on tens of thousands of responses is how teams discover, three months in, that they've spent a small fortune on mediocre output that adds little beyond what the two-stage approach would have produced.
2. Don't feed a codebook "by vibes." "The LLM will figure it out" is the fastest way to destroy accuracy. Write out your categories, definitions, examples, and edge cases in prose, before you run anything. If that feels like a lot of work upfront — good. That's the work that was going to happen anyway; you just get to do it once cleanly instead of seven times in rework.
3. Don't skip the sample review. Re-code 5–10% by hand and actually measure agreement. "It looked reasonable when I scrolled through" is not a metric. This is the step that teams shortcut because "AI did it, so it must be fine" — and it's the step that makes or breaks whether you can defend the results in a stakeholder meeting.
4. Let ambiguous responses live in an "Other / Uncertain" bucket. Forcing a noisy response into a clean category just launders the noise into your charts. "100% coded" sounds impressive until you realize 20% of it is wrong. We'd much rather see "80% automated, 20% hand-coded" — that's the shape of an output you can actually trust.
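For the sample review in point 3, "actually measure agreement" usually means a chance-corrected statistic rather than raw percent match. A minimal from-scratch Cohen's kappa sketch with illustrative codes (in practice you would likely reach for sklearn.metrics.cohen_kappa_score instead):

```python
from collections import Counter

def cohens_kappa(human, model):
    """Chance-corrected agreement between human re-codes and model codes."""
    assert len(human) == len(model)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    # Expected agreement if both coders assigned labels independently at
    # their observed marginal rates
    expected = sum(
        (h_counts[label] / n) * (m_counts[label] / n)
        for label in set(human) | set(model)
    )
    return (observed - expected) / (1 - expected)

human = ["price", "price", "support", "support", "price", "support"]
model = ["price", "price", "support", "price",   "price", "support"]
print(round(cohens_kappa(human, model), 3))  # 0.667
```

Raw agreement here is 83%, but kappa is 0.67 because two balanced categories agree often by chance alone; that gap is precisely why "it looked reasonable" is not a metric.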
5. How the Survey Tool Kicue Supports Open-Ended Analysis
Kicue ships with open-ended (OA / FA) question types and an authoring workflow oriented toward reliable field operations:
- OA / FA question types — supports short and long free-text fields (question type reference)
- CSV / Excel export — export in formats ready for external analysis tools (NVivo / MAXQDA / dedicated text-mining platforms)
- Bias-reducing authoring — character-count hints, clear required/optional labeling, UI tuned for higher completion rates
- Fraud detection on free text — detects AI-generated responses pasted into open fields (fraud detection overview)
Upload your questionnaire file and the platform handles open-ended field design, collection, and export end-to-end.
Choosing the right tool — Free plan limits, branching support, AI capabilities, and CSV export vary widely across tools. See our free survey tool comparison to find the right fit for this approach.
Recap
Key decisions when analyzing open-ended responses with AI:
- Two approaches — text mining (word/co-occurrence) and LLM coding (context/meaning) — with different strengths
- LLMs can reach near-human accuracy, but only under conditions — adequate samples, clear codebook, well-designed prompts
- Know the structural limits — isolation, ambiguity, prompt sensitivity
- Commercial AI tools need verification in your context — public failure cases are real; measure before production
- Two-stage analysis + sample review is becoming standard practice
Open-ended data has historically been under-analyzed because of scale. With AI in the toolkit, that's changing — but the winning pattern is knowing each approach's limits and keeping a human check in the loop, not blind automation.
References
Academic & peer-reviewed research
- Mellon, J., Bailey, J., Scott, R., Breckwoldt, J., Miori, M., & Schmedeman, P. (2024). Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale. Research & Politics.
- Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity (2024). PMC.
- AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation (2024). arXiv.
- A Large Language Model Approach to Educational Survey Feedback Analysis (2024). International Journal of AI in Education.
- Large Language Model for Qualitative Research - A Systematic Mapping Study (2024). arXiv.
See how Kicue — a free survey tool designed for modern open-ended workflows — handles the quantitative side so you can invest time in qualitative depth.
