If you've ever run a customer survey, you know the feeling. The multiple-choice dashboard is crisp and ready to share. The open-text column, meanwhile, is sitting right there — hundreds or thousands of responses deep, completely unread. "We should really do something with the free-text answers" has been a standing agenda item since surveys were invented. And three weeks later, you're skim-reading them over coffee, hoping for a pattern to jump out. It usually doesn't.
Generative AI is the first credible shot at actually breaking this bottleneck. But — and this is the honest part — it's not the silver bullet the marketing implies. A 2024 peer-reviewed paper reports Claude hitting 93.9% accuracy, nearly matching human coders. Another 2024 paper finds general-purpose LLMs inadequate without fine-tuning. Both are correct; they tested different things. This piece walks through what text mining and LLM coding each actually buy you, where each one falls apart, and how to pick the combination that fits what you're trying to do.
1. Two Approaches to Open-Ended Analysis
The analysis of open-ended responses splits into two traditions.
Approach 1: Text Mining (word and co-occurrence based)
The classical pipeline: tokenization (morphological analysis) → word frequency → co-occurrence network → sentiment analysis. Strong at quantitative, word-level trend analysis ("what terms appear most?"), weaker at contextual understanding.
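As a concrete (if toy) illustration of that pipeline, here is a minimal Python sketch of the frequency and co-occurrence steps. Whitespace splitting stands in for a real tokenizer or morphological analyzer, and the stopword list is purely illustrative:

```python
from collections import Counter
from itertools import combinations

def mine_responses(responses, stopwords=frozenset({"the", "a", "and", "was", "to"})):
    """Toy text-mining pass: word frequencies plus pairwise co-occurrence counts."""
    word_freq = Counter()
    cooccur = Counter()
    for text in responses:
        # Whitespace tokenization stands in for real morphological analysis
        tokens = {t for t in text.lower().split() if t not in stopwords}
        word_freq.update(tokens)
        # Count each unordered word pair appearing in the same response
        cooccur.update(frozenset(pair) for pair in combinations(sorted(tokens), 2))
    return word_freq, cooccur

responses = [
    "the delivery was slow",
    "slow delivery and poor packaging",
    "packaging was fine",
]
freq, pairs = mine_responses(responses)
print(freq["delivery"])                        # 2
print(pairs[frozenset({"slow", "delivery"})])  # 2
```

A real workflow would plot the top of `freq` and draw `pairs` as a network, but the counting core is genuinely this simple, which is why text mining scales so cheaply.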
Approach 2: LLM Coding (context and meaning based)
Feed each open response to a GPT / Claude / Gemini class model and have it classify against a predefined codebook. Since 2023, academic and industry research has begun characterizing how well this actually works.
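In practice, the central artifact here is the codebook and the prompt built from it. A minimal sketch of how a per-response classification prompt might be assembled; the codebook below is entirely hypothetical and the actual model call is omitted:

```python
# Hypothetical codebook: code -> definition shown to the model
CODEBOOK = {
    "price": "Cost, fees, value for money",
    "usability": "Ease of use, UI, navigation, crashes",
    "support": "Customer service, response time",
    "other_uncertain": "Ambiguous or off-topic responses",
}

def build_coding_prompt(response_text, codebook=CODEBOOK):
    """Assemble one classification prompt per response; send to any LLM API."""
    categories = "\n".join(f"- {code}: {desc}" for code, desc in codebook.items())
    return (
        "Classify the survey response into exactly one category code.\n"
        f"Categories:\n{categories}\n"
        f'Response: "{response_text}"\n'
        "Answer with the category code only."
    )

prompt = build_coding_prompt("The app keeps crashing when I try to pay")
```

Constraining the model to answer with a code from a fixed list (rather than free-form labels) is what makes the output aggregatable afterwards.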
2. What 2024 Research Says About LLM Coding — Accuracy and Limits
A cluster of 2024 peer-reviewed research has evaluated LLM coding performance with concrete, measurable results.
LLMs can approach human accuracy under the right conditions
Mellon et al. (2024), published in Research & Politics, evaluated LLM coding of the "most important issue" open question on a UK social survey. Claude-1.3 reached 93.9% accuracy, nearly matching human coders' 94.7%. With sufficient sample size and a clear coding scheme, LLMs can plausibly reach human-comparable performance.
But results vary considerably by case
Conversely, a 2024 arXiv study analyzing German open-ended responses on survey motivation found that general-purpose LLMs produced inadequate accuracy, and only a fine-tuned model reached satisfactory levels. Language, topic complexity, and abstraction level of the categories all meaningfully shift achievable accuracy.
LLMs have structural weaknesses
A 2024 PMC paper maps out the structural limitations of LLM-based open-response analysis:
- LLMs process each response in isolation — they don't have access to the respondent's other answers, tone, sarcasm, or follow-up context that human coders lean on
- Poor handling of ambiguous responses — responses that human coders would resolve via context get classified semi-randomly by LLMs
- High prompt sensitivity — the same data and the same model can produce materially different results under different prompts
These are repeatedly demonstrated structural limits of LLM coding.
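Prompt sensitivity, in particular, is cheap to measure before trusting any results: code the same responses under two prompt wordings and compare the labels. A minimal sketch, with illustrative labels standing in for real model output:

```python
def prompt_agreement(labels_a, labels_b):
    """Fraction of responses coded identically under two prompt variants."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Codes for the same 6 responses under two prompt wordings (illustrative)
run1 = ["price", "support", "usability", "price", "other", "support"]
run2 = ["price", "support", "price",     "price", "other", "usability"]
print(round(prompt_agreement(run1, run2), 3))  # 0.667
```

If two reasonable prompt wordings agree on only two-thirds of responses, that instability belongs in the write-up, not hidden behind a single run.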
A real-world failure case
A Langer Research white paper reports that a pilot using a leading AI tool on the 2024 Texas Education Poll open-text data produced significant misalignment with human coders, widespread misclassification, and failure to capture tone or directionality. It has become a widely cited cautionary case: commercial AI tools don't all deliver at the level their marketing suggests.
3. Two Tool Archetypes — Text Mining vs LLM-Integrated QDA
Tool choices cluster into two archetypes. Note that vendor materials describe positioning and capabilities, not independently validated benchmarks — they're useful for industry orientation rather than as guarantees of performance.
Archetype 1: Dedicated text-mining tools
Focused on tokenization + co-occurrence + frequency. Popular in parts of the survey research industry for quick trend snapshots. Comparison sites such as Thematic describe a broad universe of tools in this space, though most reporting notes their weakness on long-form and context-dependent interpretation.
Archetype 2: Traditional QDA tools integrating generative AI
Established QDA platforms are adding AI features:
- NVivo (Lumivero) markets its AI Assistant with text summarization, coding suggestions, and sentiment analysis (per their own product materials)
- MAXQDA similarly describes expanding AI support per comparison reviews
- Delve and similar newer entrants lean more heavily on AI-first workflows
These descriptions come from vendor and comparison sites rather than independent benchmarks, but the direction of travel — combining classical text mining with LLM-based capabilities — is a widely shared industry trajectory for 2025.
4. Choosing an Approach in Practice
Taking the academic evidence and the industry positioning together, three axes tend to drive real-world approach selection.
Axis 1: Data volume
- Under 500 responses: LLM coding one-by-one is economically reasonable; take advantage of contextual understanding
- 500 to several thousand: Hybrid — text mining for overall shape, LLM coding for targeted deep dives on interesting clusters
- Tens of thousands+: Text mining for dimensionality reduction, LLM coding on a sampled subset
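The hybrid and sampling rows above amount to a cluster-then-sample step: text mining (or any clustering) assigns each response a cluster, and only a capped sample per cluster goes to the more expensive LLM pass. A minimal Python sketch with illustrative data:

```python
import random

def sample_for_llm_coding(cluster_assignments, per_cluster=50, seed=0):
    """Pick at most `per_cluster` response indices from each text-mining
    cluster for the more expensive LLM coding pass."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_cluster = {}
    for idx, cluster in enumerate(cluster_assignments):
        by_cluster.setdefault(cluster, []).append(idx)
    return {
        cluster: rng.sample(indices, min(per_cluster, len(indices)))
        for cluster, indices in by_cluster.items()
    }

# 450 responses already bucketed by a text-mining pass (illustrative)
clusters = ["pricing"] * 300 + ["shipping"] * 120 + ["misc"] * 30
picked = sample_for_llm_coding(clusters, per_cluster=50)
print({c: len(ix) for c, ix in picked.items()})  # {'pricing': 50, 'shipping': 50, 'misc': 30}
```

With 450 responses this caps the LLM pass at 130 calls while still touching every cluster, which is exactly the economics the tens-of-thousands row is after.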
Axis 2: Purpose
- Market trend monitoring: Text mining often sufficient
- Segment-level issue surfacing (CX use): LLM coding's contextual strength matters
- Quantify and trend over time: Define categories, then code (LLM + human) consistently across waves
- Find a small number of important signals: Human review augmented by LLM
Axis 3: Accuracy requirements
- Directly drives major decisions (exec reporting, product calls): Two-stage LLM + human review
- Directional insight is enough: Text mining alone can work
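One common way to implement the two-stage LLM + human review is to route each model-coded response by a confidence score. The sketch below assumes the pipeline attaches some per-code confidence value; note that an LLM's self-reported confidence is known to be unreliable, so a score calibrated on a hand-coded holdout is preferable, and the 0.8 threshold is purely illustrative:

```python
def route_codes(coded, confidence_floor=0.8):
    """Split model-coded responses into auto-accepted vs queued for human review."""
    accepted, review_queue = [], []
    for item in coded:
        target = accepted if item["confidence"] >= confidence_floor else review_queue
        target.append(item)
    return accepted, review_queue

coded = [
    {"id": 1, "code": "price", "confidence": 0.95},
    {"id": 2, "code": "support", "confidence": 0.55},
    {"id": 3, "code": "usability", "confidence": 0.88},
]
auto, queue = route_codes(coded)
print(len(auto), len(queue))  # 2 1
```

The review queue is where human effort goes for exec-grade outputs; for directional work you might simply report the queue's size as an uncertainty figure.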
Editorial take — what we'd actually ship
After two years of tracking this space through public cases and industry commentary, a handful of patterns have started to feel like the "obviously correct" defaults. The teams that get burned by AI-assisted open-text analysis almost always share one mistake: they tried to automate everything, and only discovered the limits of the approach after the bill arrived. The gap between vendor pitch and field reality is still real in 2026, so it's worth being blunt.
1. Don't skip the two-stage approach on large datasets. Text mining first for shape, then LLM for the clusters that actually matter. Jumping straight to full-LLM coding on tens of thousands of responses is how teams discover, three months in, that they've spent a small fortune on mediocre output that adds little beyond what the two-stage approach would have produced.
2. Don't feed a codebook "by vibes." "The LLM will figure it out" is the fastest way to destroy accuracy. Write out your categories, definitions, examples, and edge cases in prose, before you run anything. If that feels like a lot of work upfront — good. That's the work that was going to happen anyway; you just get to do it once cleanly instead of seven times in rework.
3. Don't skip the sample review. Re-code 5–10% by hand and actually measure agreement. "It looked reasonable when I scrolled through" is not a metric. This is the step that teams shortcut because "AI did it, so it must be fine" — and it's the step that makes or breaks whether you can defend the results in a stakeholder meeting.
4. Let ambiguous responses live in an "Other / Uncertain" bucket. Forcing a noisy response into a clean category just launders the noise into your charts. "100% coded" sounds impressive until you realize 20% of it is wrong. We'd much rather see "80% automated, 20% hand-coded" — that's the shape of an output you can actually trust.
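For the sample review in point 3, "actually measure agreement" usually means a chance-corrected statistic rather than raw percent match. A minimal from-scratch Cohen's kappa sketch with illustrative codes (in practice you would likely reach for sklearn.metrics.cohen_kappa_score instead):

```python
from collections import Counter

def cohens_kappa(human, model):
    """Chance-corrected agreement between human re-codes and model codes."""
    assert len(human) == len(model)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    # Expected agreement if both coders assigned labels independently at
    # their observed marginal rates
    expected = sum(
        (h_counts[label] / n) * (m_counts[label] / n)
        for label in set(human) | set(model)
    )
    return (observed - expected) / (1 - expected)

human = ["price", "price", "support", "support", "price", "support"]
model = ["price", "price", "support", "price",   "price", "support"]
print(round(cohens_kappa(human, model), 3))  # 0.667
```

Raw agreement here is 83%, but kappa is 0.67 because two balanced categories agree often by chance alone; that gap is precisely why "it looked reasonable" is not a metric.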
5. How the Survey Tool Kicue Supports Open-Ended Analysis
Kicue ships with open-ended (OA / FA) question types and an authoring workflow oriented toward reliable field operations:
- OA / FA question types — supports short and long free-text fields (question type reference)
- CSV / Excel export — export in formats ready for external analysis tools (NVivo / MAXQDA / dedicated text-mining platforms)
- Bias-reducing authoring — character-count hints, clear required/optional labeling, UI tuned for higher completion rates
- Fraud detection on free text — detects AI-generated responses pasted into open fields (fraud detection overview)
Upload your questionnaire file and the platform handles open-ended field design, collection, and export end-to-end.
Choosing the right tool — Free plan limits, branching support, AI capabilities, and CSV export vary widely across tools. See our free survey tool comparison to find the right fit for this approach.
Recap
Key decisions when analyzing open-ended responses with AI:
- Two approaches — text mining (word/co-occurrence) and LLM coding (context/meaning) — with different strengths
- LLMs can reach near-human accuracy, but only under conditions — adequate samples, clear codebook, well-designed prompts
- Know the structural limits — isolation, ambiguity, prompt sensitivity
- Commercial AI tools need verification in your context — public failure cases are real; measure before production
- Two-stage analysis + sample review is becoming standard practice
Open-ended data has historically been under-analyzed because of scale. With AI in the toolkit, that's changing — but the winning pattern is knowing each approach's limits and keeping a human check in the loop, not blind automation.
References
Academic & peer-reviewed research
- Mellon, J., Bailey, J., Scott, R., Breckwoldt, J., Miori, M., & Schmedeman, P. (2024). Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale. Research & Politics.
- Framework-based qualitative analysis of free responses of Large Language Models: Algorithmic fidelity (2024). PMC.
- AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation (2024). arXiv.
- A Large Language Model Approach to Educational Survey Feedback Analysis (2024). International Journal of AI in Education.
- Large Language Model for Qualitative Research - A Systematic Mapping Study (2024). arXiv.
See how Kicue — a free survey tool designed for modern open-ended workflows — handles the quantitative side so you can invest time in qualitative depth.
