How do parents use AI?
If you spend any time in parenting communities on Reddit or Facebook, you'll see parents talking about using ChatGPT. Some are asking it to write bedtime stories for their kids (myself included). Others are looking for advice on sleep training or potty training. Sam Altman has even said that he can't imagine parenting without ChatGPT. But I was curious what this actually looks like beyond anecdotes in online communities. How are parents really using AI tools? And what can we learn about the kinds of help that parents are looking for?
To answer these questions, I turned to WildChat-1M, a dataset of 1 million ChatGPT conversations collected with user consent. I built a classification pipeline to identify and categorize conversations where parents are seeking help with parenting-related topics. Out of 838,000 conversations, I found 117 that were clearly parenting-related.
Findings
The 117 parenting conversations fell into three broad categories: writing help, advice, and knowledge. A more detailed summary of individual conversations is available here.
I was a bit surprised by the distribution. I had expected the "advice" category to dominate, but with "writing help" comprising almost half of the conversations, it's clear that parents are primarily using ChatGPT as a productivity tool.
Reading through the "writing help" conversations, pretty much all of them seemed like good use cases for AI: tasks that would be more difficult or more time-consuming for a human to do alone. I've personally used ChatGPT to write stories and songs for my kids, so it wasn't surprising that this came up; it's an obvious case where an LLM would be better than (most) humans. I've luckily never had my insurance company deny coverage for a healthcare treatment for my kids, but if I did, I would probably use ChatGPT to help draft the appeal. I'm also not Jewish, but it doesn't seem too surprising that parents would use ChatGPT to help with bar/bat mitzvah speeches. My favorite conversation from this section was a user asking ChatGPT:
"Write a small note to my son on the Christmas day eve pretending it is written by Santa. Mention the following shortly and make it warm and loving message: study well, Dont be stubborn , Listen to mommy and daddy, love your grand parents and be kind yo them. When they talk, talk to them nicely. Dont hit Chochi because she loves you alot. Say he is a loving, caring smart and a brilliant boy, he has so many good qualities and loved by all, very helpful and friendly personality. Wish him all the very best and become the best ever"
I found the conversations in the "advice" category a bit more interesting. I wasn't necessarily surprised that parents were relying on AI to help them parent, but reading through these conversations did make me a bit uncomfortable seeing how influential AI is in people's lives. Most conversations were fairly low-stakes: users asking ChatGPT how to potty train, users asking for advice on their children's college or preschool applications, tips on breastfeeding, etc. There were quite a few cases where the parent had already come to a decision and was looking to ChatGPT to justify it, trusting ChatGPT as an authority figure. In one example, a parent had seemingly already decided to give their baby a pescetarian diet and asked ChatGPT to provide studies proving this was safe. ChatGPT found research studies showing that a vegetarian diet can provide sufficient nutrition for a baby, and extrapolated that a pescetarian diet should as well.
But many of these "advice" conversations were users asking ChatGPT to provide authoritative input on a specific situation. In one conversation, a mom described how she had handled a situation in which she treated her son and daughter unequally and asked, "Does this make me a bad mom?". ChatGPT refused to answer the question directly, though it did provide some questions for her to ask herself to determine whether she was treating her children fairly. In another conversation, a parent described a contentious co-parenting relationship in which they and their partner each had children from a previous relationship, and asked ChatGPT for advice on how to navigate the dynamics and whether it was even worth staying in the relationship. ChatGPT didn't directly answer the question of "is it worth it to stay in this relationship given the difficulties", but, as in the previous example, provided the user with some questions to ask themselves to help make the decision.
For the cases where users are seeking affirmation of a decision they've already made, it's probably best for AI models not to take the user's apparent preference into account when generating a response. For the cases where users are seeking judgment on a specific, high-stakes situation, I'm not sure what the best behavior is, but responding with questions for the user to reflect on introspectively seems like a reasonable approach.
The "knowledge" category was the smallest. These conversations probably have the most overlap with what parents could also use Google for. Some were pretty straightforward, like asking ChatGPT to define a slang term their kids were using or how to set up parental controls on their child's device. Others, especially the medical questions, were more complex: for example, giving ChatGPT a list of symptoms and asking for a psychiatric diagnosis, or asking ChatGPT to interpret an ultrasound report. Of course, people have used Google and WebMD to answer these questions in the past, but WebMD won't directly tell you, "yes, your child has a psychiatric disorder". These conversations are a good example of why AI models need to handle medical questions carefully. In all of the medical conversations I reviewed, ChatGPT avoided making a definitive diagnosis or suggesting a course of action, and mostly provided information neutrally while asking users to defer to a medical professional.
Overall, ChatGPT seemed to handle the more sensitive conversations a bit better than I would have expected. OpenAI presumably has some sort of classifier that identifies sensitive topics and has trained its models to err on the side of being "neutral and not harmful" rather than reasoning about a situation and logically arriving at an assessment. That's probably the right tradeoff considering the consequences of a model opining incorrectly.
Methodology
The pipeline for identifying parenting conversations had six stages:
1. Dataset ingestion
We pulled the full WildChat-1M dataset from HuggingFace, which contains roughly 838,000 ChatGPT conversations collected with user consent, with PII removed or anonymized.
2. Keyword pre-filter
A Python script scanned all 838,000 conversations for parenting-related regex patterns in user messages (e.g., "my son", "my daughter", "breastfeeding", "potty training"). Only 2,468 conversations (0.29%) matched and were forwarded to the classifier. This cut API costs by a factor of more than 300, but likely came at the cost of filtering out some genuine parenting conversations that didn't contain those specific keywords.
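The pre-filter can be sketched as a single compiled regex run over each conversation's user messages. The pattern list and conversation schema below are illustrative assumptions, not the original script:

```python
import re

# Illustrative subset of parenting-related patterns; the full keyword
# list used in the actual pipeline isn't reproduced here.
PARENTING_PATTERNS = re.compile(
    r"\b(my (son|daughter|toddler|baby)|breastfeeding|potty.?training)\b",
    re.IGNORECASE,
)

def is_candidate(conversation: list[dict]) -> bool:
    """Return True if any *user* message matches a parenting keyword."""
    return any(
        msg["role"] == "user" and PARENTING_PATTERNS.search(msg["content"])
        for msg in conversation
    )

# Tiny demo: only the first conversation survives the pre-filter.
conversations = [
    [{"role": "user", "content": "How do I start potty training my son?"}],
    [{"role": "user", "content": "Write a poem about the ocean."}],
]
candidates = [conv for conv in conversations if is_candidate(conv)]
```

Matching only user messages avoids flagging conversations where the keywords appear solely in ChatGPT's replies.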
3. LLM classification: parenting or not?
Each of the 2,468 conversations was classified by Claude Haiku via the Anthropic API to determine: is this a genuine parenting conversation, or fiction/roleplay/etc? A Python script ran a system prompt defined in a config file against each conversation. 171 conversations were classified as genuine parenting. The vast majority of false positives were creative writing (720), fiction/roleplay (527), and students seeking help with homework (228).
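A rough sketch of what this step might look like with the Anthropic Python SDK. The system prompt, label set, and JSON output format below are my own illustrative stand-ins for the prompt kept in the config file, and the model id is an assumption:

```python
import json

# Illustrative stand-in for the system prompt defined in the config file.
SYSTEM_PROMPT = (
    "Classify this ChatGPT conversation. Respond with JSON only: "
    '{"label": "parenting" | "creative_writing" | "fiction_roleplay" | '
    '"homework" | "other"}'
)

def parse_label(raw: str) -> str:
    """Extract the label from the model's JSON response; fall back to 'other'."""
    try:
        return json.loads(raw).get("label", "other")
    except (json.JSONDecodeError, AttributeError):
        return "other"

def classify(conversation_text: str) -> str:
    import anthropic  # deferred import so the sketch runs without the SDK installed
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # assumed model id
        max_tokens=50,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": conversation_text}],
    )
    return parse_label(response.content[0].text)
```

Parsing the label in a separate function keeps the fallback behavior easy to test without making an API call.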
4. LLM classification: parenting subcategory
Conversations identified as parenting-related were further subcategorized using a separate system prompt defined in the same config file, again using Claude Haiku via the API.
5. Manual curation
All 171 conversations were then reviewed by hand. 38 were removed as false positives: conversations that contained parenting keywords and that the classifier misidentified as parenting, but that were not actually about parenting (e.g., teachers writing emails, grammar-correction exercises, creative writing about parenting). Manual curation also moved some conversations from one subcategory to another, and I created a few new categories based on trends I saw (e.g., "generate me a birthday message for my child", "write me a bar mitzvah speech").
6. Same-user deduplication
Using the hashed_ip field in WildChat, we identified cases where the same user had multiple near-identical conversations (e.g., one user had 10 conversations asking ChatGPT to explain 10 different slang terms for bowel-related incidents at school). These were consolidated into single entries with notes referencing the other conversation hashes, reducing the final count from 133 to 117.
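The deduplication step might look something like the sketch below, which groups conversations by `hashed_ip` and consolidates those whose opening user message is nearly identical. The similarity heuristic (`difflib` ratio on the first user message) and the record fields are assumptions for illustration:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def first_user_message(conv: dict) -> str:
    return next(m["content"] for m in conv["conversation"] if m["role"] == "user")

def dedupe(convs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep one entry per group of near-identical conversations from the
    same hashed_ip, recording the hashes of the consolidated duplicates."""
    by_ip = defaultdict(list)
    for conv in convs:
        by_ip[conv["hashed_ip"]].append(conv)

    kept = []
    for group in by_ip.values():
        reps = []  # representatives already kept for this user
        for conv in group:
            msg = first_user_message(conv)
            for rep in reps:
                similarity = SequenceMatcher(
                    None, msg, first_user_message(rep)
                ).ratio()
                if similarity >= threshold:
                    # Fold this conversation into the existing representative.
                    rep.setdefault("duplicate_hashes", []).append(
                        conv["conversation_hash"]
                    )
                    break
            else:
                reps.append(conv)
        kept.extend(reps)
    return kept

# Tiny demo: the two slang questions from the same user collapse into one entry.
deduped = dedupe([
    {"hashed_ip": "a", "conversation_hash": "h1",
     "conversation": [{"role": "user", "content": "Explain the slang term 'code brown'"}]},
    {"hashed_ip": "a", "conversation_hash": "h2",
     "conversation": [{"role": "user", "content": "Explain the slang term 'code yellow'"}]},
    {"hashed_ip": "b", "conversation_hash": "h3",
     "conversation": [{"role": "user", "content": "How do I set up parental controls?"}]},
])
```

Comparing only within each `hashed_ip` group keeps the pairwise similarity checks cheap, since cross-user comparisons are never needed.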
Additional thoughts
I took a few shortcuts since I did this mostly just out of curiosity. This analysis identified only 117 parenting conversations in a dataset of 838,000. The true number is almost certainly higher, and the results may be systematically missing certain types of parenting conversations. If I had infinite time and money, then I would:
- Use a larger dataset of conversations. The WildChat-4.8M dataset has about 3.2M conversations, and is more recent than the WildChat-1M dataset.
- Skip the prefilter entirely and run the classifier against all conversations. This would bring in non-English conversations as well as English conversations that don't explicitly contain the keywords we filtered for. It cost me about $30 to run the classifier on 2,468 conversations, so running the pipeline on all 838,000 would cost roughly $10,000, and even more against a 3.2M-conversation dataset.
- Re-review the system prompts and see if there are any improvements that can be made. I had refined these a few times, but there's probably further improvements that can be made. At the very least, the subcategories could be better defined since I ended up creating a few subcategories during the manual review stage.
- Use a more capable model than Claude Haiku. Opus is probably overkill, but Sonnet would probably be better. Comparing Haiku and Sonnet to see which classifies more accurately would be fairly time-consuming: I'd probably have to establish a benchmark and manually rate each model's classification accuracy.
The code for this project is available on GitHub.