prohibit duplicate key columns #7760

Open
ben-schwen wants to merge 6 commits into master from duplicated_key_columns

Conversation

@ben-schwen
Member

Closes #4888
Closes #4891

@github-actions

github-actions Bot commented May 26, 2026

  • HEAD=duplicated_key_columns slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit 7253f69

Download link for the artifact containing the test results: ↓ atime-results.zip

Task                                    Duration
R setup and installing dependencies     2 minutes and 41 seconds
Installing different package versions   47 seconds
Running and plotting the test cases     5 minutes and 23 seconds

@MichaelChirico
Member

Gemini identified some remaining ways for ambiguity to creep in:

dt = data.table(a=1:2, a.1=3:4, val=10:11)
dt[, .(a.1, sum(val)), keyby=.(a, a)]
# Key: <a, a.1>
#        a   a.1   a.1    V2
#    <int> <int> <int> <int>
# 1:     1     1     1    10
# 2:     2     2     2    11

dt = data.table(a=1:2, b=3:4, key="a")
dt[, .(a, a)]
# Key: <a>
#        a     a
#    <int> <int>
# 1:     1     1
# 2:     2     2
subset(dt, select=c(a, a))
# Key: <a>
#        a     a
#    <int> <int>
# 1:     1     1
# 2:     2     2

@codecov

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.04%. Comparing base (d4974e9) to head (7253f69).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7760   +/-   ##
=======================================
  Coverage   99.04%   99.04%           
=======================================
  Files          87       87           
  Lines       17064    17087   +23     
=======================================
+ Hits        16901    16924   +23     
  Misses        163      163           

@ben-schwen
Member Author

I've added these cases, but I'm sure we will encounter special versions of them again.

@MichaelChirico
Member

Yep, sharing because those look pretty easy to encounter in practice. Remaining ones will be more baroque. A fresh session finds nothing, so I think this is good now.

Comment thread inst/tests/tests.Rraw
id1 = sample(letters, 10) # reduced from 20 to 10
id2 = id1
date = 1:10 # and 40 to 10 to save ram, #5517
dt = setkey(data.table(CJ(date, id1, id2)), NULL)
Member

This test is somewhat confusing; is CJ(..., sorted=FALSE) not enough?

Comment thread R/data.table.R
if (verbose) {cat(timetaken(last.started.at),"\n"); flush.console()}
} else if (.by_result_is_keyable(x, keyby, bysameorder, byjoin, allbyvars, bysub)) {
setattr(ans, "sorted", names(ans)[seq_along(grpcols)])
if (!any(names(ans)[seq_along(grpcols)] %chin% duplicated_values(names(ans))))
Member

Save names(ans)[seq_along(grpcols)] to a variable.
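
A minimal sketch of what that could look like, written in base R so it runs without data.table internals. The names `ans_names`, `n_grp`, `by_names`, and `dup_names` are hypothetical stand-ins for `names(ans)`, `length(grpcols)`, the saved variable, and the result of `duplicated_values(names(ans))`; `%in%` stands in for `%chin%`. The values mirror the `keyby=.(a, a)` example earlier in this thread, and the assumption is that the key should only be set when no grouping column name is duplicated.

```r
# Hypothetical stand-ins for names(ans) and length(grpcols) in the PR code;
# values taken from the keyby=.(a, a) example in this thread.
ans_names = c("a", "a.1", "a.1", "V2")
n_grp = 2L

# Compute the grouping-column names once and reuse them below.
by_names = ans_names[seq_len(n_grp)]

# Base-R approximation of duplicated_values(): names that occur more than once.
dup_names = unique(ans_names[duplicated(ans_names)])

# Assumption: the result is only keyable when no grouping column name is duplicated.
keyable = !any(by_names %in% dup_names)
if (keyable) {
  message("would set key: ", toString(by_names))
} else {
  message("skipping key, ambiguous column(s): ", toString(intersect(by_names, dup_names)))
}
```

In the actual patch the same variable would presumably feed both the `%chin%` check and the `setattr(ans, "sorted", ...)` call, so the subsetting expression appears only once.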

Development

Successfully merging this pull request may close these issues.

  • Inconsistent behavior in keyed/unkeyed joins against duplicate columns
  • keys are wrong/don't update if column names aren't unique

2 participants