Wikipedia:Bots/Requests for approval/ZackBot 7
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Zackmann08 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 20:49, Thursday, December 15, 2016 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): Ruby
Source code available: User:ZackBot/single_cleanup
Function overview: Remove deprecated {{{Certification}}} param from {{Infobox single}}
Links to relevant discussions (where appropriate): RfC: Should the "Certification" field be removed from Infobox single?, Template-protected edit request on 15 December 2016 & Use of ZackBot for Infobox single
Edit period(s): One time run
Estimated number of pages affected: All pages in Category:Pages using Infobox single with deprecated certification parameter (0)
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): yes
Function details: Goes through each transclusion that has been added to the category and removes the offending line of code.
Discussion
There is an interest in standardizing and streamlining music templates. In recent discussions here and here, the {{{Certification}}} parameter was brought up by several editors. An RfC was opened and the consensus (unanimous) was to remove Certifications. Since the parameter appears on 3,500+ pages, it seems like a good task for a bot. ZackBot has performed similar tasks and a bot would greatly reduce the time to implement the change. —Ojorojo (talk) 23:32, 15 December 2016 (UTC)[reply]
- @Zackmann08: Hey! So a few concerns: Has anyone given thought that the certifications may only be listed in the infobox, and hence we are losing meaningful content with this bot? I think this is fine if everyone is on board with it, but I would link back to this BRFA in your edit summary for easy reference. I know in my request to be a bot approver I said I wouldn't do code reviews, but I do have some concerns with the regex. I see we're using /\|\s*Certification\s*=.*\n/ to capture the certification parameter and value, replacing it with a blank string. What if the next parameter was on the same line [1], which is valid syntax? Then that parameter and value would also get removed. Or what if the certification parameter was lowercased [2]? I can help if you need :) I will also try to look over the infobox regex — MusikAnimal talk 07:02, 16 December 2016 (UTC)[reply]
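The two concerns raised above can be reproduced with a short Ruby sketch. The infobox snippets are made up for illustration and are not taken from any real article; only the pattern itself comes from the discussion.

```ruby
# The pattern quoted in the discussion: it removes everything from
# "|Certification=" to the end of that line.
LINE_REGEX = /\|\s*Certification\s*=.*\n/

# Hypothetical infobox snippets, invented for this illustration.
one_per_line = "{{Infobox single\n| Name = Example\n| Certification = Gold\n| Label = Acme\n}}\n"
same_line    = "{{Infobox single\n| Name = Example\n| Certification = Gold | Label = Acme\n}}\n"
lowercase    = "{{Infobox single\n| certification = Gold\n}}\n"

cleaned_one_per_line = one_per_line.gsub(LINE_REGEX, "")
cleaned_same_line    = same_line.gsub(LINE_REGEX, "")
cleaned_lowercase    = lowercase.gsub(LINE_REGEX, "")

puts cleaned_one_per_line # the Label parameter survives
puts cleaned_same_line    # the Label parameter is removed along with Certification
puts cleaned_lowercase    # unchanged, because the pattern is case sensitive
```

With one parameter per line the pattern behaves as intended. When a second parameter shares the line, it is deleted too, and a lowercase parameter name is not matched at all.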
- For Certifications, singles articles usually follow the guideline for albums. WP:ALBUM/CERT indicates placement within the article, tables to use, etc. ({{Infobox album}} does not have a field for Certifications). A review of FA single articles and some GAs shows that Certifications are included in the body of the article, even if also included in the infobox. For 3,500+ articles, some will probably not follow this practice. —Ojorojo (talk) 16:02, 16 December 2016 (UTC)[reply]
- @MusikAnimal: the first point I do not have a good solution for... You are 100% correct and I'd love your assistance with that. The second one however there are 2 things to do. First I can easily change it to /\|\s*[Cc]ertification\s*=.*\n/ but it is worth noting that {{{certification}}} is not a valid parameter and is not tracked by the category anyway. --Zackmann08 (Talk to me/What I been doing) 16:17, 16 December 2016 (UTC)[reply]
- @Ojorojo: Would it help to automate detection of the certifications in the body of the article, and if present have the bot skip that page to allow for manual review? Or are we OK with risking the loss of this information? @Zackmann08: One solution is to use grouping to match what we want, and you can use gsub in conjunction. I think this is a well-rounded example. Note the /i to make the regex case insensitive. So the code would be something like updated_text.gsub!(/(\n?\s*\|\s*Certification\s*=.*?)(?:\n|\|)/i). The .*? makes the selection a non-greedy match, otherwise it may capture everything up until the end of a similar pattern [3] (the likelihood of a duplicate certification parameter is slim, but still). You'll almost always want to do non-greedy matching. Finally, (?:\n|\|) means go until a new line or another pipe character is found, but don't include it as a group. Rubular is an excellent tool where we can build test cases to ensure all scenarios are covered. After this task (and task 4) I bet you'll have some re-usable regex for future tasks. Allow me to go over the infobox regex too, as a safeguard, before we begin a trial. In the meantime feel free to do a limited "dry run" as I explained in Task 4. This will help identify issues before any edits are made. Thanks for making your code open source! — MusikAnimal talk 18:23, 16 December 2016 (UTC)[reply]
- Amendment: Notice the first capture in [4] stops at |Platinum]], since there is a pipe in the wikilink! I've created a much more complicated solution here. This also preserves the template structure, removing any extraneous new lines. I can explain it over IRC, if you'd like. Parsing wikitext is obviously not fun :/ But the important thing is to try to handle all possible scenarios, and again Rubular is your best friend for this :) — MusikAnimal talk 18:54, 16 December 2016 (UTC)[reply]
- If it can be done without too much problem, your first suggestion (skip and manual review) would be preferable. The different ways that Certifications may appear in the body (prose vs tables vs links, etc.) would have to be dealt with. —Ojorojo (talk) 18:36, 16 December 2016 (UTC)[reply]
- Indeed, this could be quite tricky. Some templates may not contain the wording that we want to look for, but do on the rendered page. I brought it up as a thought, but if there is no overwhelming concern I think it's OK to omit this check — MusikAnimal talk 18:54, 16 December 2016 (UTC)[reply]
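As an illustration of the grouped, non-greedy pattern discussed earlier in this thread, the sketch below shows what its first capture group picks up in three cases, including the piped wikilink case noted in the amendment. The pattern is the one quoted in the discussion; the sample infoboxes are hypothetical.

```ruby
# Pattern quoted in the discussion: case insensitive, non-greedy value,
# terminated by a newline or the next pipe character.
PATTERN = /(\n?\s*\|\s*Certification\s*=.*?)(?:\n|\|)/i

# Hypothetical infobox snippets, invented for this illustration.
plain     = "{{Infobox single\n| Name = Example\n| certification = Gold\n| Label = Acme\n}}"
same_line = "{{Infobox single\n| Name = Example\n| Certification = Gold | Label = Acme\n}}"
piped     = "{{Infobox single\n| Name = Example\n| Certification = [[Music recording certification|Platinum]]\n| Label = Acme\n}}"

# String#[] with a regex and a group index returns that capture group.
plain_capture     = plain[PATTERN, 1]
same_line_capture = same_line[PATTERN, 1]
piped_capture     = piped[PATTERN, 1]

p plain_capture     # whole parameter, lowercase name included
p same_line_capture # stops before the next parameter on the same line
p piped_capture     # stops too early, at the pipe inside the wikilink
```

The first two cases address the earlier concerns. The third shows why a pipe inside a wikilink needs extra handling: the capture ends inside the link, so removing it would leave "Platinum]]" behind.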
@MusikAnimal: you freaking ROCK!!!! I'm swamped at my day job right now so I won't have a chance to really look at this until this afternoon... Let me ping you on IRC then and we can get a good solution. I love your tips on the regex... I'm still learning some of the complex features of regular expressions and really appreciate your help. @Ojorojo: no need to worry about the word "Certification" appearing elsewhere in the body. My code specifically ONLY looks inside of the infobox for the word. --Zackmann08 (Talk to me/What I been doing) 19:10, 16 December 2016 (UTC)[reply]
- Alright so after a nice little chat with MusikAnimal on WP:IRC we hashed out a good plan. There are some really complex edge cases here and we aren't too sure how many pages have these edge cases. Since we have the lovely tracking category, we are going to get the easy cases out of the way first. If we are left with only a few dozen after that... Well hell we can just do those by hand! The current REGEX for testing is shown here in demo form with examples. --Zackmann08 (Talk to me/What I been doing) 23:43, 16 December 2016 (UTC)[reply]
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. But first let's look over that regex again... I think you need to capture the whole line, or else you're removing the value and leaving the "Certification" parameter. Instead consider something like this. Before you start the trial, I would personally do a "dry run" over a few pages, as described in task 4, but that is up to you. Please also implement at least a 3 second delay (sleep 3), so we aren't unnecessarily going at full speed with potential issues. With this the trial will only take around two and a half minutes — MusikAnimal talk 21:15, 17 December 2016 (UTC)[reply]
- Trial complete, with no issues found. --Zackmann08 (Talk to me/What I been doing) 22:11, 17 December 2016 (UTC)[reply]
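A trial of this shape (an edit cap, a delay between edits, and an optional dry run) could be sketched in Ruby roughly as follows. This is not ZackBot's actual code: run_trial, its keyword arguments, and the sample pages are hypothetical, and saving is left to a block supplied by the caller so that no particular MediaWiki client API is assumed.

```ruby
# The simple line-based pattern from earlier in the discussion.
CERT_LINE = /\|\s*Certification\s*=.*\n/

# Hypothetical helper: `pages` maps page titles to wikitext. In a dry run
# nothing is saved; otherwise the caller's block performs the save.
def run_trial(pages, max_edits:, delay:, dry_run: true)
  edited = []
  pages.each do |title, text|
    break if edited.size >= max_edits   # stop once the trial cap is reached
    updated = text.gsub(CERT_LINE, "")
    next if updated == text             # nothing to remove on this page
    if dry_run
      puts "[dry run] would edit #{title}"
    else
      yield title, updated              # caller performs the real save
    end
    edited << title
    sleep delay                         # throttle between edits
  end
  edited
end

# Hypothetical sample data, invented for this illustration.
sample_pages = {
  "Song A" => "{{Infobox single\n| Name = A\n| Certification = Gold\n}}\n",
  "Song B" => "{{Infobox single\n| Name = B\n}}\n",
  "Song C" => "{{Infobox single\n| Name = C\n| Certification = Platinum\n}}\n"
}

# A delay of 0 keeps the example fast; the trial above asked for 3 seconds.
would_edit = run_trial(sample_pages, max_edits: 50, delay: 0)
p would_edit
```

With a 3 second delay, 50 edits take about 150 seconds, which matches the estimate of roughly two and a half minutes.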
- Approved for extended trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Looks good. Let's do a full run with the limited regex. You can speed it up this time, but leave at least a one second delay. Going to consider this run another "trial", because if there are a significant number of pages leftover, we'll want to do another trial for the new regex — MusikAnimal talk 23:15, 17 December 2016 (UTC)[reply]
- Trial complete. Took care of about 1000 pages or so. --Zackmann08 (Talk to me/What I been doing) 05:53, 18 December 2016 (UTC)[reply]
- @MusikAnimal: have a next iteration I would like to test out. See it on rubular. Just a reminder that we really don't care about the match groups because gsub replaces the whole line. I could make those non-capturing groups but that regex is already confusing enough to read... I tried to test out a number of cases pulled from actual pages and haven't found any false positives yet. This does NOT address templates in the Certification field, but does handle piped links. --Zackmann08 (Talk to me/What I been doing) 21:42, 18 December 2016 (UTC)[reply]
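The point about match groups can be checked with a small Ruby sketch: String#gsub replaces the entire match, so capturing and non-capturing groups give the same result when the replacement is an empty string. The two patterns and the sample text are simplified stand-ins, not the regex tested on Rubular.

```ruby
# Simplified stand-in patterns: same match, with and without capture groups.
with_groups    = /(\|\s*)(Certification)(\s*=.*\n)/
without_groups = /(?:\|\s*)(?:Certification)(?:\s*=.*\n)/

# Hypothetical parameter lines, invented for this illustration.
text = "| Name = Example\n| Certification = Gold\n| Label = Acme\n"

result_with_groups    = text.gsub(with_groups, "")
result_without_groups = text.gsub(without_groups, "")

puts result_with_groups
puts result_with_groups == result_without_groups
```

Groups only matter when the replacement refers back to them (for example with \1) or when the captures are read afterwards.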
- Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Looks good! This also doesn't handle the case where the certification is the last parameter of the template, but that's OK. Up to you if you want to bundle that in with this, or handle those afterwards or by hand. Let's do a 50 edit trial with the new regex before letting it loose. Thanks for your patience with this! — MusikAnimal talk 22:37, 18 December 2016 (UTC)[reply]
- Trial complete. @MusikAnimal: Oh please. Thank YOU for YOUR patience and assistance!! Really do appreciate it. :-) Found one small bug in the first batch of ten (see Rubular). Got it fixed and the remaining 40 went flawlessly. Ready to turn it loose with your approval. --Zackmann08 (Talk to me/What I been doing) 01:29, 19 December 2016 (UTC)[reply]
- Approved for extended trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Full run with the new regex. Hopefully there won't be too many left over, but frankly I suspect there will be. When I quickly glanced over the category, it seems including a template in the certification field is common, unsurprisingly. Some usage is complex, such as Better the Devil You Know, and the presence of templates to me also suggests more valuable and possibly sourced information (e.g. {{cite web}}) may be lost that is not repeated in the body of the article — MusikAnimal talk 06:26, 19 December 2016 (UTC)[reply]
- @MusikAnimal: hey we are getting there, bit by bit. :-) Thank you once again for your assistance! Script is running. --Zackmann08 (Talk to me/What I been doing) 17:48, 19 December 2016 (UTC)[reply]
@Zackmann08: Is the trial complete? Did you notice any errors? Before we go to the next step I want to propose something. Pinging Ojorojo as well. We're down to just pages that have some sort of template in the Certification field of the infobox. Reviewing them, I found some like Barrette (song) that have the certifications only listed in the infobox, and the information is sourced. As I mentioned before, this means the bot would be removing sourced content that does not exist anywhere else in the article. So my suggestion to Zackmann08 for the next run is to maybe look for the words "certification" or "certified" (case insensitive) anywhere outside the infobox, and if nothing is found, skip that page. Hopefully the remaining pages won't be as abundant, and we can fix them by hand. I personally feel it's worth the bit of effort to move the content to the body, which will involve a little copy editing. Someone took the time to add that sourced information, doesn't seem fair to deprecate the template parameter and remove the content if it can be salvaged. Thoughts? — MusikAnimal talk 23:18, 19 December 2016 (UTC)[reply]
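The check proposed above could look something like the following Ruby sketch. It is only an illustration under stated assumptions: strip_infobox and certification_in_body? are hypothetical helper names, the infobox is located by counting {{ and }} pairs, and triple-brace parameters and other wikitext subtleties are ignored.

```ruby
# Hypothetical helper: returns the wikitext with the first
# {{Infobox single ...}} call removed. Braces are counted so that nested
# templates inside the infobox do not end it early.
def strip_infobox(text)
  start = text.index(/\{\{\s*Infobox single/i)
  return text if start.nil?
  depth = 0
  i = start
  while i < text.length
    pair = text[i, 2]
    if pair == "{{"
      depth += 1
      i += 2
    elsif pair == "}}"
      depth -= 1
      i += 2
      return text[0...start] + text[i..-1] if depth.zero?
    else
      i += 1
    end
  end
  text # unbalanced braces: leave the text alone
end

# Hypothetical helper: true when "certification" or "certified" appears
# anywhere outside the infobox, case insensitively.
def certification_in_body?(text)
  strip_infobox(text).match?(/certifi(?:cation|ed)/i)
end

# Hypothetical article text, invented for this illustration.
only_infobox = "{{Infobox single\n| Name = Example\n| Certification = Gold {{small|(US)}}\n}}\n'''Example''' is a song."
in_body      = only_infobox + " It was certified gold in the US."

p certification_in_body?(only_infobox) # the bot would skip this page
p certification_in_body?(in_body)      # safe to remove the infobox field
```

As noted elsewhere in the discussion, a plain text search cannot see certifications that only appear once a template in the body is rendered, so a check like this can still miss cases.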
- Trial complete. @MusikAnimal and Ojorojo: No issues found in the trial other than that there are 600+ pages remaining. Personally I am indifferent to how to remove the remaining information. There was an in depth discussion on the template talk page that agreed this information was not worthy of being displayed. I don't really see the merit to moving it to the article then. For example, (WAYYY over dramatising to make a point) if I was to add to the article the colors of the disc I think we would all agree that even if sourced that information would be irrelevant. I'm not sure that the Certification is of that much importance, even if sourced. That being said, I do agree with MusikAnimal's notion that in general it is bad to remove properly sourced information. Personally, I think the best thing to do now is to just let these remaining cases be removed by hand. However, a consensus should be reached about whether they are just blanket deleted from the infobox, or moved to the body of the article. I'm not certain that this is the best location for the discussion though... Should probably happen on the template talk page. Thoughts? --Zackmann08 (Talk to me/What I been doing) 00:34, 20 December 2016 (UTC)[reply]
- I've just started to review some of the changes. So far, it looks good:
- many of the fields that were removed were unused (blank)
- some had unuseful information: "None", "N/A", "X's songs", etc.
- several listed "Gold", "Golden", "Platinum", etc., but no certifying association or country
- many were not also included in the body of the article (and not referenced); from the hoaxes/vandalism I've seen, I wouldn't feel comfortable re-adding unsourced info
- only two so far that were referenced, but not included in the body, which is contrary to WP:INFOBOXPURPOSE ("the purpose of an infobox: to summarize (and not supplant) key facts that appear in the article (an article should remain complete with its summary infobox ignored)"). If it is not too much problem, skipping articles with "Certification" and "cite book", "cite web", "ref name=X/" would be preferable. MusikAnimal's suggestion (looking for "certification" outside the infobox) may also flag a lot of the other examples (blank fields, unuseful information, etc.). —Ojorojo (talk) 00:52, 20 December 2016 (UTC)[reply]
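The alternative suggested in the last item, skipping pages whose Certification field carries a reference, might be approximated as below. This is a rough sketch: sourced_certification? is a hypothetical name, and it only looks at the line holding the parameter, so a reference on a following line would be missed.

```ruby
# Markers suggested in the discussion: {{cite book}}, {{cite web}}, or a <ref> tag.
REFERENCE_MARKER = /\{\{\s*cite (?:book|web)|<ref[\s>]/i

# Hypothetical helper: true when the line holding the Certification
# parameter also contains a reference marker.
def sourced_certification?(infobox_text)
  infobox_text.each_line.any? do |line|
    line.match?(/\|\s*Certification\s*=/i) && line.match?(REFERENCE_MARKER)
  end
end

# Hypothetical parameter lines, invented for this illustration.
sourced   = "| Certification = Gold<ref>{{cite web|url=http://example.org|title=Example}}</ref>\n"
unsourced = "| Certification = Gold\n| Label = Acme<ref name=x/>\n"

p sourced_certification?(sourced)
p sourced_certification?(unsourced)
```

A page for which this returns true would be left for manual review rather than edited by the bot.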
- I've looked at over 1,000 changes and about 11 were referenced, but not included in the body (I'll re-add these in the article body). Far more misuses of the Certification parameter were removed: "?", "TBA", chart position(s), producer(s) names (the preceding field), "duet of the year", record companies, "G", catalogue number, "none", "N/A", "—", etc. It doesn't seem like this should warrant a different approach. —Ojorojo (talk) 18:59, 20 December 2016 (UTC)[reply]
- @Ojorojo: You went through 1,000 pages? Bless you! :) So are you suggesting we shouldn't worry about checking if certifications are in the body, given you've done this work already? I noticed my example, Barrette (song), still has sourced content in the certification field but not in the body. I think doing this check won't be difficult to implement (I can help you Zackmann08), hence why I'm pushing for it — MusikAnimal talk 23:37, 20 December 2016 (UTC)[reply]
- I started with the ZackBot edit[5] and ended with[6] (the bot hasn't gotten to Barrette yet – the Cert field is still there). Just hovering over the diffs made it go fairly quickly. Still have a few re-adds to do. Are you thinking about the 625+ pages that remain? I haven't looked at these, just pages already edited by the bot. —Ojorojo (talk) 02:50, 21 December 2016 (UTC)[reply]
- I went through the next 500 ZackBot edits ([7] to [8]) and found one (1) certification with a reference that was removed that was not in the body of the article (several were also already in the body or rescued by AnomieBOT). I re-added it to the body of the article.[9] This doesn't seem to be an issue (less than 1% of bot edits) and can be easily corrected. Let ZackBot finish the remaining 625 pages and I'll make any necessary corrections. —Ojorojo (talk) 16:43, 21 December 2016 (UTC)[reply]
@Ojorojo: I think you underestimate how complicated it is for the bot to do what you are asking... If you are planning to just look over every single one of the remaining edits the bot would make I would suggest that you simply do the edits by hand. --Zackmann08 (Talk to me/What I been doing) 18:23, 21 December 2016 (UTC)[reply]
- @Zackmann08: I guess I'm not very good at explaining myself re: bots, etc. 1) I am not asking for anything additional that ZackBot7 has not already performed on this task so far. I am perfectly satisfied with the progress so far and see no need to change course. My last sentence above means for ZackBot7 to just continue to edit the remaining pages as it has already done on all the other pages. 2) I will review the remaining edits after they have been completed in the same manner I have reviewed the edits that have been made so far. What is wrong with double-checking to re-add references that have been removed? This seems to be a concern of MusikAnimal. The vast majority of the ZackBot7 edits are perfectly fine. Fifteen (15) minor changes to 1,500+ edits should be seen as a very high success rate. If you prefer I don't review the work or re-add the handful of references, fine, I don't have a problem with that. Running the bot as is would be much, much better than doing the rest by hand. Again, I'm not asking for anything to be changed; I was just trying to address MusikAnimal's concern. —Ojorojo (talk) 19:08, 21 December 2016 (UTC)[reply]
- Yes, if Ojorojo wants to review any final edits by ZackBot, then we don't need to worry about logic to make sure the certifications aren't repeated in the body. I get what Ojorojo is saying. Using WP:POPUPS it's easy to check the edits and quickly see if content was lost. Having to manually check the source of each page, one by one, obviously will take considerably more time. So let's move forward with running the bot on the outstanding pages. Zackmann08 I believe you need to tweak the regex to handle templates. There are some more complicated edge cases like Better the Devil You Know. You might even consider doing a supervised run, perhaps using the pry-byebug gem I told you about. Up to you! We're almost there :) Let me know when you've devised a plan, and we'll do one last final quick trial before letting it loose on the remaining 600+ pages — MusikAnimal talk 19:27, 21 December 2016 (UTC)[reply]
- (edit conflict) @Ojorojo: lol!! I think I'm the one who failed at explaining here... The issue is that what is left for the bot to address is cases where there is a template in the certification field... It is proving very challenging for the bot to grab this field without completely breaking the template. Take a look at regular expressions and you will see some of the difficulty we are facing. My only point was that it would probably take less time for you to just do the remaining pages by hand than it would take me to write and test code to do it. --Zackmann08 (Talk to me/What I been doing) 19:29, 21 December 2016 (UTC)[reply]
- "I said, 'why do you have a banana in your ear?'" "What, I can't hear you, I've got a banana in my ear!" OK, I think we're on the same page now. I'll take your word for it and finish it up myself. Thanks for your efforts & Happy Holidays!. —Ojorojo (talk) 20:06, 21 December 2016 (UTC)[reply]
- @Ojorojo and MusikAnimal: great team work folks! Pleasure working with both of you. Have a very happy Holiday season and don't hesitate to reach out for future projects! GO TEAM!!! --Zackmann08 (Talk to me/What I been doing) 18:37, 22 December 2016 (UTC)[reply]
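The difficulty described in this thread, a template inside the Certification field, comes from nesting: a plain regular expression cannot tell a pipe that starts the next parameter from a pipe inside {{...}} or [[...]]. One way around it, shown here only as a sketch and not as anything the bot actually ran, is to scan the value while tracking nesting depth. The helper name remove_param and the sample infobox are hypothetical.

```ruby
# Hypothetical helper: removes a named parameter from template wikitext.
# Pipes inside nested {{...}} or [[...]] count as part of the value.
# Simplification: the parameter name itself is located with a plain regex,
# and triple-brace parameters are not handled.
def remove_param(template, name)
  match = template.match(/\|\s*#{Regexp.escape(name)}\s*=/i)
  return template if match.nil?
  start = match.begin(0)
  i = match.end(0)
  depth = 0
  while i < template.length
    pair = template[i, 2]
    if pair == "{{" || pair == "[["
      depth += 1
      i += 2
    elsif pair == "}}" || pair == "]]"
      break if depth.zero? # closing braces of the enclosing template
      depth -= 1
      i += 2
    elsif template[i] == "|" && depth.zero?
      break                # start of the next parameter
    else
      i += 1
    end
  end
  template[0...start] + template[i..-1]
end

# Hypothetical infoboxes, invented for this illustration.
nested = "{{Infobox single\n| Name = Example\n| Certification = [[Music recording certification|Platinum]] {{small|(US)<ref>{{cite web|title=Example}}</ref>}}\n| Label = Acme\n}}"
last   = "{{Infobox single\n| Name = Example\n| Certification = Gold\n}}"

cleaned_nested = remove_param(nested, "Certification")
cleaned_last   = remove_param(last, "Certification")

puts cleaned_nested
puts cleaned_last
```

The same scan also covers the case where Certification is the last parameter, since it stops at the closing braces of the enclosing template. Whether this would have been worth writing and testing for the remaining 600+ pages is the judgment call made in the thread above.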
Approved. Per above. The remaining pages will be done by hand, so I think we're all done here. Happy holidays! — MusikAnimal talk 22:37, 21 December 2016 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.