
Provide easy script to reset Blazegraph
Closed, ResolvedPublic

Description

On migration/import of an existing Wikibase dataset, a method is required to reset the Query Service and re-import the complete RDF corpus.

Event Timeline

Restricted Application added a subscriber: Aklapper.

At Rhizome we used the following command to reset Blazegraph:

curl "http://localhost:9999/blazegraph/namespace/kb/sparql" --data-urlencode "update=DROP ALL; LOAD <file:///$path_to_ttl_file>;"

The service will start empty, so you shouldn't need to run a command to drop the data.

If you want to reload data, you can simply remove the Docker volume that contains the data and start again.
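With the wikibase-docker compose setup that would look roughly like this (a sketch; the exact volume name depends on your compose project, so check docker volume ls first):

docker-compose stop wdqs
docker volume rm dockercomposefiles_query-service-data
docker-compose up -d wdqs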

I'm not sure we really need a script to do this.

Is this kind of a duplicate of T186161?

Is this just about clearing Blazegraph of all data? Or about loading data into Blazegraph that can't be loaded by the updater / from recent changes?

Vvjjkkii renamed this task from Provide easy script to reset Blazegraph to toaaaaaaaa. (Jul 1 2018, 1:02 AM)
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from toaaaaaaaa to Provide easy script to reset Blazegraph. (Jul 2 2018, 1:35 PM)
CommunityTechBot lowered the priority of this task from High to Low.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
This comment was removed by despens.

I am looking at runUpdate.sh inside the wdqs container, which I believe is documented here.

In order to import all statements, I tried to set a start time long in the past, yielding errors:

bash-4.4# pwd
/wdqs
bash-4.4# ./runUpdate.sh -v --start 20010101120000 -t 2 --verify -W https://staging.catalog.rhizome.org/ -U http://staging.catalog.rhizome.org/ --init
./runUpdate.sh: illegal option -- v
./runUpdate.sh: illegal option -- -
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid238.log due to No such file or directory

I> No access restrictor found, access to any MBean is allowed
Jolokia: Agent started with URL http://127.0.0.1:8778/jolokia/
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
Invalid argument:  com.lexicalscope.jewel.cli.ArgumentValidationException: Option only takes one value; cannot use [20010101120000]: --sparqlUrl -u value : URL to post updates and queries.
Unexpected Option: W
Unexpected Option: U
The options available are:
	[--batchSize -b value] : Number of recent changes fetched at a time.
	[--entityNamespaces value] : If specified must be numerical indexes of Item and Property namespaces that defined in Wikibase repository, comma separated.
	[--help] : Show this message
	[--idrange value] : If specified must be <start>-<end>. Ids are iterated instead of recent changes. Start and end are inclusive.
	[--ids value...] : If specified must be <id> or list of <id>, comma or space separated.
	[--init -I] : Initialize last update time to start time
	[--keepTypes] : Preserve all types
	[--labelLanguage value...] : Only import labels, aliases, and descriptions in these languages.
	[--pollDelay -d value] : Poll delay when no updates found
	[--singleLabel value...] : Only import a single label and description using the languages specified as a fallback list. If there isn't a label in any of the specified languages then no label is imported.  Ditto for description.
	[--skipSiteLinks] : Skip site links
	--sparqlUrl -u value : URL to post updates and queries.
	[--start -s value] : Start time in 2015-02-11T17:11:08Z or 20150211170100 format.
	[--tailPoller -T value] : Use secondary poller with given gap (seconds) to catch up missed updates
	[--threadCount -t value] : Thread count
	[--verbose -v] : Verbose mode
	[--verify -V] : Verify updates (may have performance impact)
	[--wikibaseHost -w value] : Wikibase host
	[--wikibaseScheme value] : Wikidata url scheme

Does that mean I am trying to run the wrong runUpdate.sh, or is this a different version from the one documented?

Addshore added a subscriber: Smalyshev.

The docker image provides a run update script.
It can be seen at: https://github.com/wmde/wikibase-docker/blob/master/wdqs/0.3.1/runUpdate.sh
It is copied to /runUpdate.sh at: https://github.com/wmde/wikibase-docker/blob/master/wdqs/0.3.1/Dockerfile#L33

It looks like you're running /wdqs/runUpdate.sh, which is indeed the script described in the docs you linked.
@Smalyshev might be able to help you out with it more :)
I seem to remember there being something slightly confusing here.
I'll assign to @Smalyshev for now just until we get a response (feel free to unassign after)

It should be like this:

/runUpdate.sh -- -v --start 20010101120000 -t 2 --verify -W https://staging.catalog.rhizome.org/ -U http://staging.catalog.rhizome.org/ --init

Note the -- part. If you want to pass arguments to Updater directly, pass them after --. The ones that go before -- are for the script (which ultimately get translated to Updater arguments too). See https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#runUpdate.sh for more info.

Note that the script and the Updater can have the same argument names with different meanings (e.g. -t means different things), so watch where you put the args.
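In other words, the general shape is as follows (an illustration using the arguments from this thread, kept to flags listed in the Updater's help output above):

/runUpdate.sh [script options] -- [Updater options]
/runUpdate.sh -- -v --start 20010101120000 -t 2 --verify --init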

Thanks @Smalyshev! Indeed the double dashes were the issue!

I did run the command as you provided it inside the wdqs-updater container.

It doesn't seem to change the contents of Blazegraph; there is still almost no data available in WDQS, apart from a few items.

I wonder if the change data has somehow expired in my Wikibase? The output says "Found start time in the RDF store: 2018-07-08T07:05:29Z", and indeed not much has happened since that time, but everything happened before!

Would it make sense to erase Blazegraph and then start ./runUpdate.sh?

Here is the output of runUpdate.sh:

wait-for-it.sh: waiting 120 seconds for wikibase.svc:80
wait-for-it.sh: wikibase.svc:80 is available after 0 seconds
wait-for-it.sh: waiting 120 seconds for wdqs.svc:9999
wait-for-it.sh: wdqs.svc:9999 is available after 0 seconds
Updating via http://wdqs.svc:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid99.log due to No such file or directory

Could not start Jolokia agent: java.net.BindException: Address in use
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
10:02:53.443 [main] INFO  org.wikidata.query.rdf.tool.Update - Checking where we left off
10:02:53.446 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
10:02:53.663 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Found left off time from the updater
10:02:53.664 [main] INFO  org.wikidata.query.rdf.tool.Update - Found start time in the RDF store: 2018-07-08T07:05:29Z
10:02:54.014 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 1 changes, from Q1166@77841@20180708070530|81717 to Q1166@77841@20180708070530|81717
10:02:54.259 [update 0] WARN  org.wikidata.query.rdf.tool.Updater - Contained error syncing.  Giving up on Q1166
org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects:  [https://staging.catalog.rhizome.org/wiki/Special:EntityData/Q1166, https://staging.catalog.rhizome.org/entity/Q1166, https://staging.catalog.rhizome.org/entity/statement/Q1166-83bdfae9-469c-aaa2-8241-823fa96365e5, https://staging.catalog.rhizome.org/entity/statement/Q1166-A687AA4E-AC6C-4BCA-8010-35C23EF783C3, https://staging.catalog.rhizome.org/entity/statement/Q1166-511bb92d-497e-f37f-7b8b-2cce2730042a, https://staging.catalog.rhizome.org/entity/statement/Q1166-85603605-496B-4865-8F1C-FBB19E478DFF, https://staging.catalog.rhizome.org/entity/statement/Q1166-21D3661A-56E5-45AF-8DCB-019D30E5E3AC, https://staging.catalog.rhizome.org/entity/statement/Q1166-B2C4587C-9318-4717-B5C9-32A13B1CC2AC, https://staging.catalog.rhizome.org/entity/statement/Q1166-4d8aad6f-429f-270e-ca33-4d032f5f7f5b, https://staging.catalog.rhizome.org/entity/statement/Q1166-fcdfc3a7-4c2e-d1a8-d7e0-3a5d523ca48e].  Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and http://wikibase.svc/entity/
	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:833)
	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:430)
	at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:223)
	at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:305)
	at org.wikidata.query.rdf.tool.Updater.lambda$handleChanges$0(Updater.java:188)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
10:02:54.342 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-07-08T07:05:30Z at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) milliseconds per second
10:02:54.370 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
10:02:54.370 [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs
10:03:04.395 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
10:03:04.396 [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs
10:03:14.419 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
10:03:14.419 [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs
10:03:24.442 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
10:03:24.443 [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs
[...and so forth...]

This:

https://staging.catalog.rhizome.org/entity/statement/Q1166-fcdfc3a7-4c2e-d1a8-d7e0-3a5d523ca48e].  Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and http://wikibase.svc/entity/

Looks like a sign that the concept URI base is not set up correctly: the service thinks the URI base is wikibase.svc but in fact it is staging.catalog.rhizome.org. Note that the concept URI base and the server URL are two different things; the former determines what the RDF looks like, while the latter says where to go to fetch it. They can be completely different. See https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation/Standalone for more docs about it.
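To illustrate with the values from the error above:

concept URI base (shapes the subjects inside the RDF):  https://staging.catalog.rhizome.org/entity/Q1166
server URL (where the updater fetches that RDF from):   http://wikibase.svc/wiki/Special:EntityData/Q1166.ttl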

After some consultation with @Tarrow:

  1. I removed the query service volume (starting from scratch with Blazegraph).
  2. I updated all instances of wikibase.svc in docker-compose.yml to the full public domain name.
  3. I entered the wdqs-updater container and ran the updating command again.

This created the following error:

# docker exec -it dockercomposefiles_wdqs-updater_1 bash
bash-4.4# /runUpdate.sh -- -v --start 20010101120000 -t 2 --verify -W https://staging.catalog.rhizome.org/ -U http://staging.catalog.rhizome.org/ --init
wait-for-it.sh: waiting 120 seconds for staging.catalog.rhizome.org:80
wait-for-it.sh: staging.catalog.rhizome.org:80 is available after 0 seconds
wait-for-it.sh: waiting 120 seconds for wdqs.svc:9999
wait-for-it.sh: wdqs.svc:9999 is available after 0 seconds
Updating via http://wdqs.svc:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid202.log due to No such file or directory

Could not start Jolokia agent: java.net.BindException: Address in use
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
09:19:27.957 [main] INFO  org.wikidata.query.rdf.tool.Update - Checking where we left off
09:19:27.960 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
09:19:28.099 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Found left off time from the updater
09:19:28.101 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during initialization.
java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time.  You will have to reload from scratch or you might have missing data.
	at org.wikidata.query.rdf.tool.Update.buildRecentChangePollerChangeSource(Update.java:168)
	at org.wikidata.query.rdf.tool.Update.buildChangeSource(Update.java:141)
	at org.wikidata.query.rdf.tool.Update.main(Update.java:65)
Exception in thread "main" java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time.  You will have to reload from scratch or you might have missing data.
	at org.wikidata.query.rdf.tool.Update.buildRecentChangePollerChangeSource(Update.java:168)
	at org.wikidata.query.rdf.tool.Update.buildChangeSource(Update.java:141)
	at org.wikidata.query.rdf.tool.Update.main(Update.java:65)

I'm now wondering if runUpdate.sh is starting up the updater service in general, which seems to be already running:
Could not start Jolokia agent: java.net.BindException: Address in use

The exception java.lang.IllegalStateException: RDF store reports the last update time is before the minimum safe poll time. You will have to reload from scratch or you might have missing data. doesn't make sense to me... What should be reloaded? Isn't the command I issued all about ignoring the safe poll time?

Blazegraph still seems to be empty, see test query.

This error means that the timestamp stored in the database is more than 30 days old (this limit can be changed with the wikibaseMaxDaysBack property). In this case, you can do one of the following (see the example command after this list):

  • Load a dump that is reasonably recent
  • Run Updater with -s DATE --init
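For the second option, the invocation would look something like this (the date is illustrative, not from the original thread; it must lie within the wikibaseMaxDaysBack window):

/runUpdate.sh -- -s 20180801000000 --init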

I think I need more guidance on how to run this script. I did use the --start and --init switches before.

When calling ./runUpdate.sh -- -v -s 20010101120000 --init, it doesn't seem possible to connect to Blazegraph because the updater is already running?

bash-4.4# ./runUpdate.sh -- -v -s 20010101120000 --init
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid355.log due to No such file or directory

Could not start Jolokia agent: java.net.BindException: Address in use
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
18:36:01.641 [main] INFO  o.w.q.rdf.tool.options.OptionsUtils - Verbose mode activated
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
18:36:02.229 [main] DEBUG o.w.query.rdf.tool.rdf.RdfRepository - Setting last updated time to Mon Jan 01 12:00:00 GMT 2001
18:36:02.242 [main] DEBUG o.w.query.rdf.tool.rdf.RdfRepository - Running SPARQL: DELETE {
  <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
}
WHERE {
  <http://www.wikidata.org> <http://schema.org/dateModified> ?o .
};
INSERT DATA {
  <http://www.wikidata.org> <http://schema.org/dateModified> "2001-01-01T12:00:00.000Z"^^<xsd:dateTime> .
}

18:36:02.277 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 1, will retry
18:36:04.280 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 2, will retry
18:36:08.283 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 3, will retry
18:36:16.286 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 4, will retry
18:36:26.289 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 5, will retry
18:36:36.291 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 6, will fail
18:36:36.294 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during initialization.
org.wikidata.query.rdf.tool.exception.FatalException: Error updating triple store
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:732)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.updateLeftOffTime(RdfRepository.java:665)
	at org.wikidata.query.rdf.tool.Update.buildRecentChangePollerChangeSource(Update.java:156)
	at org.wikidata.query.rdf.tool.Update.buildChangeSource(Update.java:141)
	at org.wikidata.query.rdf.tool.Update.main(Update.java:65)
Caused by: com.github.rholder.retry.RetryException: Retrying failed to complete successfully after 6 attempts.
	at com.github.rholder.retry.Retryer.call(Retryer.java:174)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:721)
	... 4 common frames omitted
Caused by: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
	at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118)
	at org.eclipse.jetty.client.util.FutureResponseListener.get(FutureResponseListener.java:101)
	at org.eclipse.jetty.client.HttpRequest.send(HttpRequest.java:639)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.lambda$execute$0(RdfRepository.java:722)
	at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
	at com.github.rholder.retry.Retryer.call(Retryer.java:160)
	... 5 common frames omitted
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.eclipse.jetty.io.SelectorManager.finishConnect(SelectorManager.java:340)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processConnect(SelectorManager.java:671)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:640)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:607)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:545)
	at org.eclipse.jetty.util.thread.NonBlockingThread.run(NonBlockingThread.java:52)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.wikidata.query.rdf.tool.exception.FatalException: Error updating triple store
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:732)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.updateLeftOffTime(RdfRepository.java:665)
	at org.wikidata.query.rdf.tool.Update.buildRecentChangePollerChangeSource(Update.java:156)
	at org.wikidata.query.rdf.tool.Update.buildChangeSource(Update.java:141)
	at org.wikidata.query.rdf.tool.Update.main(Update.java:65)
Caused by: com.github.rholder.retry.RetryException: Retrying failed to complete successfully after 6 attempts.
	at com.github.rholder.retry.Retryer.call(Retryer.java:174)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.execute(RdfRepository.java:721)
	... 4 more
Caused by: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
	at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118)
	at org.eclipse.jetty.client.util.FutureResponseListener.get(FutureResponseListener.java:101)
	at org.eclipse.jetty.client.HttpRequest.send(HttpRequest.java:639)
	at org.wikidata.query.rdf.tool.rdf.RdfRepository.lambda$execute$0(RdfRepository.java:722)
	at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78)
	at com.github.rholder.retry.Retryer.call(Retryer.java:160)
	... 5 more
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.eclipse.jetty.io.SelectorManager.finishConnect(SelectorManager.java:340)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processConnect(SelectorManager.java:671)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:640)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:607)
	at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:545)
	at org.eclipse.jetty.util.thread.NonBlockingThread.run(NonBlockingThread.java:52)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
	at java.lang.Thread.run(Thread.java:748)

it doesn't seem possible to connect to Blazegraph because the updater already runs?

The errors do indicate that the connection to Blazegraph fails, most likely because either Blazegraph is not listening on localhost:9999 or something in your setup prevents the Updater from connecting there. It is completely fine to run more than one instance of the Updater (though in most cases not recommended), so that would not be the problem.
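One quick way to check whether Blazegraph is actually reachable from inside the container (an illustrative probe, not from the original thread):

curl "http://localhost:9999/bigdata/namespace/wdq/sparql" --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 1"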

Updating via http://localhost:9999/bigdata/namespace/wdq/sparql

Just a hunch, but if you are using the docker images then this is probably wrong.

I think there is lots of ambiguity here...

For the record: I'm entering the WDQS container, not the WDQS-updater one:
root@wikibase-docker:~# docker exec -ti dockercomposefiles_wdqs_1 bash

From inside that container, I'm executing the command:
bash-4.4# ./runUpdate.sh -- -w staging.catalog.rhizome.org -s 20010101000000

The connection failure seems to happen when the updater tries to connect to Rhizome's Wikibase:

bash-4.4# ./runUpdate.sh -- -w staging.catalog.rhizome.org -s 20010101000000
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid616.log due to No such file or directory

I> No access restrictor found, access to any MBean is allowed
Jolokia: Agent started with URL http://127.0.0.1:8778/jolokia/
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
10:48:42.193 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during updater run.
java.lang.RuntimeException: org.apache.http.conn.HttpHostConnectException: Connect to staging.catalog.rhizome.org:443 [staging.catalog.rhizome.org/172.18.0.5] failed: Connection refused (Connection refused)

The connection to the Wikibase is refused. It points to an internal Docker network IP address and tries to connect via HTTPS, but the Docker setup doesn't provide HTTPS by default.

However, if I'm running the updater without specifying the Wikibase host with the -w switch, it happily gets the triples of Wikidata proper into my modest Blazegraph instance:

bash-4.4# ./runUpdate.sh -- -s 20010101000000 --init
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid644.log due to No such file or directory

I> No access restrictor found, access to any MBean is allowed
Jolokia: Agent started with URL http://127.0.0.1:8778/jolokia/
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
10:55:06.656 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 79 changes, from Q43576244@725717221@20180814105503|761278916 to Q55870317@725717323@20180814105519|761279016
10:55:15.531 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-08-14T10:55:19Z (next: 20180814105519|761279017) at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) milliseconds per second
10:55:15.762 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 77 changes, from Q23647613@725717330@20180814105520|761279026 to Q39895159@725717420@20180814105536|761279115
[...]

So to me it looks like localhost:9999 isn't the problem; it's just that the source of the triples cannot be set.

When I remove the alias staging.catalog.rhizome.org from the wikibase container in the docker-compose file, the connection is made via the Docker network and succeeds. However, the contents of the Wikibase are still rejected:

bash-4.4# ./runUpdate.sh -- -w staging.catalog.rhizome.org -s 20010101000000 --init
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid125.log due to No such file or directory

I> No access restrictor found, access to any MBean is allowed
Jolokia: Agent started with URL http://127.0.0.1:8778/jolokia/
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
11:04:25.240 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 26 changes, from Q2596@77721@20180130201206|81595 to Q4977@77814@20180215170648|81689
11:04:25.896 [update 4] WARN  org.wikidata.query.rdf.tool.Updater - Contained error syncing.  Giving up on Q4966
org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects:  [https://staging.catalog.rhizome.org/entity/statement/Q4966-3e9eee06-4352-81b8-04cc-c1526542629e, https://staging.catalog.rhizome.org/entity/statement/Q4966-2beb1833-409f-0b9f-c075-571cc6b78eb0, https://staging.catalog.rhizome.org/entity/statement/Q4966-69862d13-4a1e-5770-c402-c37cf7441093, https://staging.catalog.rhizome.org/entity/statement/Q4966-9b394b4f-4c87-6aad-a016-042256e4d990, https://staging.catalog.rhizome.org/entity/Q4966, https://staging.catalog.rhizome.org/value/2b3197a343b1554b824b915aa6ffd70f].  Expected only sitelinks and subjects starting with http://staging.catalog.rhizome.org/wiki/Special:EntityData/ and http://staging.catalog.rhizome.org/entity/
	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:833)
	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:430)
	at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:223)
	at org.wikidata.query.rdf.tool.Updater.handleChange(Updater.java:305)
	at org.wikidata.query.rdf.tool.Updater.lambda$handleChanges$0(Updater.java:188)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
11:04:25.897 [update 6] WARN  org.wikidata.query.rdf.tool.Updater - Contained error syncing.  Giving up on Q4967

[...repeated for hundreds of Q-ids...]

Any idea what is going on here? Is the data in the Wikibase itself structured wrongly?

Any idea what is going on here?

Yes, you're using https://staging.catalog.rhizome.org/ in your data but set the concept URI to http://staging.catalog.rhizome.org/. Use --conceptUri to set the correct URI. See https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation/Standalone

Thank you @Smalyshev!

It seems like the --conceptUri switch is part of the munger; it is not accepted as a parameter for runUpdate.sh.

After checking this, I also modified the Wikibase's LocalSettings.php to explicitly use http (without the 's') for concept URIs:

$wgWBRepoSettings['conceptBaseUri'] = 'http://staging.catalog.rhizome.org/entity/';

The API delivers RDF with the http protocol used in all local namespaces, too: http://staging.catalog.rhizome.org/wiki/Special:EntityData/Q1996.ttl

Now, running the updater still only picks up 26 changes:

bash-4.4# ./runUpdate.sh -- -w staging.catalog.rhizome.org -s 20010101000000 --init
Updating via http://localhost:9999/bigdata/namespace/wdq/sparql
OpenJDK 64-Bit Server VM warning: Cannot open file /var/log/wdqs/wdqs-updater_jvm_gc.pid119.log due to No such file or directory

I> No access restrictor found, access to any MBean is allowed
Jolokia: Agent started with URL http://127.0.0.1:8778/jolokia/
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
05:57:09.390 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 26 changes, from Q2596@77721@20180130201206|81595 to Q4977@77814@20180215170648|81689
05:57:09.809 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-02-15T17:06:48Z (next: 20180215170711|81690) at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) milliseconds per second
05:57:09.885 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Skipping change with bogus title:  Main Page
05:57:09.887 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 9 changes, from Q4977@77815@20180215170711|81690 to Q1166@77841@20180708070530|81717
05:57:09.976 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-07-08T07:05:30Z at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) milliseconds per second
05:57:10.008 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes
05:57:10.009 [main] INFO  org.wikidata.query.rdf.tool.Updater - Sleeping for 10 secs
05:57:20.036 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got no real changes

So I guess the whole change history is not available in the Wikibase API after migrating the database from the previous install.

Would it be correct then to export the full ttl dump, load it into Blazegraph, and then run the updater again?

So I guess the whole change history is not available in the Wikibase API after migrating the database from the previous install.

It is RecentChanges that you need to look at.
The maximum age of entries in RecentChanges is configured by https://www.mediawiki.org/wiki/Manual:$wgRCMaxAge
Depending on the MW version, the default could be anywhere between 7 and 90 days. And of course this can be altered by your config.
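For example, to keep 90 days of recent changes, you would set this in LocalSettings.php (an illustrative setting, not from the original thread; the value is in seconds):

$wgRCMaxAge = 90 * 24 * 60 * 60; // keep RecentChanges entries for 90 days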

Would it be correct then to export the full ttl dump, load it into Blazegraph, and then run the updater again?

If the changes are not in RecentChanges, then yes.

We have been able to load the TTL dump into Blazegraph now with the following process (a consolidated sketch follows the list):

  1. In the wdqs container, install curl:
# apk add --no-cache curl
  2. Export the TTL dump into a directory shared between the containers; command on the docker host:
# docker exec dockercomposefiles_wikibase_1 php extensions/Wikibase/repo/maintenance/dumpRdf.php > dumps/ttl-20180917.ttl
  3. Inside the wdqs container, import the TTL file by directly instructing Blazegraph to load it:
# curl "http://localhost:9999/bigdata/namespace/wdq/sparql" --data-urlencode "update=DROP ALL; LOAD <file:///tmp/dumps/ttl-20180917.ttl>;"
  4. Queries are now possible via the query service (simple example query, specific example query).
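Putting the steps above together as a single script on the docker host (a sketch; the container names and shared dumps directory match our compose setup and will likely differ in yours, and curl must already be installed in the wdqs container as in step 1):

#!/bin/bash
# Sketch: dump all RDF from Wikibase and reload it into Blazegraph from scratch.
# Assumes ./dumps on the host is the directory mounted at /tmp/dumps in the wdqs container.
set -e
DUMP="ttl-$(date +%Y%m%d).ttl"
docker exec dockercomposefiles_wikibase_1 \
  php extensions/Wikibase/repo/maintenance/dumpRdf.php > "dumps/$DUMP"
docker exec dockercomposefiles_wdqs_1 \
  curl "http://localhost:9999/bigdata/namespace/wdq/sparql" \
  --data-urlencode "update=DROP ALL; LOAD <file:///tmp/dumps/$DUMP>;"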

Big Issue

None of the query building helpers in WDQS work. The interface doesn't know about any properties or objects.

Questions

  • What is required to make the query building functions in WDQS work?
  • What do munge.sh and loadData.sh do apart from splitting up a potentially large TTL file into smaller chunks? (Since Rhizome's data is quite small at the moment, we wouldn't really need to split it up.)

If you feel that it would benefit others, do you think you could submit some changes to the docs or add a script?

Big Issue

None of the query building helpers in WDQS work. The interface doesn't know about any properties or objects.

That sounds separate from this task?

Questions

  • What is required to make the query building functions in WDQS work?

I believe you just need to point things to the correct MediaWiki APIs.

  • What do munge.sh and loadData.sh do apart from splitting up a potentially large TTL file into smaller chunks? (Since Rhizome's data is quite small at the moment, we wouldn't really need to split it up.)

munge does things with triples.
What exactly it does I can't say offhand; I would have to dig through the Java code.

I believe munge.sh applies the WDQS data differences documented on the RDF Dump Format page (e.g. merging wdata: and wd:).

I believe munge.sh applies the WDQS data differences documented on the RDF Dump Format page (e.g. merging wdata: and wd:).

That's a great list, I was unaware of those docs.

I believe munge.sh applies the WDQS data differences documented on the RDF Dump Format page (e.g. merging wdata: and wd:).

That is correct. It also historically cleaned up some bad data coming from Wikidata (e.g. broken dates) due to bugs, etc., but I'm not sure how much of that still applies.
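For reference, a typical invocation of the pair looks something like this (a sketch using the dump path from earlier in this thread; verify the flags against each script's --help for your version):

./munge.sh -f /tmp/dumps/ttl-20180917.ttl -d /tmp/munged
./loadData.sh -n wdq -d /tmp/munged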

Relevant to this ticket.
I just wrote a blog post walking through the process of changing the concept URI of a Wikibase and reloading data into a fresh query service.
https://addshore.com/2019/11/changing-the-concept-uri-of-an-existing-wikibase-with-data/

@despens do you see any more actionables here?

Addshore claimed this task.

I think so, as the script to reset the timestamp exists and the process for reloading data is documented.
I don't think we would be able to make things easier than that right now.